SynSetExpan: An Iterative Framework for Joint Entity Set Expansion and Synonym Discovery

Entity set expansion and synonym discovery are two critical NLP tasks. Previous studies accomplish them separately, without exploring their interdependencies. In this work, we hypothesize that these two tasks are tightly coupled, because two synonymous entities tend to have similar likelihoods of belonging to various semantic classes. This motivates us to design SynSetExpan, a novel framework that enables the two tasks to mutually enhance each other. SynSetExpan uses a synonym discovery model to include popular entities' infrequent synonyms in the set, which boosts set expansion recall. Meanwhile, the set expansion model, which can determine whether an entity belongs to a semantic class, generates pseudo training data to fine-tune the synonym discovery model toward better accuracy. To facilitate research on the interplay of these two tasks, we create the first large-scale Synonym-Enhanced Set Expansion (SE2) dataset via crowdsourcing. Extensive experiments on the SE2 dataset and previous benchmarks demonstrate the effectiveness of SynSetExpan for both entity set expansion and synonym discovery.


Introduction
Entity set expansion (ESE) aims to expand a small set of seed entities (e.g., {"United States", "Canada"}) into a larger set of entities that belong to the same semantic class (i.e., Country). Entity synonym discovery (ESD) intends to group all terms in a vocabulary that refer to the same real-world entity (e.g., "America" and "USA" refer to the same country) into a synonym set (hence called a synset). Those discovered entities and synsets encode rich knowledge and can benefit many downstream applications such as semantic search (Xiong et al.; Shen et al., 2018a) and online education (Yu et al., 2019a).

[Figure 1: The SynSetExpan framework, illustrated with the US_States class (e.g., "Illinois", "IL", "Land of Lincoln").]

Previous studies regard ESE and ESD as two independent tasks. Many ESE methods (Mamou et al., 2018b; Yan et al., 2019; Huang et al., 2020; Zhang et al., 2020; Zhu et al., 2020) iteratively select and add the most confident entities into the set. A core challenge for ESE is to find infrequent long-tail entities in the target semantic class (e.g., "Lone Star State" in the class US_States) while filtering out false positive entities from other related classes (e.g., "Austin" and "Dallas" in the class City), as they cause semantic shift in the set. Meanwhile, various ESD methods (Qu et al., 2017; Ustalov et al., 2017a; Wang et al., 2019; Shen et al., 2019) combine string-level features with embedding features to find a query term's synonyms from a given vocabulary or to cluster all vocabulary terms into synsets. A major challenge here is to combine those features, with limited supervision, in a way that works for entities from all semantic classes. Another challenge is how to scale an ESD method to a large, extensive vocabulary that contains terms of varying quality.
To address the above challenges, we hypothesize that ESE and ESD are two tightly coupled tasks that can mutually enhance each other, because two synonymous entities tend to have similar likelihoods of belonging to various semantic classes and vice versa. This hypothesis implies that (1) knowing the class membership of one entity enables us to infer the class membership of all its synonyms, and (2) two entities can be synonyms only if they belong to the same semantic class. For example, we may expand the US_States class from a seed set {"Illinois", "Texas", "California"}. An ESE model can find frequent state full names (e.g., "Wisconsin", "Connecticut") but may miss infrequent entities (e.g., "Lone Star State" and "Golden State"). However, an ESD model may predict that "Lone Star State" is a synonym of "Texas" and "Golden State" is a synonym of "California" and directly add them into the expanded set, which shows that synonym information helps set expansion. Meanwhile, from the ESE model outputs, we may infer that ("Wisconsin", "WI") is a synonymous pair while ("Connecticut", "SC") is not, and use them to fine-tune an ESD model on the fly. This relieves the burden of using one single ESD model for all semantic classes and improves the ESD model's inference efficiency, because we refine the synonym search space from the entire vocabulary to only the ESE model outputs.
In this study, we propose SynSetExpan, a novel framework that jointly conducts the two tasks (c.f. Fig. 1). To better leverage the limited supervision signals in seeds, we design SynSetExpan as an iterative framework consisting of two components: (1) an ESE model that ranks entities based on their probabilities of belonging to the target semantic class, and (2) an ESD model that returns the probability that two entities are synonyms. In each iteration, we first apply the ESE model to obtain an entity rank list, from which we derive a set of pseudo training data to fine-tune the ESD model. Then, we use this fine-tuned model to find synonyms of entities in the currently expanded set and adjust the above rank list. Finally, we add top-ranked entities in the adjusted rank list into the currently expanded set and start the next iteration. After the iterative process ends, we construct a synonym graph from the last iteration's output and extract entity synsets (including singleton synsets) as graph communities.
As previous ESE datasets are too small and contain no synonym information for evaluating our hypothesis, we create the first Synonym Enhanced Set Expansion (SE2) benchmark dataset via crowdsourcing. This new dataset (available at http://bit.ly/SE2-dataset) is one order of magnitude larger than previous benchmarks. It contains a corpus of the entire Wikipedia, a vocabulary of 1.5 million terms, and 1200 seed queries from 60 semantic classes of 6 different types (e.g., Person, Location, Organization, etc.).

Contributions. In summary, this study makes the following contributions: (1) we hypothesize that ESE and ESD can mutually enhance each other and propose a novel framework, SynSetExpan, to jointly conduct the two tasks; (2) we construct a new large-scale dataset, SE2, that supports fair comparison across different methods and facilitates future research on both tasks; and (3) we conduct extensive experiments to verify our hypothesis and show the effectiveness of SynSetExpan on both tasks.

Problem Formulation
We first introduce important concepts in this work and then present our problem formulation. A term is a string (i.e., a word or a phrase) that refers to a real-world entity. An entity synset is a set of terms that can be used to refer to the same real-world entity. For example, both "USA" and "America" can refer to the entity United States and thus compose an entity synset. We allow singleton synsets, and a term may appear in multiple synsets due to its ambiguity. A semantic class is a set of entities that share a common characteristic, and a vocabulary is a term list that can be either provided by users or derived from a corpus.

Problem Formulation. Given (1) a text corpus D, (2) a vocabulary V derived from D, and (3) a seed set S_0 of user-provided entity synsets that belong to the same semantic class C, we aim to (1) select a subset of entities V_C from V that all belong to C, and (2) cluster all terms in V_C into entity synsets S_{V_C}, where the union of all clusters is equal to V_C. In other words, we expand the seed set S_0 into a more complete set of entity synsets S_0 ∪ S_{V_C} that belong to the same semantic class C. A concrete example is presented in Fig. 1.
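The formulation can be made concrete with a small illustration in Python. The entity names follow the paper's US_States running example; the data structures themselves are our own illustration, not part of the proposed method:

```python
# Seed set S0: user-provided entity synsets from the target class (US_States).
S0 = [{"Illinois", "IL"}, {"Texas", "TX"}]

# Desired output S_VC: additional synsets of class entities drawn from the
# vocabulary V. Singleton synsets are allowed.
S_VC = [{"California", "CA", "Golden State"}, {"Wisconsin", "WI"}, {"Connecticut"}]

# The expanded result is the union S0 ∪ S_VC: a more complete set of synsets
# that all belong to the same semantic class C.
expanded = S0 + S_VC
assert all(len(s) >= 1 for s in expanded)  # singleton synsets permitted
```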

The SynSetExpan Framework
In this study, we hypothesize that entity set expansion and synonym discovery are two tightly coupled tasks and can mutually enhance each other.
Hypothesis 1. Two synonymous entities tend to have similar likelihoods of belonging to various semantic classes and vice versa.
The above hypothesis has two implications. First, if two entities e_i and e_j are synonyms and e_i belongs to semantic class C, then e_j likely also belongs to class C, even if it is currently outside C. This reveals how synonym information can help set expansion: by directly introducing popular entities' infrequent synonyms into the set and thus increasing the expansion recall. The second implication is that if two entities are not from the same class C, then they are likely not synonyms. This shows how set expansion can help synonym discovery: by restricting the synonym search space to set expansion outputs and generating additional training data to fine-tune the synonym discovery model. At the beginning, when we only have limited seed information, this hypothesis may not be directly applicable, as we do not have complete knowledge of either entity class memberships or entity synonyms. Therefore, we design SynSetExpan as an iterative framework, shown in Fig. 2.

(In this work, we use "term" and "entity" interchangeably.)

Figure 2: Overview of one iteration in our proposed SynSetExpan framework. Starting from the current set E, we first run a set expansion model to obtain an entity rank list L_se, based on which we generate pseudo training data D_pl to fine-tune a generic synonym discovery model M_0. We then apply this fine-tuned model to get a new rank list L_sy, merge it with L_se to obtain the final entity rank list, and add top-ranked entities into the current set E.
Framework Overview. Before the iterative process starts, we first learn a general synonym discovery model M_0 using distant supervision from a knowledge base (c.f. Sect. 3.1). Then, in each iteration, we learn a set expansion model based on the currently expanded set E (initialized as all entities in the user-provided seed synsets S_0) and apply it to obtain a rank list of entities in V, denoted as L_se (c.f. Sect. 3.2). Next, we generate pseudo training data from L_se and use it to construct a new class-dependent synonym discovery model M_c by fine-tuning M_0. After that, for each entity in V, we apply M_c to predict its probability of being the synonym of at least one entity in E and use such synonym information to adjust L_se (c.f. Sect. 3.3). Finally, we add top-ranked entities in the adjusted rank list into the current set and start the next iteration. After the iterative process ends, we identify entity synsets from the final iteration's output using a graph-based clustering method (c.f. Sect. 3.4).
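The overview above can be summarized in Python-style pseudocode; all helper names here are hypothetical stand-ins for the components described in Sects. 3.1-3.4:

```
def synsetexpan(S0, V, KB, max_iter, Z):
    M0 = train_generic_esd(KB, V)          # Sect. 3.1: distant supervision
    E = union(S0)                          # currently expanded set
    for it in range(max_iter):
        L_se = ese_rank(E, V)              # Sect. 3.2: ensemble set expansion
        D_pl = pseudo_labels(L_se, M0)     # Sect. 3.3: pseudo training data
        M_c = fine_tune(M0, D_pl)          # class-dependent ESD model
        L_adj = adjust_rank(L_se, M_c, E)  # combine ESE and synonym scores
        E = E | top(L_adj, Z // max_iter)  # grow the current set
    G = synonym_graph(E, M_c)              # Sect. 3.4: weighted synonym graph
    return louvain(G)                      # entity synsets as graph communities
```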

Proposed Synonym Discovery Model
Given a pair of entities, our synonym discovery model returns the probability that they are synonymous. We use two types of features for entity pairs: (1) lexical features based on entity surface names (e.g., Jaro-Winkler similarity (Wang et al., 2019), token edit distance (Fei et al., 2019), etc.), and (2) semantic features based on entity embeddings (e.g., cosine similarity between two entities' SkipGram embeddings). As these feature values have different scales, we use a tree-based boosting model, XGBoost (Chen and Guestrin, 2016), to predict whether two entities are synonyms. Another advantage of XGBoost is that it is an additive model that supports incremental fine-tuning. We discuss how to use set expansion results to fine-tune the synonym discovery model in Sect. 3.2.
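As a concrete illustration, the two feature families can be sketched as follows. These are simplified, dependency-free stand-ins (token-level Levenshtein distance, an exact-match flag, and cosine similarity over a single embedding), not the paper's full feature set:

```python
import math

def token_edit_distance(a, b):
    """Levenshtein distance over whitespace tokens (a lexical feature)."""
    ta, tb = a.split(), b.split()
    dp = list(range(len(tb) + 1))
    for i, x in enumerate(ta, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(tb, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def cosine(u, v):
    """Cosine similarity between two embedding vectors (a semantic feature)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pair_features(e1, e2, emb):
    # Feature vector for one entity pair: lexical + semantic signals,
    # on deliberately different scales (hence the tree-based classifier).
    return [
        token_edit_distance(e1, e2),
        float(e1.lower() == e2.lower()),
        cosine(emb[e1], emb[e2]),
    ]
```

An XGBoost classifier would then be trained on such vectors; we omit the model itself since it is standard supervised training.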
To learn the synonym discovery model, we first acquire distant supervision data by matching each term in the vocabulary V with the canonical name of one entity (with its unique ID) in a knowledge base (KB). If two terms are matched to the same entity in the KB and their embedding similarity is larger than 0.5, we treat them as synonyms. To generate non-synonymous term pairs, we follow the "mixture" sampling strategy proposed in (Shen et al., 2019): 50% of negative pairs come from random sampling, and the other 50% are "hard" negatives that are required to share at least one token. Some concrete examples are shown in Fig. 2. Finally, based on the generated distant supervision data, we train our XGBoost-based synonym discovery model using binary cross entropy loss.
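The mixture sampling strategy can be sketched as below. This is a simplified re-implementation for illustration; function and variable names are our own:

```python
import random

def mixture_negatives(pos_pairs, vocab, n, seed=0):
    """Sample n non-synonym pairs: half random, half 'hard' negatives that
    share at least one token, following the mixture strategy of Shen et al. (2019)."""
    rng = random.Random(seed)
    pos = set(map(frozenset, pos_pairs))
    randoms, hards = [], []
    attempts = 0
    while len(randoms) + len(hards) < n and attempts < 10000:
        attempts += 1
        a, b = rng.sample(vocab, 2)
        if frozenset((a, b)) in pos:       # never emit a known synonym pair
            continue
        share = set(a.split()) & set(b.split())
        if share and len(hards) < n // 2:          # "hard" negative
            hards.append((a, b))
        elif not share and len(randoms) < n - n // 2:  # random negative
            randoms.append((a, b))
    return randoms + hards
```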

Proposed Set Expansion Model
Given a set of seed entities E_0 from a semantic class C, we aim to learn a set expansion model that can predict the probability of a new entity (term) e_i ∈ V belonging to the same class C, i.e., P(e_i ∈ C). We follow previous studies (Melamud et al., 2016; Mamou et al., 2018a) and represent each entity using a set of 6 embeddings learned on the given corpus D, including SkipGram and CBOW in word2vec (Mikolov et al., 2013), fastText (Bojanowski et al., 2016), SetExpander (Mamou et al., 2018b), JoSE (Meng et al., 2019), and averaged BERT contextualized embeddings (Devlin et al., 2019). Given the bag-of-embedding representation [f_1(e_i), f_2(e_i), ..., f_B(e_i)] of entity e_i and the seed set E_0, we define the entity feature by comparing e_i's embeddings with the seed entities' embeddings, where the similarity between two entities under one embedding is the cosine similarity between the two embedding vectors. One challenge of learning the set expansion model is the lack of supervision signals: we only have a few "positive" examples (i.e., entities belonging to the target class) and no "negative" examples. To address this challenge, we observe that the size of the target class is usually much smaller than the vocabulary size, which means that if we randomly select one entity from the vocabulary, it most likely will not belong to the target semantic class. Therefore, we can construct a set of |E_0| × K negative examples by random sampling. We also tested selecting only entities that have a low embedding similarity with the entities in the current set; however, our experiments show this restricted sampling does not improve performance. Therefore, we choose the simple yet effective random sampling approach and refer to K as the "negative sampling ratio". Given a total of |E_0| × (K + 1) examples, we learn an SVM classifier g(·) based on the above entity features.
To further improve set expansion quality, we repeat the above process T times (i.e., randomly sample T different sets of |E_0| × K negative examples to learn T separate classifiers {g_t(·)}_{t=1}^{T}) and construct an ensemble classifier. The final classifier predicts the probability of an entity e_i belonging to the class C by averaging all individual classifiers' outputs, i.e., P(e_i ∈ C) = (1/T) Σ_{t=1}^{T} g_t(e_i). Finally, we rank all entities in the vocabulary based on their predicted probabilities.

Algorithm 1: SynSetExpan Framework.
Input: A seed set S_0, a vocabulary V, a knowledge base K, a maximum iteration number max_iter, a maximum size of the expanded set Z, and model hyper-parameters {K, T, N, H}.
Output: A complete set of entity synsets S_{V_C}.
1  Learn a general ESD model M_0 using distant supervision in K;
2  E ← union of all synsets in S_0;
3  for iter from 1 to max_iter do
4      Learn an ensemble ESE model based on E and obtain an entity rank list L_se;
5      Generate pseudo training data D_pl from L_se;
6      Fine-tune M_0 on D_pl to obtain a class-dependent ESD model M_c;
7      Use M_c to compute synonym scores and adjust the rank list L_se;
8      Add top Z/max_iter entities in the adjusted rank list into E;
9  Construct a synonym graph G based on the final set E;
10 S_{V_C} ← Louvain(G);
11 Return S_{V_C}.

Two Models' Mutual Enhancements
Set Expansion Enhanced Synonym Discovery.
In each iteration, we generate a set of pseudo training data D_pl from the ESE model output L_se to fine-tune the general synonym discovery model M_0. Specifically, we add an entity pair (e_x, e_y) into D_pl with label 1 if both entities are among the top 100 entities in L_se and M_0(e_x, e_y) ≥ 0.9. For each positive pair (e_x, e_y), we generate N negative pairs by randomly selecting N/2 entities from L_se whose set expansion output probabilities are less than 0.5 and pairing them with both e_x and e_y. The intuition is that those randomly selected entities likely come from semantic classes different from those of e_x and e_y, and thus, based on our hypothesis, they are unlikely to be synonyms. After obtaining D_pl, we fine-tune the model M_0 by fitting H additional trees on D_pl and incorporating them into the existing bag of trees in M_0. We discuss the detailed choices of N and H in the experiments.

Synonym Enhanced Set Expansion. Given a fine-tuned class-specific synonym discovery model M_c and the current set E, we calculate a new score for each entity e_i ∈ V as follows:

    sy-score(e_i) = max{M_c(e_i, e_j) | e_j ∈ E}.    (1)

The above score measures the probability that e_i is the synonym of at least one entity in E. Based on Hypothesis 1, we know that an entity with a large sy-score likely belongs to the target class. Therefore, we use a multiplicative measure to combine this sy-score with the set expansion model's original output P(e_i ∈ C) as follows:

    final-score(e_i) = P(e_i ∈ C) × sy-score(e_i).    (2)

An entity will have a large sy-score as long as it is the synonym of one single entity in E. Such a property is particularly important for capturing long-tail infrequent entities. For example, suppose we expand the US_States class from a seed set {"Illinois", "IL", "Texas", "TX"}. The original set expansion model, biased toward popular entities, assigns a low score of 0.57 to "Lone Star State" and a large score of 0.78 to "Chicago".
However, the synonym discovery model predicts, with over 99% probability, that "Lone Star State" is the synonym of "Texas", and it thus has a sy-score of 0.99. Meanwhile, "Chicago" has no synonym in the seed set and thus has a low sy-score of 0.01. As a result, the final score of "Lone Star State" is larger than that of "Chicago". Moreover, we emphasize that Eq. 2 uses synonym scores to enhance, not replace, set expansion scores. A correct entity e that has no synonym in the current set E will indeed be ranked after other correct entities that have synonyms in E. However, this is not problematic because (1) all compared entities are correct, and (2) we will not remove e from the final results, because it still outscores other erroneous entities that have the same low sy-score as e but much lower set expansion scores.
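Eqs. (1) and (2) and the running example above can be reproduced directly; the probability tables below are toy stand-ins holding the paper's example numbers:

```python
# Current set E and toy model outputs from the US_States example.
E = {"Illinois", "IL", "Texas", "TX"}
ese_p = {"Lone Star State": 0.57, "Chicago": 0.78}  # P(e ∈ C) from the ESE model
syn_p = {("Lone Star State", "Texas"): 0.99}        # M_c outputs; default 0.01 below

def esd_prob(a, b):
    return syn_p.get((a, b), syn_p.get((b, a), 0.01))

def sy_score(e):
    # Eq. (1): max synonym probability against any entity in E
    return max(esd_prob(e, x) for x in E)

def final_score(e):
    # Eq. (2): multiplicative combination of the two scores
    return ese_p[e] * sy_score(e)
```

With these numbers, "Lone Star State" outscores "Chicago" (0.57 × 0.99 vs. 0.78 × 0.01).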

Synonym Set Construction
After the iterative process ends, we have a synonym discovery model M c that predicts whether two entities are synonymous and an entity list E that includes entities from the same semantic class.
To further derive entity synsets, we first construct a weighted synonym graph G where each node n_i represents one entity e_i ∈ E and each edge (n_i, n_j) with weight w_ij indicates M_c(e_i, e_j) = w_ij. Then, we apply the non-overlapping community detection algorithm Louvain (Blondel et al., 2008) to find all clusters in G and treat them as entity synsets. Note that here we narrow the original full vocabulary V to the set expansion model's final output E based on our hypothesis. We summarize the whole framework in Algorithm 1 and discuss its computational complexity in the supplementary materials.
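The synset construction step can be sketched as follows. For a dependency-free illustration, we threshold edge weights and take connected components; the paper itself applies Louvain community detection on the weighted graph:

```python
def extract_synsets(entities, esd_prob, threshold=0.5):
    """Build a synonym graph over the final set E and return its clusters.
    Thresholding + connected components is a simplified stand-in for Louvain."""
    adj = {e: set() for e in entities}
    ents = list(entities)
    for i, a in enumerate(ents):
        for b in ents[i + 1:]:
            if esd_prob(a, b) >= threshold:   # keep confident synonym edges
                adj[a].add(b)
                adj[b].add(a)
    seen, synsets = set(), []
    for e in ents:
        if e in seen:
            continue
        stack, comp = [e], set()              # DFS over the thresholded graph
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        synsets.append(comp)                  # singleton synsets fall out naturally
    return synsets
```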

The SE2 Dataset
To verify our hypothesis and evaluate the SynSetExpan framework, we need a dataset that contains a corpus, a vocabulary with labeled synsets, a set of complete semantic classes, and a list of seed queries. However, to the best of our knowledge, there is no such public benchmark. Therefore, we build the first Synonym Enhanced Set Expansion (SE2) benchmark dataset in this study.

Dataset Construction
We construct the SE2 dataset in four steps.
1. Corpus and Vocabulary Selection. We use the Wikipedia 20171201 dump as our evaluation corpus, as it contains a diverse set of semantic classes and enough context information for methods to discover those sets. We extract all noun phrases with frequency above 10 as our vocabulary.
2. Semantic Class Selection. We identify 60 major semantic classes based on the DBpedia-Entity v2 (Hasibi et al., 2017) and WikiTable (Bhagavatula et al., 2015) entities found in our corpus. These 60 classes cover 6 different entity types (e.g., Person, Location, Organization). As the generated classes may miss some correct entities, we enlarge each class via crowdsourcing in the following step.

3. Query Generation and Class Enrichment.
We first generate 20 queries for each semantic class. Then, we aggregate the top 100 results from all baseline methods (c.f. Sect. 5) and obtain 17,400 (class, entity) pairs. Next, we employ crowdworkers on Amazon Mechanical Turk to check all those pairs. Workers are asked to view one semantic class and six candidate entities and to select all entities that belong to the given class. On average, workers spend 40 seconds on each task and are paid $0.1. All (class, entity) pairs are labeled by three workers independently, and the inter-annotator agreement is 0.8204, measured by Fleiss's Kappa (κ). Finally, we enrich each semantic class C_j by adding each entity e_i whose corresponding pair (C_j, e_i) is labeled "True" by at least two workers.

4. Synonym Set Curation. To construct synsets in each class, we first run all baseline methods to generate a candidate pool of possible synonymous term pairs. Then, we treat those pairs with both terms mapped to the same entity in WikiData as positive pairs and ask two human annotators to label the remaining 7,625 pairs. The inter-annotator agreement is 0.8431, measured by Fleiss's Kappa. Then, we construct a synonym graph where each node is a term and each edge connects two synonymous terms. Finally, we extract all connected components in this graph and treat them as synsets.

Dataset Analysis
We analyze some properties of SE2 dataset from the following three aspects.
1. Semantic class size. The 60 semantic classes in our SE2 dataset contain on average 145 entities (with a minimum of 16 and a maximum of 864), for a total of 8,697 entities. After grouping these entities into synonym sets, the 60 classes contain on average 118 synsets (with a minimum of 14 and a maximum of 800), for a total of 7,090 synsets. The average synset size is 1.258, and the maximum size of one synset is 11.

2. Set expansion difficulty of each class. We define the set expansion difficulty of each semantic class C as follows:

    difficulty(C) = |{e ∈ C : e ∉ ⋃_{e′ ∈ C\{e}} Topk(e′)}| / |C|,

where Topk(e) represents the set of the k most similar entities to entity e in the vocabulary. We set k to 10,000 in this study. Intuitively, this metric calculates the average portion of entities in class C that cannot be easily found by another entity in the same class. As shown in Table 2, the most difficult classes are the Location classes and the easiest ones are the Facility classes.
3. Synonym discovery difficulty of each class. We further measure the difficulty of finding synonym pairs in each class. Specifically, we calculate two metrics: (1) lexical difficulty, defined as the average Jaro-Winkler distance between the surface names of two synonyms, and (2) semantic difficulty, defined as the average cosine distance between two synonymous entities' embeddings. Table 2 lists the results. We find that Product classes have the largest lexical difficulty and Location classes have the largest semantic difficulty.
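Both difficulty metrics are simple averages over labeled synonym pairs. A sketch of the lexical one, with a from-scratch Jaro-Winkler implementation (standard prefix weight p = 0.1), is:

```python
def jaro_winkler_distance(s1, s2, p=0.1):
    """Jaro-Winkler distance: 0 for identical strings, up to 1 for disjoint ones."""
    if s1 == s2:
        return 0.0
    win = max(len(s1), len(s2)) // 2 - 1       # matching window
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - win), min(len(s2), i + win + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 1.0
    t, k = 0, 0                                # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            t += s1[i] != s2[k]
            k += 1
    t //= 2
    jaro = (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3
    l = 0                                      # common prefix, capped at 4
    while l < min(4, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    return 1.0 - (jaro + l * p * (1 - jaro))

def lexical_difficulty(syn_pairs):
    """Average Jaro-Winkler distance over a class's labeled synonym pairs."""
    return sum(jaro_winkler_distance(a, b) for a, b in syn_pairs) / len(syn_pairs)
```

The semantic difficulty is computed the same way, with cosine distance between embedding vectors in place of the string distance.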

Entity Set Expansion
Datasets. We evaluate SynSetExpan on three public datasets. The first two are benchmark datasets widely used in previous studies (Shen et al., 2017; Yan et al., 2019; Zhang et al., 2020): (1) Wiki, which contains 8 semantic classes, 40 seed queries, and a subset of English Wikipedia articles, and (2) APR, which includes 3 semantic classes, 15 seed queries, and all news articles published by Associated Press and Reuters in 2015. Note that these two datasets do not contain synonym information and are used primarily to evaluate our set expansion model. We decided not to augment these two datasets with additional synonym information (as we did in our SE2 dataset) in order to keep the integrity of the two existing benchmarks. The third dataset is our proposed SE2, which has 60 semantic classes, 1200 seed queries, and a corpus of 1.9 billion tokens. Clearly, our SE2 is one order of magnitude larger than previous benchmarks and covers a wider range of semantic classes.

Compared Methods. … sets that are closely related to the target set and leverages them to guide the expansion process. (7) CGExpan (Zhang et al., 2020): the current state-of-the-art method, which generates the target set name by querying a pre-trained language model and utilizes the generated names to expand the set. (8) SynSetExpan: our proposed framework, which jointly conducts the two tasks and enables synonym information to help set expansion. (9) SynSetExpan-NoSYN: a variant of our proposed SynSetExpan framework without the synonym discovery model. All implementation details and hyper-parameter choices are discussed in supplementary materials Section F.
Evaluation Metrics. We follow previous studies and evaluate our results using Mean Average Precision at different top-K positions:

    MAP@K = (1/|Q|) Σ_{q∈Q} AP_K(L_q, S_q),

where Q is the set of all seed queries and, for each query q, AP_K(L_q, S_q) denotes the traditional average precision at position K, given a ranked list of entities L_q and a ground-truth set S_q. To compare the performance of multiple models, we conduct statistical significance tests using the two-tailed paired t-test with a 99% confidence level.
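Under one common convention (normalizing AP@K by min(|S_q|, K); the normalizer is an assumption, as the text only says "traditional average precision"), the metric can be computed as:

```python
def average_precision_at_k(ranked, gold, k):
    """AP@K: precision averaged over the ranks (up to K) where a gold entity appears."""
    hits, score = 0, 0.0
    for i, e in enumerate(ranked[:k], 1):
        if e in gold:
            hits += 1
            score += hits / i
    return score / min(len(gold), k) if gold else 0.0

def map_at_k(queries, k):
    """MAP@K: mean of AP@K over all seed queries, given as (L_q, S_q) pairs."""
    return sum(average_precision_at_k(L, S, k) for L, S in queries) / len(queries)
```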
Experimental Results. We analyze the set expansion performance from the following aspects.

1. Overall Performance. Table 3 presents the overall set expansion results. We can see that SynSetExpan-NoSYN achieves performance comparable to the current state-of-the-art methods on the Wiki and APR datasets (we believe both CGExpan and our method have reached the performance limit on Wiki and APR, as both datasets are relatively small and contain only a few coarse-grained classes) and outperforms previous methods on the SE2 dataset, which demonstrates the effectiveness of our set expansion model alone. Besides, comparing SynSetExpan-NoSYN with SynSetExpan on the SE2 dataset shows that adding synonym information indeed helps set expansion.

2. Fine-grained Performance Analysis. To provide a detailed analysis of how SynSetExpan improves over SynSetExpan-NoSYN, we group semantic classes based on their types and calculate the ratio of classes on which SynSetExpan outperforms SynSetExpan-NoSYN. Table 4 shows the results: on most classes, SynSetExpan is better than SynSetExpan-NoSYN, especially on the MAP@50 metric. In Table 5, we further analyze the ratio of seed queries (out of 1200 in total) on which one method achieves performance better than or equal to the other. We can see that SynSetExpan wins on the majority of queries, which further shows that SynSetExpan can effectively leverage synonym information to enhance set expansion.
3. Case Studies. Figure 3 shows some semantic classes expanded by SynSetExpan. We can see that the set expansion task benefits substantially from synonym information. Take the semantic class NBA_Teams for example: we find "L.A. Lakers" (i.e., the synonym of "Los Angeles Lakers") as well as "St. Louis Hawks" (i.e., the former name of "Atlanta Hawks") and further use them to improve the set expansion result. Moreover, by introducing synonyms, we can lower the rank of erroneous entities (e.g., "LA Dodgers" and "NBA coach").

Synonym Discovery
Datasets. We evaluate SynSetExpan on the synonym discovery task on two datasets, including (1) SE2, which contains 60,186 synonym pairs (…).

Compared Methods. We compare the following synonym discovery methods: (1) SVM: a classification method trained on given term-pair features; we use the same feature set described in Sect. 3.1.
(2) XGBoost (…); SynSetExpan-NoFT: a variant of SynSetExpan without model fine-tuning. More implementation details and hyper-parameter choices are discussed in supplementary materials Section G.

Evaluation Metrics. As all compared methods output the probability that two input terms are synonyms, we first use two threshold-free metrics for evaluation: Average Precision (AP) and Area Under the ROC Curve (AUC). Second, we transform the output probability into a binary decision using a threshold of 0.5 and evaluate model performance using the standard F1 score.
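These metrics can be computed without external dependencies: AUC via the Mann-Whitney rank formulation, and F1 at the fixed 0.5 threshold:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
    the probability a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(scores, labels, thr=0.5):
    """Standard F1 after binarizing probabilities at the given threshold."""
    preds = [int(s >= thr) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```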
Experimental Results. Table 6 shows the overall synonym discovery results. First, we can see that the SynSetExpan-NoFT model significantly outperforms both the XGB-S and XGB-E methods, which shows the importance of using both types of features for predicting synonyms. Second, we find that SynSetExpan further improves over SynSetExpan-NoFT via model fine-tuning, which demonstrates that set expansion can help synonym discovery. Finally, we notice that our SynSetExpan framework, with the fine-tuning mechanism enabled, achieves the best performance across all evaluation metrics. In Figure 4, we show some synsets discovered by SynSetExpan. We can see that SynSetExpan is able to detect different types of entity synsets across various semantic classes. Furthermore, we highlight the entities discovered only after model fine-tuning; clearly, with fine-tuning, our SynSetExpan framework can detect more accurate synsets.

Related Work
Entity Set Expansion. Entity set expansion can benefit many downstream applications such as question answering (Wang and Cohen, 2008), literature search (Shen et al., 2018b), and online education (Yu et al., 2019a). Traditional entity set expansion systems such as GoogleSet (Tong and Dean, 2008) and SEAL (Wang and Cohen, 2007) require seed-oriented online data extraction, which can be time-consuming and costly. Thus, more recent studies (Shen et al., 2017; Mamou et al., 2018b; Yu et al., 2019c; Huang et al., 2020; Zhang et al., 2020) propose to expand the seed set by processing a given corpus offline. These corpus-based methods follow two general approaches: (1) one-time entity ranking (Pantel et al., 2009; He and Xin, 2011; Mamou et al., 2018b; Kushilevitz et al., 2020), which calculates all candidate entities' distributional similarities with seed entities and makes a one-time ranking without back-and-forth refinement, and (2) iterative bootstrapping (Rong et al., 2016; Shen et al., 2017; Huang et al., 2020; Zhang et al., 2020), which starts from seed entities to extract quality textual patterns, applies the extracted patterns to obtain more quality entities, and iterates this process until sufficient entities are discovered. In this work, beyond just adding entities into the set, we go one step further and organize the expanded entities into synonym sets. Furthermore, we show that those detected synonym sets can in turn help improve set expansion results.

Synonym Discovery. Early efforts on synonym discovery focus on finding entity synonyms from structured or semi-structured data such as query logs (Ren and Cheng, 2015), web tables (He et al., 2016), and synonymy dictionaries (Ustalov et al., 2017b,a). In comparison, this work aims to develop a method that extracts synonym sets directly from a raw text corpus.
Given a corpus and a term list, one can leverage surface strings (Wang et al., 2019), co-occurrence statistics (Baroni and Bisi, 2004), textual patterns (Yahya et al., 2014), distributional similarity (Wang et al., 2015), or their combinations (Qu et al., 2017; Fei et al., 2019) to extract synonyms. These methods mostly find synonymous term pairs or a ranked list of a query entity's synonyms, instead of entity synonym sets. Some studies propose to further cut off the ranked list into a set output (Ren and Cheng, 2015) or to build a synonym graph and then apply graph clustering techniques to derive synonym sets (Oliveira and Gomes, 2014; Ustalov et al., 2017b). However, they all operate directly on the entire input vocabulary, which can be too extensive and noisy. Compared to them, our approach leverages the semantic class information detected by set expansion to enhance the synonym set discovery process.

Conclusions
This paper shows that entity set expansion and synonym discovery are two tightly coupled tasks that can mutually enhance each other. We present SynSetExpan, a novel framework that jointly conducts the two tasks, and SE2, the first large-scale synonym-enhanced set expansion dataset. Extensive experiments on SE2 and several other benchmark datasets demonstrate the effectiveness of SynSetExpan on both tasks. In the future, we plan to study how to apply SynSetExpan at the entity mention level for conducting contextualized synonym discovery and set expansion.

B SynSetExpan Framework Complexity
From Algorithm 1 in the main text, we can see that our SynSetExpan framework costs O(T × (1 + K) × |S| + |V|) per iteration, where T is the number of ensembles (usually 50), K is the negative sampling size (usually 10-20), |S| is the size of the currently expanded set (usually < 100), and |V| is the vocabulary size. Although this complexity looks expensive, we can significantly reduce the practical running time in two ways. First, we can learn the T separate classifiers in the set expansion model in parallel. Second, we can aggregate all words in the vocabulary into one batch and apply the synonym discovery model for inference in a single run. We report the practical running time of each component in the experiments below.
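The first speedup, training the T independent classifiers in parallel, can be sketched as follows. The `train_one_classifier` body is a hypothetical stand-in (it only draws the K negatives per positive that a real base classifier would be fit on); only the parallelization pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def train_one_classifier(seed_set, vocab, rng_seed, neg_ratio=10):
    """Stand-in for training one base classifier: sample K negatives per
    positive seed and return the (positives, negatives) it would be fit on."""
    rng = random.Random(rng_seed)
    candidates = [t for t in vocab if t not in seed_set]
    n_neg = min(neg_ratio * len(seed_set), len(candidates))
    return (sorted(seed_set), rng.sample(candidates, n_neg))

def train_ensemble(seed_set, vocab, T=50, workers=8):
    """The T classifiers are independent, so train them concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(train_one_classifier, seed_set, vocab, s)
                   for s in range(T)]
        return [f.result() for f in futures]

vocab = [f"term_{i}" for i in range(1000)]
ensemble = train_ensemble({"term_0", "term_1"}, vocab, T=50)
```

Each classifier draws its own random negatives (here seeded per worker for reproducibility), which is what makes the ensemble members diverse enough to be worth averaging.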

C Existing ESE Datasets
An ideal set expansion benchmark dataset should contain four parts: a corpus, a vocabulary, a set of complete semantic classes, and a collection of seed queries for each semantic class. One of the earliest corpus-based set expansion works (Pantel et al., 2009) uses "List of" pages in Wikipedia to construct 50 semantic classes and applies random sampling to construct 30 queries for each class. Although those classes and queries are still available today, we have no access to the underlying corpus and vocabulary and thus cannot easily reproduce their results. Similarly, SEISA (He and Xin, 2011) and EgoSet (Rong et al., 2016) release their constructed semantic classes and seed queries but withhold the corpus and vocabulary. Conversely, SetExpander (Mamou et al., 2018b) and CaSE (Yu et al., 2019b) clearly describe their corpus and vocabulary but do not release their classes and queries. To the best of our knowledge, SetExpan (Shen et al., 2017) is the only public dataset consisting of all four essential components. However, it contains only 65 queries from 13 classes and has no synonym information. Table 8 compares our proposed SE2 with all existing datasets, showing that our new dataset contains all four key parts of a set expansion benchmark dataset, as well as additional synonym information.
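The four benchmark components (plus the synonym sets SE2 adds) can be captured in a small schema; this is a hypothetical sketch of how such a dataset could be packaged, with all field names and the `validate` check being our own illustration rather than SE2's actual release format:

```python
from dataclasses import dataclass, field

@dataclass
class SetExpansionBenchmark:
    """The four components an ideal ESE benchmark should ship with,
    plus the optional synonym sets that SE2 additionally provides."""
    corpus_path: str   # raw text corpus
    vocabulary: list   # candidate entity strings
    classes: dict      # class name -> set of member entities
    queries: dict      # class name -> list of seed queries (lists of entities)
    synsets: dict = field(default_factory=dict)  # class name -> list of synonym sets

    def validate(self):
        """Every seed query must draw only from its class's entity set."""
        for cls, qs in self.queries.items():
            for q in qs:
                assert set(q) <= self.classes[cls], f"query {q} leaves class {cls}"
        return True

bench = SetExpansionBenchmark(
    corpus_path="wiki_20171201.txt",  # illustrative path
    vocabulary=["usa", "canada", "paris"],
    classes={"Country": {"usa", "canada"}},
    queries={"Country": [["usa", "canada"]]},
)
```

A dataset missing any one of the four fields cannot support end-to-end, reproducible evaluation, which is exactly the gap the surveyed datasets exhibit.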

D SE2 Dataset Construction Details
We construct our dataset in four stages: (1) Corpus and vocabulary selection, (2) Semantic class selection, (3) Query generation and class enrichment, and (4) Synonym set curation.
Corpus and vocabulary selection. An ideal corpus for the set expansion task should contain a diverse set of semantic classes and enough context information for methods to discover those sets. Based on these two criteria, we select the Wikipedia 20171201 dump as our evaluation corpus. This corpus is also used in previous studies (Mamou et al., 2018b,a) and contains 1.9 billion tokens with a raw size of 14GB. Next, we extract all noun phrases with frequency above 10 and filter out those that start with either a stopword (e.g., "a/an" and "the") or a non-word character (e.g., "(" and "-"). The remaining 1.47 million noun phrases constitute our vocabulary.

Semantic class selection. To select a diverse set of semantic classes, we first use simple string matching to align our corpus and vocabulary with two benchmark datasets designed for tasks closely related to set expansion: (1) DBpedia-Entity v2 (Hasibi et al., 2017) for Entity Search (particularly entity list search), and (2) WikiTable (Bhagavatula et al., 2015; Zhang and Balog, 2018) for Entity Linking in Wikipedia tables. Then, we retain all semantic classes with at least 10 entities and obtain 60 classes in total, covering 6 different types (e.g., Person, Location, and Organization). Table 9 shows some examples. The generated classes have high precision but low recall, in the sense that some correct entities are not included. In the following stage, we enlarge each semantic class and increase its coverage using crowdsourcing.
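The vocabulary filtering step above can be sketched as a small predicate; the stopword list here is an illustrative subset, not the full list used to build SE2:

```python
import re

STOPWORDS = {"a", "an", "the"}  # illustrative subset of the actual stopword list

def keep_noun_phrase(phrase, freq, min_freq=10):
    """Apply the SE2 vocabulary filters: frequency above 10, and the phrase
    must not start with a stopword or a non-word character."""
    if freq <= min_freq:
        return False
    first_token = phrase.split()[0].lower()
    if first_token in STOPWORDS:
        return False
    if not re.match(r"\w", phrase):  # rejects leading "(", "-", etc.
        return False
    return True
```

Applying this predicate over all extracted noun phrases yields the 1.47M-term vocabulary described above.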
Query generation and class enrichment. For each semantic class, we generate 5 queries for each of four query sizes (2, 3, 4, and 5), which results in 20 queries per class and 1,200 queries in total. Furthermore, we want those queries to cover both popular and long-tail entities. To achieve this, we first sort all entities within each class by frequency. Then, we generate each subgroup of 5 queries (of the same size M ∈ {2, 3, 4, 5}) as follows: we select 1 query consisting of the M most frequent entities, 2 queries of entities sampled from the top-10% frequency quantile, and 2 queries of entities sampled from the [top-10%, top-30%] frequency range.
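The per-class query generation scheme above can be sketched as follows; the quantile slicing and fallback for very small classes are our assumptions, as the paper does not specify those edge cases:

```python
import random

def generate_queries(entities_by_freq, rng):
    """Generate the 20 queries for one class: for each size M in {2,3,4,5},
    1 query of the M most frequent entities, 2 sampled from the top-10%
    quantile, and 2 sampled from the [top-10%, top-30%] band."""
    n = len(entities_by_freq)
    top10 = entities_by_freq[: max(1, n // 10)]
    band = entities_by_freq[n // 10 : max(1, 3 * n // 10)]
    queries = []
    for m in (2, 3, 4, 5):
        queries.append(entities_by_freq[:m])  # the M most frequent entities
        for _ in range(2):  # top-10% quantile (fall back if the band is too small)
            queries.append(rng.sample(top10, m) if len(top10) >= m
                           else entities_by_freq[:m])
        for _ in range(2):  # [top-10%, top-30%] band
            queries.append(rng.sample(band, m) if len(band) >= m
                           else entities_by_freq[:m])
    return queries

rng = random.Random(0)
queries = generate_queries([f"e{i}" for i in range(100)], rng)
```

Running this over all 60 classes yields the 1,200 queries of the benchmark.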
After generating the queries, we run all baseline methods to retrieve their top 100 results and aggregate them into a set of 17,400 ⟨class, entity⟩ pairs. Next, we employ crowdworkers to check all those pairs on Amazon Mechanical Turk. Crowdworkers are required to have a 95% HIT acceptance rate and a minimum of 1000 HITs, and to be located in the United States or Canada. Workers are asked to view one semantic class and six candidate entities, and to select all entities that belong to the given class. On average, workers spend 40 seconds on each task and are paid $0.1, which is equivalent to a $9 hourly payment. All ⟨class, entity⟩ pairs are labeled by three workers independently, and the inter-annotator agreement is 0.8204, measured by Fleiss's kappa (κ). Finally, we enrich each semantic class C_j by adding each entity e_i whose corresponding pair ⟨C_j, e_i⟩ is labeled "True" by at least two workers.
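Fleiss's kappa, the agreement measure used above, can be computed from a count matrix as follows (a standard implementation of the textbook formula, shown for illustration):

```python
def fleiss_kappa(ratings):
    """Fleiss's kappa. ratings[i][j] = number of raters that assigned item i
    to category j; every item must have the same total number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    total = n_items * n_raters
    # Overall proportion of assignments falling into each category.
    p = [sum(row[j] for row in ratings) / total for j in range(n_cats)]
    # Observed agreement for each item.
    P = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
         for row in ratings]
    P_bar = sum(P) / n_items          # mean observed agreement
    P_e = sum(x * x for x in p)       # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

With three raters per pair and two categories (True/False for class membership), a value of 0.8204 indicates substantial agreement.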
Synonym set curation. To construct synonym sets within each semantic class, we first run all baseline methods to generate a candidate pool of possible synonymous pairs. Then, we enlarge this pool to include all term pairs that form an inflection 8 . After that, we automatically treat terms that can be mapped to the same entity in WikiData 9 as positive pairs and manually label the remaining 7,625 pairs. The inter-annotator agreement is 0.8431. Note that here we do not use Amazon MTurk, because labeling synonym pairs is much simpler than labeling entity class membership and also involves less ambiguity. We avoid using the YAGO KB here in order to prevent a possible data leakage problem. Next, we construct a synonym graph where each node is a term and each edge connects two synonymous terms. Finally, we extract all connected components in this synonym graph and treat them as synonym sets.
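The final step, extracting connected components of the synonym graph as synsets, can be sketched as a plain BFS (singleton terms become their own synsets):

```python
from collections import defaultdict

def extract_synsets(terms, synonym_pairs):
    """Build the synonym graph and return its connected components as
    sorted synsets; terms with no synonyms form singleton synsets."""
    adj = defaultdict(set)
    for a, b in synonym_pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, synsets = set(), []
    for term in terms:
        if term in seen:
            continue
        # BFS over the synonym graph from this term.
        component, frontier = set(), [term]
        while frontier:
            t = frontier.pop()
            if t in component:
                continue
            component.add(t)
            frontier.extend(adj[t] - component)
        seen |= component
        synsets.append(sorted(component))
    return synsets
```

Because synonymy edges are treated as transitive here, two terms end up in the same synset whenever any chain of labeled pairs connects them.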

E SE2 Dataset Analysis
We analyze properties of the SE2 dataset from the following aspects: (1) semantic class size, (2) set expansion difficulty of each class, and (3) synonym discovery difficulty of each class.
Semantic class size. The 60 semantic classes in our SE2 dataset contain on average 145 entities (with a minimum of 16 and a maximum of 864), for a total of 8,697 entities. After grouping these entities into synonym sets, the 60 classes contain on average 118 synsets (with a minimum of 14 and a maximum of 800), for a total of 7,090 synsets. The average synset size is 1.258 and the largest synset contains 11 terms.
Set expansion difficulty of each class. We define the set expansion difficulty of each semantic class C as follows:

Difficulty(C) = (1/|C|) · |{ e ∈ C : e ∉ ∪_{e′ ∈ C\{e}} Top_k(e′) }|,

where Top_k(e) represents the set of k most similar entities to entity e in the vocabulary. We set k to 10,000 in this study. Intuitively, this metric calculates the average portion of entities in class C that cannot be easily found via another entity in the same class. As shown in Table 2, the most difficult classes are the LOC classes 10 and the easiest ones are the FAC classes.
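Following the intuition above (an entity is "easily found" if it appears in some other class member's top-k neighbor list, which is our reading of the metric), the difficulty can be computed as:

```python
def class_difficulty(cls, topk):
    """Fraction of entities in `cls` that appear in no other class member's
    top-k neighbor list. topk: entity -> set of its k nearest vocabulary
    entities (precomputed from embeddings)."""
    hard = 0
    for e in cls:
        reachable = any(e in topk[other] for other in cls if other != e)
        if not reachable:
            hard += 1
    return hard / len(cls)

# Toy example: "c" is in nobody else's neighbor list, so difficulty = 1/3.
toy_topk = {"a": {"b"}, "b": {"a"}, "c": {"a"}}
score = class_difficulty(["a", "b", "c"], toy_topk)
```

A class scores high when its members are scattered in embedding space, so that seeds give little traction on the rest of the class.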
Synonym discovery difficulty of each class. We continue to measure the difficulty of finding synonym pairs in each class. Specifically, we calculate two metrics: (1) lexical difficulty, defined as the average Jaro-Winkler distance 11 between the surface names of two synonyms, and (2) semantic difficulty, defined as the average cosine distance between two synonymous entities' embeddings.

6. CGExpan 16 : We use BERT-base-uncased as its underlying language model for generating class names. We run CGExpan for 5 iterations; each iteration finds 5 candidate classes and adds the 10 most confident entities into the currently expanded set.
7. SynSetExpan: We set the ensemble times T = 50, the negative sampling ratio (in the set expansion model) K = 10, the maximum iteration number max_iter = 6, the number of fine-tuning trees H = 10, and the negative sampling ratio (in the synonym discovery model) N = 10. For other (less important) hyper-parameters, we discuss their values directly in the paper; SynSetExpan is robust to those hyper-parameters.
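The two pairwise distances behind the synonym discovery difficulty metrics of Appendix E can be sketched as follows; this is a standard Jaro-Winkler implementation (with the usual 0.1 prefix weight over at most 4 characters), and we assume "distance" means one minus the corresponding similarity:

```python
import math

def jaro(s1, s2):
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # Count transpositions among the matched characters.
    t, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t /= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler_distance(s1, s2, p=0.1):
    """1 - Jaro-Winkler similarity (prefix bonus on up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return 1.0 - (j + prefix * p * (1 - j))

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)
```

Averaging each distance over all synonym pairs in a class gives its lexical and semantic difficulty, respectively.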

F.2 Hyper-parameter Sensitivity Analysis
Within our SynSetExpan framework, there are two important hyper-parameters in the set expansion model: the ensemble times T and the negative sampling ratio K. Figure 5 shows the hyper-parameter sensitivity analysis. We see that the model performance first increases as the ensemble times T grows from 1 to 10 and then becomes stable as T further increases. A similar trend is also observed for the negative sampling ratio K. Thus, SynSetExpan is insensitive to these two hyper-parameters as long as their values are larger than 10.

Figure 5: Sensitivity analysis of hyper-parameters T (a) and K (b) in SynSetExpan for the set expansion task.

F.3 Efficiency Analysis
We test the efficiency of our SynSetExpan framework (with T = 50 and K = 10) on a single server with 20 CPU threads. For each query, the first iteration of SynSetExpan takes 7.5 seconds on average, the first three iterations take 27 seconds, and the first six iterations take 56 seconds. Later iterations take longer because more entities are in the expanded set by that point. In comparison, one iteration of EgoSet takes 86 seconds, six iterations of SetExpan take 188 seconds, and CGExpan takes 174 seconds for five iterations on a 1080Ti GPU. These results demonstrate the efficiency of SynSetExpan.

G.1 PubMed Dataset Details
Besides our SE2 dataset, we also evaluate SynSetExpan on the synonym discovery task using the public PubMed dataset, which consists of a corpus of 1.5 million paper abstracts in the biomedical domain, a vocabulary of 357,991 terms, and a collection of 203,648 synonym pairs (10,486 positive pairs and 193,162 negative pairs). All terms involved in synonym pairs are linked to an entity in the UMLS knowledge base 17 , and we group these terms into 10 semantic classes based on their linked entities' types.

G.2 Implementation Details and Hyper-parameter Choices
All compared synonym discovery methods are tested using the same distant supervision data (cf. Section 3 in the main text), and hyper-parameter values are obtained using 5-fold cross validation.
17 https://uts.nlm.nih.gov/home.html

We discuss the implementation details and hyper-parameter choices of each compared synonym discovery method below:

1. SVM 18 : We use the RBF kernel and set the regularization parameter λ to 0.3.

2. XGBoost 19 : We set the maximum tree depth to 5, γ = 0.1, η = 0.1, the subsample ratio to 0.5, and use the default values for all other hyper-parameters.
3. SynSetMine 20 : We use two hidden layers (of dimensions 250 and 500) for its internal set encoder. We learn the model using the "mix sampling" strategy.
4. DPE 21 : We set the embedding dimension as 300, λ = 0.3, and use the default values for all other hyper-parameters.
5. SynSetExpan: We use the same hyper-parameter values as XGBoost to obtain the class-agnostic synonym discovery model. During the fine-tuning stage, we fit 10 additional trees in each iteration. For other (less important) hyper-parameters, we discuss their values directly in the paper; SynSetExpan is robust to those hyper-parameters.

G.3 Hyper-parameter Sensitivity Analysis
We study how sensitive SynSetExpan is to the choice of two fine-tuning hyper-parameters in its synonym discovery module: (1) the number of additional fitted trees H, and (2) the negative sampling ratio N used to construct the pseudo-labeled dataset for fine-tuning. Results are shown in Figure 6. First, we find that our model is insensitive to the negative sampling ratio N in terms of all three metrics. Second, we notice that the model performance first increases as H grows, until H reaches about 15, and then starts to decrease as we further increase H. Although SynSetExpan is somewhat sensitive to the hyper-parameter H, we find that a wide range of H choices outperform H = 0, which essentially disables model fine-tuning.

G.4 Efficiency Analysis
By linking the SE2 dataset with the YAGO KB, we obtain 260 thousand synonym pairs, based on which training a class-agnostic synonym discovery model takes 15 minutes. Then, each iteration of SynSetExpan generates on average 5,000 pseudo-labeled synonym pairs, and fitting 10 additional trees takes about 0.75 seconds. After training, our synonym discovery model can predict 4,000 term pairs per second.