A Two-Stage Masked LM Method for Term Set Expansion

We tackle the task of Term Set Expansion (TSE): given a small seed set of example terms from a semantic class, finding more members of that class. The task is of great practical utility, and also of theoretical utility as it requires generalization from few examples. Previous approaches to the TSE task can be characterized as either distributional or pattern-based. We harness the power of neural masked language models (MLM) and propose a novel TSE algorithm, which combines the pattern-based and distributional approaches. Due to the small size of the seed set, fine-tuning methods are not effective, calling for more creative use of the MLM. The gist of the idea is to use the MLM to first mine for informative patterns with respect to the seed set, and then to obtain more members of the seed class by generalizing these patterns. Our method outperforms state-of-the-art TSE algorithms. Implementation is available at: https://github.com/guykush/TermSetExpansion-MPB/


Introduction
Term Set Expansion (TSE) is the task of expanding a small seed set of terms into a larger (ideally complete) set of terms that belong to the same semantic category. For example, the seed set {"orange", "apple"} should expand into a set of fruits, while {"orange", "blue"} into a set of colors, and {"apple", "google"} into a set of tech companies. Beyond being of great practical utility, the TSE task is a challenging instance of generalization from few examples. Solving TSE requires the algorithm to: (1) identify the desired concept class based on few examples; and (2) identify additional members of the class.
We present an effective TSE method which is based on querying large, pre-trained masked language models (MLMs). Pre-trained language models (LMs) have been shown to contain semantic (Tenney et al., 2019), syntactic (Goldberg, 2019; Hewitt and Manning, 2019; Linzen et al., 2016) and factual knowledge (Petroni et al., 2019), and to be great starting points for transfer-learning to new tasks via fine-tuning on few examples. However, the TSE seed sets are too small for fine-tuning, calling for a different approach. Our method uses the MLMs directly for the task they were trained for (language modeling) by issuing word-completion queries and operating on the returned word distributions. Previous solutions to the TSE problem (also called semantic class induction) can be roughly categorized into distributional and pattern-based approaches (Shi et al., 2010). Our method can be seen as a combination of the two.
The distributional approach to TSE (Hindle, 1990; Pantel and Lin, 2002; Pantel et al., 2009; Mamou et al., 2018; Mahabal et al., 2018) operates under the hypothesis that similar words appear in similar contexts (Harris, 1968). These methods represent each term in the vocabulary as an embedding vector that summarizes all the contexts in which the term appears in a large corpus, and then look for terms with vectors that are similar to those of the seed terms. The methods differ in their context definitions and in their way of computing similarities. A shortcoming of these methods is that they consider all occurrences of a term in the corpus when calculating its representation, including many contexts that are irrelevant to the concept at hand due to polysemy, noise in the corpus or non-informative contexts. In contrast, the pattern-based approach considers specific indicative patterns that signal the desired concept, looking for them in a large corpus, and extracting the terms that appear in them. Patterns can be binary (Hearst, 1992; Ohshima et al., 2006; Zhang et al., 2009) ("such as X or Y"), indicating that both X and Y belong to the same class, or unary (Gupta and Manning, 2014; Wang and Cohen, 2007) ("fruits such as X", "First I painted the wall red, but then I repainted it X"), suggesting that X belongs to a certain category (fruit, color). The patterns can be determined manually (Hearst, 1992) or automatically (Wang and Cohen, 2007; Gupta and Manning, 2014). While well-tailored patterns can be precise and interpretable, a notable shortcoming of pattern-based methods is their lack of coverage, due to the challenge of finding patterns that are specific enough to be accurate yet common enough in a large corpus to be useful. Wang and Cohen (2007) use patterns from non-natural language (HTML), while Gupta and Manning (2014) restrict themselves to short patterns of 2-4 words to each side of the masked term.
Our method. By using MLMs, we combine the power of the pattern-based and the distributional approaches: like the pattern-based approaches, we consider only specific, indicative corpus locations (retaining specificity and transparency). We then use the distributional nature of the neural LM to generalize across patterns and corpus locations. We use sentences with a single masked location as indicative patterns. For example, "We took Rexy, our pet ___, to the vet." is an indicative pattern for the house animals semantic class. Given an initial set of seed terms, we first search the corpus for indicative patterns for members of the set (2.1). Intuitively, an indicative pattern is a corpus location which is considered by an LM to be a good fit for all seed members. Once we have identified indicative patterns, we extend the set to terms that can appear in similar patterns. We propose two methods for doing this. The first method (2.2) queries an MLM for completions. While effective, this method restricts the expanded set to the LM vocabulary. The second method (2.3) uses the MLM to define a similarity metric over patterns, and searches the corpus for terms that appear in patterns that are similar to the indicative ones. To summarize, we embrace the pattern-based approach, while using distributional similarity for identifying good patterns as well as for generalizing across patterns.

Method
Task formulation. We are given a seed set S of k terms, S = {t_1, ..., t_k}, that come from a larger (and unknown) gold set S_g. Our goal is to return S_g. Practically, our (and other) algorithms return a ranked list of terms rather than a fixed set. The evaluation is then performed over the ranking: ideally, all terms in S_g will be ranked above all terms not in S_g.
We operate in stages. First, we search the corpus for indicative masked patterns m_1, ..., m_ℓ that signal the concept class of S_g with high probability. Then, we use the patterns to extend the set.

Finding indicative masked-patterns
A masked pattern m is a sequence of words with a single masked location (marked as "___"), where the mask stands for one or more words. We look for patterns such that, with high probability, instances of the desired semantic class will make good mask replacements, while instances of other classes will make bad replacements. For example, "The capital of ___" is a good pattern for the "countries" class.
We collect L pattern candidates for each seed term t_j by querying a corpus for sentences that contain the term, and replacing the term position with a mask. We then score each of the kL resulting pattern candidates m_i, and take the ℓ best ones.
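The candidate-mining step can be sketched as follows; this is a minimal illustration assuming a plain-text corpus and simple whole-word matching (the function name, mask token, and toy corpus are ours, not from the paper):

```python
import re

def mine_pattern_candidates(corpus_sentences, seed_terms, L, mask="[MASK]"):
    """For each seed term, collect up to L sentences containing it and
    replace the term's occurrence with a mask token, yielding at most
    k*L pattern candidates."""
    candidates = []
    for term in seed_terms:
        found = 0
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for sent in corpus_sentences:
            if found >= L:
                break
            if pattern.search(sent):
                candidates.append(pattern.sub(mask, sent, count=1))
                found += 1
    return candidates

# Toy corpus for illustration.
corpus = [
    "I ate an orange for breakfast.",
    "She painted the door orange last week.",
    "An apple a day keeps the doctor away.",
    "Apple released a new phone.",
]
cands = mine_pattern_candidates(corpus, ["orange", "apple"], L=2)
```

In practice the corpus query would run over an indexed collection rather than a Python list, but the masking logic is the same.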
Intuitively, we seek a diverse set of patterns in which all seed terms are ranked high (i.e., have a low rank index) in the MLM's prediction: we look for patterns whose worst-fitting seed term is still high on the list of replacement terms. Formally, let LM(m) be the word completions (mask replacements) predicted by the LM for pattern m, ranked by their probability, and let R_LM(t, m) be the rank (index) of term t in LM(m).
The score of a pattern is then the maximal rank of any of the seed terms:

s(m) = max_{t_j ∈ S} R_LM(t_j, m)

We then sort the patterns by s(m_i) and take the ℓ patterns with minimal values. This min-over-max formulation ensures that the patterns are a good fit for all seed terms (contrast this with a min-over-average formulation, which may select patterns that score very well on some seed terms but badly on others). To achieve the diversity objective, we use the following heuristic: after sorting all candidate patterns m_i by s(m_i), rather than taking the first ℓ items we go over the sorted list in order, and keep a pattern only if it differs from every already-kept pattern in at least 50% of its tokens. We do this until we have collected ℓ patterns.
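The min-over-max scoring and the diversity heuristic can be sketched as follows, with a toy ranked completion list standing in for a real MLM (function names, example patterns, and the exact diversity criterion are illustrative):

```python
def pattern_score(ranked_completions, seed_terms):
    """s(m): the worst (maximal) rank of any seed term in the LM's
    ranked completion list for pattern m; lower is better."""
    def rank(t):
        try:
            return ranked_completions.index(t)
        except ValueError:
            return len(ranked_completions)  # not predicted: worst possible rank
    return max(rank(t) for t in seed_terms)

def select_diverse_patterns(scored, n):
    """Greedily walk the candidates in increasing score order, keeping a
    pattern only if at least 50% of its tokens differ from every
    already-kept pattern, until n patterns are collected."""
    kept = []
    for pattern, _ in sorted(scored, key=lambda x: x[1]):
        toks = set(pattern.split())
        if all(len(toks - set(p.split())) >= 0.5 * len(toks) for p in kept):
            kept.append(pattern)
        if len(kept) == n:
            break
    return kept

# Toy data: (pattern, s(m)) pairs with precomputed scores.
scored = [("the capital of [MASK]", 1),
          ("the capital city of [MASK]", 2),
          ("flights to [MASK] depart daily", 3)]
chosen = select_diverse_patterns(scored, 2)
```

Here the second candidate is dropped because it shares too many tokens with the first, so the third, lexically different pattern is kept instead.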

Seed set extension via MLM query
Having identified indicative patterns, we now turn to suggest terms for expanding the seed set. Each indicative pattern m_i naturally provides a ranked list of candidate terms LM(m_i) = t_1, ..., t_|V|, where V is the LM's vocabulary and each term t_j is scored by its pattern-conditional probability. We combine the term scores from all chosen indicative patterns using a product-of-experts approach, scoring each term by the product of the probabilities (sum of log-probabilities) assigned to it by each context. Let p_LM(t|m_i) be the probability assigned to vocabulary term t in pattern m_i. The term score is:

score(t) = Σ_{i=1..ℓ} c_i · log p_LM(t|m_i)

where c_i = maxRank(m_i)^(-1) / Σ_{j=1..ℓ} maxRank(m_j)^(-1) is a weighing factor for indicative pattern m_i, giving more weight to "tighter" indicative patterns.
This method is fast and effective, requiring only ℓ queries to the LM. However, it assumes that all the desired terms from S_g appear as vocabulary items in the LM. This assumption often does not hold in practice: first, for efficiency reasons, pre-trained LM vocabularies are often small (∼50k items), precluding rare words. Second, many terms of interest are multi-word units, which do not appear as single items in the LM vocabulary.
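A minimal sketch of this product-of-experts combination, using toy completion distributions in place of real MLM outputs (function names, probabilities, and ranks are illustrative; maxRank values are assumed to be 1-based):

```python
import math

def poe_scores(pattern_dists, max_ranks):
    """Product-of-experts term scoring: each term's score is the
    weighted sum of its log-probabilities across indicative patterns,
    with weights c_i proportional to 1/maxRank(m_i), so 'tighter'
    patterns contribute more."""
    inv = [1.0 / r for r in max_ranks]
    z = sum(inv)
    weights = [w / z for w in inv]  # the c_i, normalized to sum to 1
    vocab = set().union(*(d.keys() for d in pattern_dists))
    return {
        t: sum(c * math.log(d.get(t, 1e-12))  # small floor for unseen terms
               for c, d in zip(weights, pattern_dists))
        for t in vocab
    }

# Toy mask-completion distributions for two hypothetical indicative
# patterns (illustrative numbers only, not real model probabilities).
dists = [{"paris": 0.5, "rome": 0.25},
         {"paris": 0.4, "rome": 0.4}]
scores = poe_scores(dists, max_ranks=[1, 2])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the weights sum to one, the combined score is a weighted geometric mean in probability space: a term must fit every indicative pattern reasonably well to rank high.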

Extended coverage via pattern similarity
We seek a term expansion method that utilizes the power of the pre-trained LM without being restricted by its vocabulary: we would like to identify rare words, out-of-domain words, and multi-word units.
Our solution is to generalize the indicative patterns. Rather than looking for terms that match the patterns, we instead search a large corpus for patterns which are similar to the indicative ones, and collect the terms that appear within them. Following the distributional hypothesis, these terms should be of the desired concept class.
By looking at patterns that surround corpus locations, we are no longer restricted by the LM vocabulary to single-token terms.
However, considering all corpus locations as candidate patterns is prohibitively expensive. Instead, we take a ranking approach and restrict ourselves to corpus locations that correspond to occurrences of candidate terms returned by a high-recall algorithm. We use the LM to define a similarity measure between two masked patterns that aims to capture our desired notion of similarity: masked patterns are similar if they are likely to be filled by the same terms. Let top_q(LM(m_i)) be the q highest-scoring terms for pattern m_i. We define the similarity between two patterns as the fraction of shared terms in their top-q predictions (q being a hyperparameter):

sim(m_i, m_j) = |top_q(LM(m_i)) ∩ top_q(LM(m_j))| / q

For a candidate term t, let pats(t) = m^t_1, ..., m^t_n be the set of patterns derived from it: sentences that contain t, where t is replaced with a mask. Note that t can be an arbitrary word or word sequence. We wish to find terms for which the similarity between pats(t) and the indicative patterns is high. However, since words have different senses, it is sufficient for only some patterns in pats(t) to be similar to patterns in m_1, ..., m_ℓ. We score a term t as:

score(t) = Σ_{i=1..ℓ} c_i · max_{m ∈ pats(t)} sim(m_i, m)

where c_i is the pattern weighing factor from equation (2). As Σ_{i=1..ℓ} c_i = 1, the term score score(t) is in [0, 1] for every term t.
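The pattern-similarity measure and per-term scoring can be sketched as follows, with toy top-q completion lists standing in for real MLM predictions (all names and data are illustrative):

```python
def pattern_sim(top_i, top_j, q):
    """sim(m_i, m_j): fraction of shared terms among the two patterns'
    top-q LM completions."""
    return len(set(top_i[:q]) & set(top_j[:q])) / q

def term_score(indicative_tops, weights, candidate_tops, q):
    """score(t): for each indicative pattern, take the best similarity
    over the candidate term's own patterns (so one matching sense of t
    suffices), then combine using the c_i weights."""
    return sum(
        c * max(pattern_sim(ind, cand, q) for cand in candidate_tops)
        for c, ind in zip(weights, indicative_tops)
    )

# Toy top-q completion lists (stand-ins for real MLM predictions).
indicative = [["france", "spain", "italy", "greece"]]   # one indicative pattern
candidate = [["france", "spain", "cat", "dog"],         # country-like context of t
             ["blue", "red", "green", "yellow"]]        # unrelated sense of t
s = term_score(indicative, [1.0], candidate, q=4)
```

Note how the max over pats(t) lets the country-like context dominate: the unrelated sense of the candidate term does not drag its score down.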

Experiments and Results
Evaluation sets. These are hard to come by: previous work either used a small number of sets, or used automatic set-acquisition methods which commonly yield incomplete sets. We curated a dataset with 7 closed, well-defined sets, which we make publicly available. The sets are National Football League teams (NFL, size: 32), Major League Baseball teams (MLB, 30), US states (States, 50), Countries (Cntrs, 195), European countries (Euro, 44), Capital cities (Caps, 195) and Presidents of the USA (Pres, 44). We also provide one open-class set: Music Genres (Genre). This set was created by manually verifying the items in the union of the outputs of all the different algorithms, and contains around 600 unique items. Compared Methods. We compare our methods, MPB1 (MLM-pattern-based, Section 2.2) and MPB2 (Section 2.3), to two state-of-the-art systems: setExpander (SE) (Mamou et al., 2018) and category builder (CB) (Mahabal et al., 2018).
We also compare to two baselines. The first, BB (basic-BERT), is a baseline for MPB1: a BERT-based baseline that uses the MPB1 method on patterns derived from sentences that include seed terms, without the selection method described in Section 2.1. The second, S2V, is a baseline for MPB2: a basic distributional method that uses sense2vec (Trask et al., 2015) representations, which is also our candidate-acquisition method for MPB2. As MPB2 relies on external candidate generation, we also report on the oracle case MPB2+O, where we expand the S2V-generated candidate list to include all the members of the class.
Main Results. Our main results are reported in Table 3. Our first method, MPB1, achieves the best scores on two of the three sets suitable for its limitations (where all or most of the set's terms are in the LM's vocabulary), and second-best results on the third. MPB2 outperforms all other methods on 5 out of 7 closed sets when assuming gold-standard candidates (MPB2+O), and even when considering the missing candidates it outperforms other expanders on 4 out of 7 closed sets, averaging the best MAP score over all sets.

Additional experiments. How many sentences should we query when searching for indicative patterns, and how many patterns should we retain? Table 1 shows a grid of these parameters. We use the NFL set for this experiment, as the terms in this set all have more than one meaning, and for most of them the common usage is not the one that belongs to the NFL set (e.g. "jets", "dolphins"). Therefore, this set should give a pessimistic estimate of the number of sentences we need to extract to find quality indicative patterns. The results imply that ∼2000 appearances of seed terms are sufficient, and that good results can be obtained also with fewer instances. This shows that, beyond the data used to train the initial MLM, we do not require a large corpus to achieve good results, suggesting applicability also in new domains. How sensitive is the algorithm to the choice of q when computing the pattern similarity? Table 2 shows that the similarity measure is effective for various q values, with maximal performance at ∼50.
Finally, how do the different methods behave in a case where the seed terms are part of a subset? Table 4 shows a case where the seed terms are European countries. Ideally, we would like the top results to be European countries, later results to be non-European countries, and only then unrelated terms. MPB2+O achieves the best MAP scores on both the set and the subset. In the subset case, even when not provided with all oracle terms, MPB2 is better than all other expanders. While other expanders tend to reach stronger results on either the set or the subset, MPB2+O achieves similar scores on both.

(Note on evaluation: SE does not rank the seed terms, as opposed to the other methods. For fairness, we add them at the beginning of the returned list before computing the MAP score.)

(Note on parameters: while for MPB1 there are no prominent downsides to using a large number of indicative patterns, for MPB2 doing so would also force us to use a large number of occurrences of the candidate terms. This would (1) be costly run-time-wise, and (2) many occurrences of rare terms might not always be available. Therefore, we choose different parameters for MPB1 and MPB2: in both we use 2000 sentences to search for indicative patterns (L = 2000/k), but for MPB1 we use 160 indicative patterns (ℓ = 160) while for MPB2 we use only 20 (ℓ = 20).)

Conclusions
We introduce an LM-based TSE method, reaching state-of-the-art results. The method uses the power of LM predictions to locate indicative patterns for the concept class indicated by the seed terms, and then to generalize these patterns to other corpus locations. Beyond strong TSE results, our method demonstrates a novel use of pre-trained MLMs, using their predictions directly rather than relying on their states for fine-tuning.