Exploring Semantic Capacity of Terms

We introduce and study semantic capacity of terms. For example, the semantic capacity of artificial intelligence is higher than that of linear regression, since artificial intelligence possesses a broader meaning scope. Understanding semantic capacity of terms will benefit many downstream tasks in natural language processing. For this purpose, we propose a two-step model to investigate semantic capacity of terms, which takes a large text corpus as input and can evaluate semantic capacity of terms as long as the corpus provides enough co-occurrence information. Extensive experiments in three fields demonstrate the effectiveness and rationality of our model compared with well-designed baselines and human-level evaluations.


Introduction
Terms are not all considered equal. For instance, in computer science, the meaning scope of artificial intelligence or computer architecture is broader than that of linear regression. To study this phenomenon, in this paper, we introduce Semantic Capacity, a scalar value that characterizes the meaning scope of a term. A good command of semantic capacity will give us more insight into the granularity of terms and allow us to describe things more precisely, which is crucial for downstream tasks such as keyword extraction (Hulth, 2003; Beliga et al., 2015; Firoozeh et al., 2020) and semantic analysis (Landauer et al., 1998; Goddard, 2011). Figure 1 shows the fingerprint visualization of a computer scientist, generated by the Elsevier Fingerprint Engine (https://www.elsevier.com/solutions/elsevier-fingerprint-engine), a popular system that creates an index of weighted terms for research profiling. From the example, we can see that there are some non-ideal terms, such as learning, whose semantic capacity is too high, backpropagation, whose semantic capacity is too low, and even irrelevant terms such as color. Understanding semantic capacity of terms will help us choose better terms to describe entities. Besides, combined with other techniques like word similarity, semantic capacity can also support keyword replacement. For instance, to describe the computer scientist depicted in Figure 1, if the audience is a layman in computer science, we should use terms with high semantic capacity like artificial intelligence; for an expert in the corresponding domain, we can select terms with low semantic capacity like object recognition to make the fingerprint more precise.

(* Equal contribution. Work done while visiting University of Illinois at Urbana-Champaign.)
However, there are countless terms in human language, which means that it is extremely hard to investigate semantic capacity for all existing terms. Besides, semantic capacity of terms is also ambiguous in different domains. For instance, cheminformatics may be considered as a term with low semantic capacity in computer science and a term with high semantic capacity in chemistry.
On the other hand, the information on terms we can acquire is usually very limited and/or noisy. Although semantic taxonomies such as WordNet (Miller, 1995) provide rich semantic relations between words, the information is still limited, and these knowledge bases are expensive to maintain and extend. Besides, there exists research that models hierarchical structures of terms automatically, but most of it suffers from low recall or insufficient precision. For instance, hypernymy discovery (Hearst, 1992; Snow et al., 2005; Roller et al., 2018) aims at finding is-a relations in textual data. If we could find all the hypernymy pairs and construct a perfect tree structure that includes every term, the problem of semantic capacity could be solved to some extent. However, as far as we know, this is almost impossible with the current state of the art.
The above analysis shows that we should focus the problem on a specific domain and cover as many terms as possible with easily accessible information. Besides, we should also consider user requirements and deal with terms that are not included at first. Therefore, we propose a two-step model that only takes a text corpus as input and can evaluate semantic capacity of terms, provided that the text corpus can give enough co-occurrence signals. Our model consists of an offline construction process and an online query process. The offline construction process measures semantic capacity of terms in a specific semantic space, which narrows the problem to a specific domain and reduces its complexity to a practical level. The online query process deals with users' queries and evaluates newly added terms that users are interested in. To learn semantic capacity of terms from simple co-occurrences between terms, we introduce the Semantic Capacity Association Hypothesis and propose the Lorentz model with normalized pointwise mutual information (NPMI), which places terms in the hyperbolic space through a novel combination with NPMI. Finally, norms of embeddings are interpreted as semantic capacity of terms.
The main contributions of our work are summarized as follows:

• We study semantic capacity of terms. As far as we know, we are the first to introduce and clarify the definition of semantic capacity.
• We propose a two-step model to learn semantic capacity of terms with unsupervised methods. Theoretically, our model can evaluate semantic capacity of any term appearing in the text corpus, as long as the corpus provides enough co-occurrence signals.
• We introduce the Semantic Capacity Association Hypothesis and propose the Lorentz model with NPMI, which is a novel application of NPMI to help place terms in the hyperbolic space. We also conceive a novel idea to interpret norms of embeddings as semantic capacity of terms.
• We conduct extensive experiments on three scientific domains. Results show that our model can achieve performance comparable to scientific professionals, with a small margin to experts, and much better than laymen.
The code and data are available at https:// github.com/c3sr/semantic-capacity.

Methodology
In this section, we introduce the definition of semantic capacity and describe our model in detail.
The overview of our model is shown in Figure 2.

Definition
The semantic capacity of a term depends on its inherent semantics, the context it is used in, and its associations with other terms in that context. For example, computer science is a term with a broad meaning, and it is considered parallel to other terms with broad meanings like physics and materials science. Besides, computer science is also the parent class of some terms with broad meaning scopes like artificial intelligence and computer architecture. However, understanding the inherent semantics of terms and modeling the associations between all terms found in human language are impractical due to limited resources. Therefore, in this paper, we focus on modeling semantic capacity for terms in a specific domain. The problem is defined as follows:

Definition 1 (Semantic Capacity) The semantic capacity SC(·) of a term is a scalar value that evaluates the relative semantic scope of the term in a specific domain. The larger the value, the broader the semantic scope.

Figure 2: The overview of the two-step model. The model first takes a text corpus as input, and a set of terms is extracted from the corpus. Through the training process, terms are placed in the hyperbolic space, and norms of embeddings are interpreted as semantic capacity of terms. For terms that users are interested in but are not yet in the hyperbolic space, the model trains online and returns the corresponding results.
Semantic capacity reflects the generality of a term in a specific domain of interest: the larger the value, the more general the term. According to the Distributional Inclusion Hypotheses (DIH) (Geffet and Dagan, 2005), if X is the superclass of Y, then all the syntactic-based features of Y are expected to appear in X. Therefore, a term with a broad meaning scope is expected to contain all features of its subclasses, and these subclasses are in turn expected to contain the features of other terms with narrower meaning scopes. Associations between terms can be considered a kind of syntactic-based feature, so terms with higher semantic capacity are more likely to associate with more terms. Besides, in addition to DIH, we also make a new observation: terms like artificial intelligence are more likely to have a strong association with their direct subclasses, such as machine learning, than with descendant classes, such as support vector machine, which means that terms with broader meaning scopes are more likely to associate with other terms with broader meaning scopes. Therefore, we propose the Semantic Capacity Association Hypothesis:

Hypothesis 1 (Semantic Capacity Association Hypothesis) Compared with terms with lower semantic capacity, terms with higher semantic capacity are associated with 1) more terms and 2) terms with higher semantic capacity.

Offline Construction Process
According to the analysis in the introduction, a feasible solution to measure semantic capacity is to focus on a specific domain. Therefore, we first introduce the offline construction process, which aims at learning semantic capacity of terms by taking a large text corpus as input with a number of domain-specific terms extracted from the corpus.
In this paper, to simplify the process and allow easier evaluation, we use the public knowledge base Wikipedia Category as a simple method to extract terms in a specific domain (more details are given in Section 3.1). We could also extract terms from the domain-specific corpus directly by applying term/phrase extraction methods (Velardi et al., 2001; Shang et al., 2018). After this process, our focus turns to learning semantic capacity of these extracted terms using the text corpus.
According to the Semantic Capacity Association Hypothesis, the key to measuring semantic capacity is to model associations between terms and then place terms properly based on the associations among them. Specifically, we aim to capture two types of associations between terms: semantic similarity, e.g., the association between AI (artificial intelligence) and ML (machine learning) is stronger than that between AI and DB (database), since ML is closer to AI than DB in meaning; and status similarity, e.g., the association between AI and ML is stronger than that between AI and SVM (support vector machine), since ML is more parallel to AI than SVM. On the other hand, the number of terms grows exponentially as semantic capacity gets lower, which means we need an exponentially growing space to place terms. Therefore, we design a method, based on associations between terms, that places terms in the hyperbolic space, where circumference and volume grow exponentially with the radius.
Hyperbolic space is a non-Euclidean geometry represented by the unique, complete, simply connected Riemannian manifold with constant negative curvature. Recently, Nickel and Kiela (2017) proposed a hierarchical representation learning model, named the Poincaré ball model, based on the Riemannian manifold P^n = (B^n, g_p), where B^n = {x ∈ R^n : ‖x‖ < 1} is an open n-dimensional unit ball and g_p is the Riemannian metric tensor, which is defined as

$$g_p(x) = \left(\frac{2}{1 - \|x\|^2}\right)^2 g_E, \qquad (1)$$

where x ∈ B^n and g_E is the Euclidean metric tensor. The distance function on P^n is given as

$$d(x, y) = \operatorname{arcosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right). \qquad (2)$$

Given a set of terms and a text corpus, we can count the frequency of co-occurrences freq(x, y) between terms x and y by traversing the corpus with a fixed window size. We can then learn representations of terms directly from co-occurrence information based on the Poincaré ball model. Because of the geometry of hyperbolic space and the distance function, minimizing the loss described in (Nickel and Kiela, 2017) tends to place terms that co-occur with many terms, especially with high co-occurrence counts, near the center of the Poincaré ball. If co-occurrences capture associations between terms well, then according to the Semantic Capacity Association Hypothesis, semantic capacity can to some extent be interpreted through norms of embeddings: SC(x) = 1/‖x‖. However, co-occurrences between terms are very common, and there are many valid reasons for terms to co-occur. For instance, two terms may co-occur because they are parallel (e.g., machine learning and data mining), or because one term includes the other (e.g., artificial intelligence and machine learning). More generally, irrelevant or distant terms may also co-occur. Therefore, the associations modeled by raw co-occurrences are very noisy: terms with high frequency co-occur with more terms and are thus more likely to be placed near the center.
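For concreteness, the distance function on the Poincaré ball can be computed directly from point coordinates; the sketch below is our own illustrative helper (the name `poincare_distance` is not from the paper).

```python
import math

def poincare_distance(x, y):
    """Distance between two points of the open unit ball B^n (Eq. (2)).

    x, y: coordinate lists with Euclidean norm strictly below 1.
    """
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))   # ||x - y||^2
    nx2 = sum(a * a for a in x)                       # ||x||^2
    ny2 = sum(b * b for b in y)                       # ||y||^2
    arg = 1.0 + 2.0 * diff2 / ((1.0 - nx2) * (1.0 - ny2))
    return math.acosh(arg)
```

As a sanity check, the distance from the origin to a point with norm r equals 2 artanh(r), so points near the boundary are exponentially far from the center, which is exactly the property exploited for low-capacity terms.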
However, a term's high frequency does not guarantee that its semantic capacity is also high. On the contrary, there are cases in which terms with lower frequency turn out to possess high semantic capacity. For instance, theoretical computer science is a term with high semantic capacity, yet it is much less commonly used than its subfield terms such as graph theory.
With this in mind, to filter noise and better model associations between terms, we introduce normalized pointwise mutual information (NPMI) (Bouma, 2009) to help place terms in the hyperbolic space. Letting W represent the term set, the NPMI value of terms x and y is given as

$$\mathrm{npmi}(x, y) = \frac{\log \frac{p(x, y)}{p(x)\,p(y)}}{-\log p(x, y)}, \qquad (3)$$

where p(x, y) = 2 · freq(x, y)/Z and p(x) = freq(x)/Z, with freq(x) = Σ_{y∈W} freq(x, y) and Z = Σ_{x∈W} freq(x).
Compared to pointwise mutual information (PMI), NPMI scales the value between −1 and 1, where −1 means x and y never co-occur, 0 means x and y occur independently, and 1 means x and y co-occur completely. If x and y possess a positive relation, given term y, term x will be more likely to occur in the window; thus the NPMI value will be positive. Therefore, in our model, we set a threshold δ > 0 to filter out pairs with negative or weak relations and use the remaining pairs to build the set of associations, which is D = {(x, y) : npmi(x, y) > δ}.
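The counting and filtering steps above can be sketched as follows; the function names, the sliding-window scheme, and the use of unordered pairs are our own illustrative choices, not code from the paper.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, terms, window=20):
    """Count co-occurrences of term pairs within a fixed window (illustrative)."""
    freq = Counter()
    terms = set(terms)
    for i, x in enumerate(tokens):
        if x not in terms:
            continue
        for y in tokens[i + 1 : i + 1 + window]:   # look ahead within the window
            if y in terms and y != x:
                freq[frozenset((x, y))] += 1       # unordered pair
    return freq

def npmi_associations(freq, delta=0.1):
    """Build D = {(x, y) : npmi(x, y) > delta} from pair counts (Eq. (3))."""
    marg = Counter()                               # freq(x) = sum_y freq(x, y)
    for pair, c in freq.items():
        for t in pair:
            marg[t] += c
    Z = sum(marg.values())                         # Z = sum_x freq(x)
    D = {}
    for pair, c in freq.items():
        x, y = tuple(pair)
        p_xy = 2 * c / Z
        npmi = math.log(p_xy / ((marg[x] / Z) * (marg[y] / Z))) / -math.log(p_xy)
        if npmi > delta:                           # keep only positive, strong relations
            D[(x, y)] = npmi
    return D
```

Note that filtering by a positive threshold δ removes both negatively related and weakly related pairs in one step, which is what keeps the association set sparse.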
According to Nickel and Kiela (2018), the Poincaré ball model is not optimal for optimization; therefore, we apply the Lorentz model (Nickel and Kiela, 2018), which can perform Riemannian optimization more efficiently and avoids numerical instabilities. The Lorentz model learns representations on the hyperboloid H^n = {x ∈ R^{n+1} : ⟨x, x⟩_L = −1, x_0 > 0}, where ⟨x, y⟩_L = −x_0 y_0 + Σ_{i=1}^{n} x_i y_i is the Lorentzian scalar product and the distance function is d(x, y) = arcosh(−⟨x, y⟩_L). The Lorentz model and the Poincaré ball model are equivalent, since points in one space can be mapped to the other (Nickel and Kiela, 2018). Compared to the Lorentz model, the Poincaré ball model is more intuitive for interpreting the embeddings. Therefore, we adopt the Lorentz model in our training process and use the Poincaré ball to interpret semantic capacity of terms.
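The standard mapping between the two models, and the resulting semantic capacity readout, can be sketched as below; the helper names are ours.

```python
import math

def lorentz_to_poincare(x):
    """Map a hyperboloid point x = (x0, x1, ..., xn), with
    -x0^2 + x1^2 + ... + xn^2 = -1 and x0 > 0, into the Poincaré ball."""
    return [xi / (1.0 + x[0]) for xi in x[1:]]

def semantic_capacity(x):
    """SC = 1 / ||p||, where p is the Poincaré image of the Lorentz embedding."""
    p = lorentz_to_poincare(x)
    return 1.0 / math.sqrt(sum(pi * pi for pi in p))
```

This is why training can happen on the hyperboloid while the interpretation (norms, and hence semantic capacity) is read off in the ball.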
To learn semantic capacity of terms, we modify the classic loss function of the Lorentz model and propose a new version that considers the strength of association, named the Lorentz model with NPMI. Letting

$$s(x, y) = \frac{\exp(-d(\theta_x, \theta_y))}{\sum_{y' \in N(x)} \exp(-d(\theta_x, \theta_{y'}))},$$

the loss function is given as

$$\mathcal{L}(\Theta) = -\sum_{(x, y) \in D} \mathrm{npmi}(x, y) \cdot \log s(x, y),$$

where N(x) = {y' | (x, y') ∉ D} ∪ {x} is the set of negative examples for x, and Θ = {θ_i}_{i=1}^{|W|} represents the embeddings of terms, with θ_i ∈ H^n. For training, we randomly select a fixed number of negative samples for each associated pair and then try to minimize the distance between the points in this pair, against the negative samples.
Therefore, we aim to solve the optimization problem

$$\min_{\Theta} \mathcal{L}(\Theta) \quad \text{s.t.}\ \theta_i \in H^n,\ i = 1, \ldots, |W|.$$

For optimization, we follow Nickel and Kiela (2018) and perform Riemannian SGD (Bonnabel, 2013).
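A minimal sketch of the NPMI-weighted loss is given below, using the Lorentz distance d(u, v) = arcosh(−⟨u, v⟩_L). As a simplification, the softmax denominator here runs over the sampled negatives plus the positive pair; all names are illustrative, and the Riemannian optimizer itself is omitted.

```python
import math

def lorentz_distance(u, v):
    """d(u, v) = arcosh(-<u, v>_L) on the hyperboloid."""
    inner = -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))
    return math.acosh(max(-inner, 1.0))   # clamp for numerical safety

def npmi_loss(theta, D, negatives):
    """Negative log-softmax loss weighted by association strength.

    theta:     dict term -> Lorentz embedding (list of floats)
    D:         dict (x, y) -> npmi(x, y) for associated pairs
    negatives: dict x -> list of sampled non-associated terms
    """
    loss = 0.0
    for (x, y), w in D.items():
        pos = math.exp(-lorentz_distance(theta[x], theta[y]))
        neg = sum(math.exp(-lorentz_distance(theta[x], theta[n]))
                  for n in negatives[x])
        # minimizing pulls associated pairs together, scaled by npmi(x, y)
        loss -= w * math.log(pos / (pos + neg))
    return loss
```

Strongly associated pairs (high NPMI) thus contribute more to the gradient, which is the intended difference from the classic Lorentz loss.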

Online Query Process
Since the terms that we are interested in may not be in the term set W extracted from the corpus, to evaluate the semantic capacity of newly added terms, we need an online training process to incorporate them into the system. Assuming a number of terms are already placed in the hyperbolic space, adding a few new terms has little impact on the semantic space and the original embeddings. Therefore, we can treat already trained terms as anchor points and add new terms into the space dynamically. More specifically, given a new term a, we find its co-occurrences with the original terms in W in the large corpus and calculate the NPMI values for a according to Eq. (3). The optimization problem is then given as

$$\min_{\theta_a} -\sum_{(a, y) \in D_a} \mathrm{npmi}(a, y) \cdot \log s(a, y), \qquad (8)$$

where D_a is the set of associations that contain a.
The online query process is illustrated in the blue part of Figure 2, where users provide a set of terms. The model first examines whether those terms are already in the space; if so, the system returns the semantic capacity directly. For terms that are not in the space, the system calculates the associations between them and the anchor points in the corpus and solves the optimization problem in Eq. (8) with the Lorentz model with NPMI. Finally, the semantic capacity of these new terms is returned as the reciprocal of the embedding norms in the Poincaré ball. To make the online process more efficient, we can save the statistical information (e.g., co-occurrences with the anchor points) of all terms appearing in the corpus. By doing this, each query can be finished in a short time.
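The query flow can be sketched as follows; `npmi_with_anchors` and `train_online` are hypothetical stand-ins for the cached corpus statistics and the Eq. (8) optimization, and `space` maps already-placed terms to their Poincaré embeddings.

```python
import math

def query_semantic_capacity(terms, space, npmi_with_anchors, train_online):
    """Return SC for each queried term, training new terms online (illustrative)."""
    sc = lambda p: 1.0 / math.sqrt(sum(x * x for x in p))   # SC = 1 / norm
    results = {}
    for t in terms:
        if t not in space:
            # new term: place it against the fixed anchor points (Eq. (8))
            space[t] = train_online(t, npmi_with_anchors(t), space)
        results[t] = sc(space[t])
    return results
```

Since anchors stay fixed, each online query only optimizes one embedding, which is what keeps the query fast; it also extends the space, as noted below.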
All in all, combining the offline construction and the online query processes, we not only deal with the computational problem by focusing on a specific domain, but also gain the ability to evaluate semantic capacity of any term appearing in the text corpus, as long as the corpus provides enough co-occurrence information. Besides, the online training process can also be considered a way to extend the semantic space.

Experiments
In this section, we conduct experiments to validate the effectiveness of our model.

Datasets
We conduct experiments in three fields: computer science, physics, and mathematics.

Computer Science We take a large corpus of computer science text as input and use the top k levels of Wikipedia Category of Computer Science to build the set of terms W_k, which is considered as a simple term extraction process from the corpus.
Since there are some irrelevant terms (considered as noise) in the category, we filter out terms whose "Page views in the past 30 days" are no more than 500 and terms longer than 3 words. Besides, we filter out terms that contain numbers or special symbols. For evaluation, we also extract hypernym-hyponym pairs from Wikipedia Category.
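The noise filters above can be sketched as a single predicate; the exact string test, and treating hyphens as allowed, are our assumptions rather than details from the paper.

```python
import re

def keep_term(term, page_views):
    """Apply the filters described above: recent page views above 500,
    at most 3 words, and no digits or special symbols (hyphens allowed
    here, which is our assumption)."""
    return (page_views > 500
            and len(term.split()) <= 3
            and re.fullmatch(r"[A-Za-z]+(?:[ -][A-Za-z]+)*", term) is not None)
```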
Physics We use the arXiv Papers Metadata Dataset as input and extract terms from the corpus via Wikipedia Category of Physics in the same way as for computer science.

Mathematics We also use the arXiv Papers Metadata Dataset as input and extract terms from the corpus via Wikipedia Category of Mathematics.
Other settings are the same as for computer science. Statistics of the data with respect to W_5 are listed in Table 1. Taking physics as an example, we extract 1090 terms, including 14 at the top 1 level and 127 at the top 2 levels. Among these terms, there are 1393 hypernym-hyponym pairs, including 105 pairs whose hypernym is at the top 1 level and 452 whose hypernym is at the top 2 levels.

Experimental Setup
Since our tasks on semantic capacity are brand new and there is no existing baseline that uses co-occurrences between terms to evaluate semantic capacity of terms, we build or adapt the following models for our experiments:

• Popularity: A simple method that uses the frequency freq(·) to evaluate the semantic capacity of each term, i.e., SC(x) ∝ freq(x).
We also design the following models for an ablation study:

• Euclidean Model (Co-occurrences): A variant of our model that uses the Euclidean space instead of the hyperbolic space and models associations between terms by frequency of co-occurrences instead of NPMI.
• Euclidean Model (NPMI): A variant of our model which uses the Euclidean space instead of the hyperbolic space.
• Lorentz Model (Co-occurrences): A variant of our model which models associations between terms by frequency of co-occurrences instead of NPMI.
Parameter Settings We performed manual tuning for all models and adopted the following hyperparameter values. For all tasks and datasets, to find the co-occurrences between terms, we set the window size to 20. For the training of our models, we set the embedding size to 20, the batch size to 512, the number of negative samples to 50, and the NPMI threshold δ to 0.1. We repeated our experiments with 5 random seed initializations. All experiments were run on a single NVIDIA GeForce RTX 2080 GPU under the PyTorch framework.

Evaluation on Offline Construction
In this section, we test whether the offline construction part of our model preserves semantic capacity of terms well. Wikipedia Category can be considered tree-structured, where each edge is a hypernym-hyponym (broader/narrower) pair, so we can use these pairs for our evaluation. We first conduct our experiments on the semantic capacity comparison task with term set W_5: given a pair (x, y), determine whether the semantic capacity of x is higher than that of y. For each field, we evaluate the accuracy for all pairs (all), pairs with hypernym at the top 1 level (top 1), and pairs with hypernym at the top 2 levels (top 2). The results are shown in Table 2.
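Scoring on this task reduces to checking the predicted ordering for each hypernym-hyponym pair; the short helper below (our own naming) makes the metric explicit.

```python
def comparison_accuracy(sc, pairs):
    """Accuracy on the semantic capacity comparison task: for each
    hypernym-hyponym pair (x, y), the prediction is correct iff SC(x) > SC(y).

    sc:    dict term -> predicted semantic capacity
    pairs: list of (hypernym, hyponym) tuples
    """
    correct = sum(1 for x, y in pairs if sc[x] > sc[y])
    return correct / len(pairs)
```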
From the results, we find that the Lorentz model with NPMI significantly outperforms all the baselines, achieving satisfactory performance in all fields, especially for pairs with hypernym at the top 1 level. We should mention that disagreements exist in the evaluation. For instance, in Wikipedia Category, programming language theory is the parent class of programming language, and computational neuroscience is the parent class of artificial intelligence. However, people may also agree that programming language is the superclass of programming language theory and that artificial intelligence is broader than computational neuroscience.
Besides, the Lorentz model with NPMI shows a significant performance improvement over the variants of our model, which indicates the effectiveness of using filtered NPMI to characterize associations between terms and shows the superiority of placing terms in the hyperbolic space. In terms of training speed, taking the offline construction in computer science as an example, the Lorentz model with NPMI (30s) is also more efficient than the Lorentz model with co-occurrences (51s).
To compare with methods based on lexico-syntactic patterns, we also apply Hearst patterns (with extended patterns) (Hearst, 1992) to find hypernymy relations for physics terms. The result shows that only 2.5% (35/1393) of the hypernymy pairs are detected, i.e., it is almost impossible to measure semantic capacity of terms this way.
In addition to evaluating on the pairs, we introduce a metric to evaluate the performance in a different way. Since semantic capacity is not strictly divided by levels of terms, it is possible that the semantic capacity of a term at a higher level is lower than that of a term at a lower level. But in general, the average rank of terms at a higher level should be higher than that of terms at a lower level. Therefore, we use the average rank of terms at the top k levels (AR_k) as a metric to evaluate the performance, which is defined as

$$AR_k = \frac{1}{|T_k|} \sum_{x \in T_k} \frac{\mathrm{rank}(x)}{|W|},$$

where T_k denotes the set of terms at the top k levels, |W| denotes the cardinality of the term set, and rank(x) is the ranking of term x evaluated by the model (rank 1 corresponding to the highest semantic capacity). In other words, when k is small, the smaller AR_k, the better. For terms at the top 1 level, the metric is sensitive to misordered terms, and the value grows a lot when a term is ranked low. Again, semantic capacity is not strictly divided by levels of terms, but in general, terms at higher levels should have higher ranks (smaller in value). Results in Table 3 show that our model achieves the best performance, and the results are consistent with those of the semantic capacity comparison task.
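Under one natural reading of this definition (the mean rank of top-k-level terms, normalized by |W|), the metric can be computed as follows; this is our reconstruction, not code from the paper.

```python
def average_rank(sc, top_k_terms, all_terms):
    """AR_k: mean rank of the terms at the top k levels, normalized by |W|.

    sc:          dict term -> predicted semantic capacity
    top_k_terms: terms at the top k levels of the category tree
    all_terms:   the full term set W
    """
    # rank 1 = highest predicted semantic capacity
    ordered = sorted(all_terms, key=lambda t: sc[t], reverse=True)
    rank = {t: i + 1 for i, t in enumerate(ordered)}
    return sum(rank[t] for t in top_k_terms) / (len(top_k_terms) * len(all_terms))
```

With this normalization the value lies in (0, 1], and a single badly ranked top-level term noticeably inflates AR_1, matching the sensitivity noted above.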

Sensitivity to Term Set
The training process is affected by the term extraction process, so we want to examine the model's sensitivity with respect to the term set. For this purpose, we use W_5 and W_3 in each field as the term set respectively and conduct the semantic capacity comparison task for pairs with hypernym at the top 1 level and pairs with hypernym at the top 2 levels. From Figure 3, we can see that the results are relatively stable. On the one hand, compared to W_3, W_5 contains more terms, which means term set W_5 is more complete, but the training time also increases with the number of terms. On the other hand, since noise increases with the level in Wikipedia Category, W_5 contains more noisy terms than W_3. In short, how to choose the term set depends on many factors, such as the task we care about and the noise contained in the term set we acquire.

Evaluation on Online Query
In this section, experiments are conducted to validate the performance of the online query process in evaluating semantic capacity of newly added terms. We randomly select 100 hypernym-hyponym pairs at the top 3 levels of each evaluation set for the online query and use the remaining terms in W_3 for offline construction. We compare our model with human annotation by three groups of people, where each pair is labeled by three unique annotators. Details of human annotation are listed as follows:

• Human Annotation (Layman): Human annotation by workers on Amazon Mechanical Turk with "HIT Approval Rate" ≥ 95% (considered as high quality).
• Human Annotation (Professional): Human annotation by non-major students in the United States. Specifically, we ask math, computer science, physics students to conduct annotation tasks for physics, math, computer science, respectively.
• Human Annotation (Expert): Human annotation by corresponding major students.
From the results shown in Table 4, we find that our model far outperforms human annotation by laymen in all fields, and its performance is comparable to that of human annotation by professionals, with a small margin to experts. The results also imply that disagreements exist in the evaluation, since even experts cannot achieve accuracies close to 100%. Besides, for both our model and human annotation, the top 1 accuracy is usually higher than the top 2 accuracy, and the top 2 accuracy is higher than the accuracy for all pairs, which is in line with the common sense that semantic capacity of terms at the top levels is usually easier to evaluate. In short, the results demonstrate the effectiveness of our model for evaluating semantic capacity of newly added terms. Furthermore, our model can be applied to semantic capacity queries for terms that are not included in the offline process.

Related Work
Our work is related to research on lexical semantics (Cruse, 1986). Among the relations studied, hypernymy, also known as the is-a relation, has been studied for a long time. A well-known method is Hearst patterns (Hearst, 1992), which extract hypernymy pairs from a text corpus using hand-crafted lexico-syntactic patterns. Inspired by Hearst patterns, other pattern-based methods (Snow et al., 2005; Roller et al., 2018) have been proposed successively. On the other hand, hypernymy discovery based on distributional approaches has also attracted widespread interest (Weeds et al., 2004; Lenci and Benotto, 2012; Chang et al., 2018).
The techniques our model is based on are related to research on learning representations of symbolic data in hyperbolic space (Krioukov et al., 2010; Nickel and Kiela, 2017, 2018). Since text exhibits natural hierarchical structures, Dhingra et al. (2018) design a framework that learns word and sentence embeddings in an unsupervised manner from text corpora, Tifrea et al. (2019) propose Poincaré GloVe to learn word embeddings based on the GloVe algorithm in hyperbolic space, Aly et al. (2019) use Poincaré embeddings to improve existing methods for domain-specific taxonomy induction, and Le et al. (2019) propose a method to predict missing hypernymy relations and correct wrong extractions from Hearst patterns based on hyperbolic entailment cones (Ganea et al., 2018).

Conclusion
In this paper, we explore semantic capacity of terms. We first introduce the definition of semantic capacity and propose the Semantic Capacity Association Hypothesis. After that, we propose a two-step model to investigate semantic capacity of terms, which consists of the offline construction and the online query processes. The offline construction process places domain-specific terms in the hyperbolic space by our proposed Lorentz model with NPMI, and the online query process deals with user requirements, where semantic capacity is interpreted by norms of embeddings. Extensive experiments with datasets from three fields demonstrate the effectiveness and rationality of our model compared with well-designed baselines and human-level evaluations.
In addition, while semantic capacity studied in this paper is restricted to a specific domain, we believe the notion of semantic capacity can be extended to all terms in human language. Extending this scope will be future work.