TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora

Embeddings of words and concepts capture syntactic and semantic regularities of language; however, they have seen limited use as tools to study characteristics of different corpora and how they relate to one another. We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings. TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface. We further propose a new measure of embedding confidence based on nearest neighborhood overlap, to assist in identifying high-quality embeddings for corpus analysis. A case study on COVID-19 scientific literature illustrates the utility of the system. TextEssence can be found at https://textessence.github.io.


Introduction
Distributional representations of language, such as word and concept embeddings, provide powerful input features for NLP models in part because of their correlation with syntactic and semantic regularities in language use (Boleda, 2020). However, the use of embeddings as a lens to investigate those regularities, and what they reveal about different text corpora, has been fairly limited. Prior work using embeddings to study language shifts, such as the use of diachronic embeddings to measure semantic change in specific words over time (Hamilton et al., 2016; Schlechtweg et al., 2020), has focused primarily on quantitative measurement of change, rather than interactive exploration of its qualitative aspects. On the other hand, prior work on interactive analysis of text collections has focused on analyzing individual corpora, rather than facilitating inter-corpus analysis (Liu et al., 2012; Weiss, 2014; Liu et al., 2019).
We introduce TextEssence, a novel tool that combines the strengths of these prior lines of research by enabling interactive comparative analysis of different text corpora. TextEssence provides a multiview web interface for users to explore the properties of and differences between multiple text corpora, all leveraging the statistical correlations captured by distributional embeddings. TextEssence can be used both for categorical analysis (i.e., comparing text of different genres or provenance) and diachronic analysis (i.e., investigating the change in a particular type of text over time).
Our paper makes the following contributions:
• We present TextEssence, a lightweight tool implemented in Python and the Svelte JavaScript framework, for interactive qualitative analysis of word and concept embeddings.
• We introduce a novel measure of embedding confidence to mitigate embedding instability and quantify the reliability of individual embedding results.
• We report on a case study using TextEssence to investigate diachronic shifts in the scientific literature related to COVID-19, and demonstrate that TextEssence captures meaningful month-to-month shifts in scientific discourse.
The remainder of the paper is organized as follows. §2 lays out the conceptual background behind TextEssence and its utility as a corpus analysis tool. In §3 and §4, we describe the nearest-neighbor analysis and user interface built into TextEssence. §5 describes our case study on scientific literature related to COVID-19, and §6 highlights key directions for future research.

Background
Computational analysis of text corpora can act as a lens into the social and cultural context in which those corpora were produced (Nguyen et al., 2020). Diachronic word embeddings have been shown to reflect important context behind the corpora they are trained on, such as cultural shifts (Kulkarni et al., 2015; Hamilton et al., 2016; Garg et al., 2018), world events (Kutuzov et al., 2018), and changes in scientific and professional practice (Vylomova et al., 2019). However, these analyses have proceeded independently of work on interactive tools for exploring embeddings, which are typically limited to visual projections (Zhordaniya et al.; Warmerdam et al., 2020). TextEssence combines these directions into a single general-purpose tool for interactively studying differences between any set of corpora, whether categorical or diachronic.

From words to domain concepts
When corpora of interest are drawn from specialized domains, such as medicine, it is often necessary to shift analysis from individual words to domain concepts, which serve to reify the shared knowledge that underpins discourse within these communities. Reified domain concepts may be referred to by multi-word surface forms (e.g., "Lou Gehrig's disease") and by multiple distinct surface forms (e.g., "Lou Gehrig's disease" and "amyotrophic lateral sclerosis"), making them more semantically powerful than traditional word-level representations but also posing distinct challenges.
A variety of embedding algorithms have been developed for learning representations of domain concepts and real-world entities from text, including weakly-supervised methods requiring only a terminology (Newman-Griffis et al., 2018); methods using pre-trained NER models for noisy annotation (De Vine et al., 2014; Chen et al., 2020); and methods leveraging explicit annotations of concept mentions (as in Wikipedia) (Yamada et al., 2020).¹ These algorithms capture valuable patterns about concept types and relationships that can inform corpus analysis (Runge and Hovy, 2020).
TextEssence only requires pre-trained embeddings as input, so it can accommodate any embedding algorithm suiting the needs and characteristics of specific corpora (e.g. availability of annotations or knowledge graph resources). Furthermore, while the remainder of this paper primarily refers to concepts, TextEssence can easily be used for word-level embeddings in addition to concepts.

Why static embeddings?
Contextualized, language model-based embeddings can provide more discriminative features for NLP than static (i.e., non-contextualized) embeddings. However, static embeddings have several advantages for this comparative use case. First, they are less resource-intensive than contextualized models, and can be efficiently trained several times, without pre-training, to focus entirely on the characteristics of a given corpus. Second, the scope of what static embedding methods are able to capture from a given corpus has been well-established in the literature, but remains an area of active investigation for contextualized models (Jawahar et al., 2019; Zhao and Bethard, 2020). Finally, the nature of contextualized representations makes them best suited for context-sensitive tasks, while static embeddings capture aggregate patterns that lend themselves to corpus-level analysis. Nevertheless, as work on qualitative and visual analysis of contextualized models grows (Hoover et al., 2020), new opportunities for comparative analysis of local contexts will open fascinating directions for future research.

¹ The significant literature on learning embeddings from knowledge graph structure is omitted here for brevity.

Identifying Stable Embeddings for Analysis
While embeddings are a well-established means of capturing syntax and semantics from natural language text (Boleda, 2020), the problem of comparing multiple sets of embeddings remains an active area of research. The typical approach is to consider the nearest neighbors of specific points, consistent with the "similar items have similar representations" intuition of embeddings. This method also avoids the conceptual difficulties and low replicability of comparing embedding spaces numerically (e.g., by cosine distances) (Gonen et al., 2020). However, even nearest neighborhoods are often unstable, and vary dramatically across runs of the same embedding algorithm on the same corpus (Wendlandt et al., 2018; Antoniak and Mimno, 2018). In a setting such as our case study, the relatively small sub-corpora we use (typically fewer than 100 million tokens each) exacerbate this instability. Therefore, to quantify variation across embedding replicates and identify informative concepts, we introduce a measure of embedding confidence.

We define embedding confidence as the mean overlap between the top $k$ nearest neighbors of a given item across multiple embedding replicates. Formally, let $E_1, \ldots, E_m$ be $m$ embedding replicates trained on a given corpus, and let $\mathrm{kNN}_i(c)$ be the set of $k$ nearest neighbors by cosine similarity of concept $c$ in replicate $E_i$. The embedding confidence EC@k is then defined as the mean pairwise neighborhood overlap:

$$\mathrm{EC@}k(c) = \frac{2}{m(m-1)} \sum_{i=1}^{m} \sum_{j=i+1}^{m} \frac{|\mathrm{kNN}_i(c) \cap \mathrm{kNN}_j(c)|}{k}$$

We can then define the set of high-confidence concepts for the given corpus as the set of all concepts with an embedding confidence above a given threshold. A higher threshold will restrict analysis to highly-stable concepts only, but exclude the majority of embeddings. We recommend an initial threshold of 0.5, which can be adjusted based on the observed quality of the filtered embeddings.
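As a concrete illustration, EC@k can be computed in a few lines of NumPy. The sketch below assumes each replicate is a matrix whose rows are concept vectors over a shared vocabulary; function and variable names are illustrative, not taken from the TextEssence codebase.

```python
import numpy as np
from itertools import combinations

def knn_set(emb: np.ndarray, c: int, k: int) -> set:
    """Indices of the k nearest neighbors of row c by cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[c]
    sims[c] = -np.inf                     # exclude the concept itself
    return set(np.argsort(-sims)[:k])

def embedding_confidence(replicates, c, k=5):
    """EC@k: mean pairwise overlap of c's k-NN sets across replicates."""
    pairs = list(combinations(range(len(replicates)), 2))
    overlaps = [
        len(knn_set(replicates[i], c, k) & knn_set(replicates[j], c, k)) / k
        for i, j in pairs
    ]
    return sum(overlaps) / len(pairs)
```

Identical replicates yield an EC@k of 1.0 for every concept, while replicates with completely disjoint neighborhoods yield 0.0.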

Computing aggregate nearest neighbors
After filtering for high-confidence concepts, we summarize nearest neighbors across replicates by computing aggregate nearest neighbors. The aggregate neighbor set of a concept $c$ is the set of high-confidence concepts with the highest average cosine similarity to $c$ over the embedding replicates. More precisely, let $D_i(c)$ be the vector of cosine distances between concept $c$ and all concepts in the embedding vocabulary,³ as calculated in replicate $E_i$. We then calculate the aggregate pairwise distance vector for $c$ as the mean over replicates:

$$D_{\mathrm{Agg}}(c) = \frac{1}{m} \sum_{i=1}^{m} D_i(c)$$

The $k$ aggregate nearest neighbors of $c$, $\mathrm{kNN}_{\mathrm{Agg}}(c)$, are then the $k$ high-confidence concepts with the lowest values in $D_{\mathrm{Agg}}(c)$. This provides a more reliable picture of the concept's nearest neighbors, while excluding concepts whose neighbor sets are uncertain.

³ All embedding replicates trained on a given corpus share the same vocabulary.
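A minimal sketch of this aggregation, under the same assumptions as before (row-indexed concept matrices over a shared vocabulary; illustrative names, not the TextEssence API):

```python
import numpy as np

def aggregate_nearest_neighbors(replicates, c, high_conf, k=5):
    """k aggregate nearest neighbors of concept c: average the cosine
    distance vectors D_i(c) over replicates, then take the k lowest-
    distance concepts restricted to the high-confidence set."""
    dist_vectors = []
    for E in replicates:
        normed = E / np.linalg.norm(E, axis=1, keepdims=True)
        dist_vectors.append(1.0 - normed @ normed[c])  # D_i(c)
    d_agg = np.mean(dist_vectors, axis=0)              # D_Agg(c)
    candidates = [i for i in high_conf if i != c]
    return sorted(candidates, key=lambda i: d_agg[i])[:k]
```

Restricting the candidate pool to `high_conf` is what excludes concepts whose own neighbor sets are unstable.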

The TextEssence Interface
The workflow for using TextEssence to compare different corpora is illustrated in Figure 2. Given the set of corpora to compare, the user (1) trains embedding replicates on each corpus; (2) identifies the high-confidence set of embeddings for each corpus; and (3) provides these as input to TextEssence. TextEssence then offers three modalities for interactively exploring their learned representations: (1) Browse, an interactive visualization of the embedding space; (2) Inspect, a detailed comparison of a given concept's neighbor sets across corpora; and (3) Compare, a tool for analyzing the pairwise relationships between two or more concepts.

Browse: visualizing embedding changes
The first interface presented to the user is an overview visualization of one of the embedding spaces, projected into 2-D using t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008). High-confidence concepts are depicted as points in a scatter plot and may be color-coded based on pre-existing groupings, as in our case study (§5). The user can select a point to highlight its aggregated nearest neighbors in the high-dimensional space, an interaction similar to TensorFlow's Embedding Projector (Smilkov et al., 2016) that helps distinguish true neighbors from artifacts of the dimensionality reduction process.
The Browse interface also expands upon existing dimensionality reduction tools by enabling visual comparison of multiple corpora (e.g., embeddings from individual months). This is challenging because the embedding spaces are trained separately, and can therefore differ greatly in both high-dimensional and reduced representations. While previous work on comparing projected data has focused on algorithmically aligning projections (Liu et al., 2020; Chen et al., 2018) and adding new comparison-focused visualizations (Cutura et al., 2020), we chose to align the projections using a simple Procrustes transformation and enable the user to compare them using animation.
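One plausible implementation of such an alignment (a sketch under our own assumptions, not the exact TextEssence code) fits a Procrustes transform, translation, uniform scaling, and an optimal rotation, that maps the destination projection onto the source. In practice the transform would be fit on concepts present in both projections and then applied to all points of the destination.

```python
import numpy as np

def procrustes_align(source: np.ndarray, dest: np.ndarray) -> np.ndarray:
    """Map the 2-D projection `dest` onto `source` (both n x 2, rows
    corresponding to the same concepts) via translation, uniform
    scaling, and an optimal rotation/reflection."""
    mu_s, mu_d = source.mean(axis=0), dest.mean(axis=0)
    S, D = source - mu_s, dest - mu_d
    norm_s, norm_d = np.linalg.norm(S), np.linalg.norm(D)
    S, D = S / norm_s, D / norm_d
    # Orthogonal Procrustes: rotation R maximizing trace(R^T D^T S).
    U, sigma, Vt = np.linalg.svd(D.T @ S)
    R = U @ Vt
    scale = sigma.sum()  # optimal uniform scale for the normalized data
    return ((dest - mu_d) / norm_d) @ R * (scale * norm_s) + mu_s
```

A projection that is merely a rotated, scaled, and shifted copy of the source is mapped back onto it exactly, so only genuine changes in relative concept positions survive the alignment and drive the animation.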
When the user hovers over a corpus thumbnail, preview lines are shown between the positions of each concept in the current and destination corpora. The direction of each line is disambiguated by increasing its width as the line approaches its destination. In addition, the width and opacity of each point's preview line are proportional to the fraction of the point's aggregate nearest neighbors that differ between the source and destination corpora. This serves to draw attention to the concepts that shift the most. Upon clicking the corpus thumbnail, the points smoothly follow their trajectory lines to form the destination plot. Furthermore, when a concept is selected, the user can opt to center the visualization on that point and then transition between corpora, revealing how neighboring concepts move relative to the selected one.

Inspect: tracking individual concept change
Once a particular concept of interest has been identified, the Inspect view presents an interactive table depicting how that concept's aggregated nearest neighbors have changed over time. This view also displays other contextualizing information about the concept, including its definitions (derived from the UMLS (Bodenreider, 2004) for our case study⁴), the terms used to refer to the concept (limited to SNOMED CT for our case study), and a visualization of the concept's embedding confidence over the sub-corpora analyzed. For completeness, we display nearest neighbors from every corpus analyzed, even in corpora where the concept was not designated high-confidence (note that a concept must be high-confidence in at least one corpus to be selectable in the interface). In these cases, a warning is shown that the concept itself is not high-confidence in that corpus; the neighbors themselves are still drawn exclusively from the high-confidence set.

Compare: tracking pair similarity
The Compare view facilitates analysis of the changing relationship between two or more concepts across corpora (e.g. from month to month). This view displays paired nearest neighbor tables, one per corpus, showing the aggregate nearest neighbors of each of the concepts being compared. An adjacent line graph depicts the similarity between the concepts in each corpus, with one concept specified as the reference item and the others serving as comparison items (similar to Figure 3). Similarity between two concepts for a specific corpus is calculated by averaging the cosine similarity between the corresponding embeddings in each replicate.
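The per-corpus similarity behind the line graph can be sketched directly from the definition above (illustrative names; one replicate list per corpus, concepts indexed into a shared vocabulary):

```python
import numpy as np

def pair_similarity(replicates, a, b):
    """Cosine similarity between concepts a and b, averaged over replicates."""
    sims = []
    for E in replicates:
        va, vb = E[a], E[b]
        sims.append(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))
    return sum(sims) / len(sims)

def similarity_series(corpus_replicates, reference, comparison):
    """One similarity value per corpus (e.g., per month), as plotted in
    the Compare view's line graph for a reference/comparison pair."""
    return [pair_similarity(reps, reference, comparison)
            for reps in corpus_replicates]
```

Averaging over replicates before plotting smooths out the run-to-run noise discussed in §3, so the line graph reflects corpus-level trends rather than training variance.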

Case Study: Diachronic Change in CORD-19
The scale of global COVID-19-related research has led to an unprecedented rate of new scientific findings, including a developing understanding of the complex relationships between drugs, symptoms, comorbidities, and health outcomes for COVID-19 patients. We used TextEssence to study how the contexts of medical concepts in COVID-19-related scientific literature have changed over time. Table 1 shows the number of new articles indexed in the COVID-19 Open Research Dataset (CORD-19; Wang et al. (2020a)) from its beginning in March 2020 to the end of October 2020; while the addition of new sources over time led to occasional jumps in corpus volume, all monthly corpora are sufficiently large for embedding training. For our case study, we created disjoint sub-corpora containing the new articles indexed in CORD-19 each month. The monthly corpora were tokenized using ScispaCy (Neumann et al., 2019), and concept embeddings were trained using JET (Newman-Griffis et al., 2018), a weakly-supervised concept embedding method that does not require explicit corpus annotations. As our terminology for concept embedding training, we used SNOMED Clinical Terms (SNOMED CT), a widely-used reference terminology for concepts used in clinical reporting, in its March 2020 interim International Edition release, which included COVID-19 concepts. We trained JET embeddings with vector dimensionality d = 100 and 10 iterations, reflecting the relatively small size of each corpus, and used 10 replicates per monthly corpus with a high-confidence threshold of 0.5 for EC@5.
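Under the settings above (10 replicates per month, EC@5, threshold 0.5), selecting the high-confidence set for a monthly corpus reduces to the following sketch (illustrative names; not the exact TextEssence pipeline code):

```python
import numpy as np
from itertools import combinations

def high_confidence_set(replicates, k=5, threshold=0.5):
    """All concepts whose EC@k (mean pairwise k-NN overlap across
    replicates) meets the threshold; k=5 and threshold=0.5 mirror
    the CORD-19 case-study settings."""
    normed = [E / np.linalg.norm(E, axis=1, keepdims=True) for E in replicates]
    n = normed[0].shape[0]
    knn = []
    for N in normed:
        sims = N @ N.T
        np.fill_diagonal(sims, -np.inf)          # exclude self-neighbors
        knn.append(np.argsort(-sims, axis=1)[:, :k])
    pairs = list(combinations(range(len(knn)), 2))
    keep = set()
    for c in range(n):
        overlaps = [len(set(knn[i][c]) & set(knn[j][c])) / k for i, j in pairs]
        if sum(overlaps) / len(pairs) >= threshold:
            keep.add(c)
    return keep
```

The resulting set, one per monthly corpus, is what the Browse, Inspect, and Compare views take as input.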

Findings
TextEssence captures a number of shifts in CORD-19 that reflect how COVID-19 science has developed over the course of the pandemic. Table 2 highlights key findings from our preliminary investigation into concepts known a priori to be relevant. While full nearest-neighbor tables are omitted due to space limitations, they can be reproduced by downloading our code and following the included guide to inspecting the CORD-19 results.
44169009 Anosmia While associations of anosmia (loss of the sense of smell) with COVID-19 were observed early in the pandemic (e.g., Hornuss et al. (2020), posted in May 2020), it took time for anosmia to begin to be utilized as a diagnostic variable (Talavera et al., 2020; Wells et al., 2020). Anosmia's nearest neighbors reflect this, staying stably in the region of other otolaryngological concepts until October (when Talavera et al. (2020) and Wells et al. (2020), inter alia, were indexed in CORD-19), at which point we observe a marked shift in utilization toward playing a role similar to other common symptoms of COVID-19.
116568000 Dexamethasone The corticosteroid dexamethasone was recognized early as valuable for treating severe COVID-19 symptoms (Lester et al. (2020), indexed July 2020), and its role has remained stable since (Ahmed and Hassan (2020), indexed October 2020). This is reflected in the shift of its nearest neighbors from prior contexts of traumatic brain injury (Moll et al., 2020) to a stable neighborhood of other drugs used for COVID-19 symptoms. However, in September 2020, 702806008 Ruxolitinib emerges as Dexamethasone's nearest neighbor. This reflects a spike in literature investigating the use of ruxolitinib for severe COVID-19 symptom management (Gozzetti et al., 2020; Spadea et al., 2020; Li and Liu, 2020). As the similarity graph in Figure 3 shows, the contextual similarity between dexamethasone and ruxolitinib steadily increases over time, reflecting the growing recognition of ruxolitinib's new utility (Caocci and La Nasa (2020), indexed May 2020).
83490000 Hydroxychloroquine Hydroxychloroquine, an anti-malarial drug, was misleadingly promoted as a potential treatment for COVID-19 by President Trump in March, May, and July 2020, leading to widespread misuse of the drug (Englund et al. (2020); Rahmani et al. (2020), all indexed August 2020). This shift is reflected in the neighbors of Hydroxychloroquine, which add investigative outcomes such as nosocomial (hospital-acquired) infections and respiratory failure to the expected anti-malarial neighbors.

Discussion
Our case study on scientific literature related to COVID-19 demonstrates that TextEssence can be used to study diachronic shifts in the usage of domain concepts. We highlight three directions for future work using TextEssence: mining for new shifts and associations in a changing literature (§6.1); applications beyond comparative analysis of corpora (§6.2); and further investigation of embedding confidence as a tool for analysis (§6.3).

Mining shifts in the literature
While our primary focus in developing TextEssence was on its use as a qualitative tool for targeted inquiry, diachronic embeddings have significant potential for knowledge discovery through quantitative measurement of semantic differences. For example, new embeddings could be generated for subsequent months of CORD-19 (or other corpora) and analyzed to determine which concepts have shifted the most (indicating current trends) or which concepts are just starting to shift (suggesting potential future developments).
However, quantitative, vector-based comparison of embedding spaces faces significant conceptual challenges, such as a lack of appropriate alignment objectives and empirical instability (Gonen et al., 2020). While nearest neighbor-based change measurement has been proposed (Newman-Griffis and Fosler-Lussier, 2019;Gonen et al., 2020), its efficacy for small corpora with limited vocabularies remains to be determined. Our novel embedding confidence measure offers a step in this direction (see §6.3 for further discussion), but further research is needed.

Other applications of TextEssence
A previous study on medical records (Newman-Griffis and Fosler-Lussier, 2019) showed that the technologies behind TextEssence can be used for categorical comparison as well as analysis of temporal shifts. More broadly, the use of TextEssence is not limited to comparison of text corpora alone. In settings where multiple embedding strategies are available, such as learning representations of domain concepts from text sources (Beam et al., 2020; Chen et al., 2020), knowledge graphs (Grover and Leskovec, 2016), or both (Yamada et al., 2020; Wang et al., 2020b), TextEssence can be used to study the different regularities captured by competing algorithms, providing insight into the utility of different approaches. TextEssence can also function as a tool for studying the properties of different terminologies for domain concepts, something not previously explored in the computational literature.
In addition, the TextEssence interface can provide utility for other types of analyses as well. For example, the Inspect and Compare portions of the interface could be used to interact with topic models learned from different corpora. These components are largely agnostic to the nature of the underlying data, and could be extended for studying a variety of different types of NLP models.

Confidence estimation in embedding analysis
The relatively constrained size of corpora in our analysis motivated our novel embedding confidence measure, to help separate differences due to random effects in embedding training from differences in concept usage patterns. We used a fixed confidence threshold for our analyses; however, increasing or decreasing the threshold for high-confidence embeddings will affect both the set of reported neighbors and the visualization of the embedding space, and can inform the user of TextEssence which observations are more or less stable. We highlight varying this threshold as an important area for future investigation with TextEssence.
More broadly, prior work by Wendlandt et al. (2018), Antoniak and Mimno (2018), and Gonen et al. (2020), among others, has also shown embedding stability to be a concern in models trained on larger corpora than those used in this work. However, the role of random embedding effects on previous qualitative studies using word embeddings (e.g., Kulkarni et al. (2015), Hamilton et al. (2016)) has not been evaluated. A broader investigation of embedding confidence measures in qualitative studies will be invaluable in the continued development of embedding technologies as a tool for linguistics research.

Conclusion
TextEssence is an interactive tool for comparative analysis of word and concept embeddings. Our implementation and experimental code is available at https://github.com/drgriffis/text-essence, and the database derived from our CORD-19 analysis is available at https://doi.org/10.5281/zenodo.4432958. A screencast of TextEssence in action is available at https://youtu.be/1xEEfsMwL0k. All associated resources for TextEssence may be found at https://textessence.github.io.