Going Beyond T-SNE: Exposing whatlies in Text Embeddings

We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings. The project offers a unified and extensible API with current support for a range of popular embedding backends including spaCy, tfhub, huggingface transformers, gensim, fastText and BytePair embeddings. The package combines a domain specific language for vector arithmetic with visualisation tools that make exploring word embeddings more intuitive and concise. It offers support for many popular dimensionality reduction techniques as well as a range of interactive visualisations that can either be statically exported or shared via Jupyter notebooks. The project documentation is available from https://rasahq.github.io/whatlies/.


Introduction
The use of pre-trained word embeddings (Mikolov et al., 2013a; Pennington et al., 2014) or language model based sentence encoders (Peters et al., 2018; Devlin et al., 2019) has become a ubiquitous part of NLP pipelines and end-user applications in both industry and academia. At the same time, a growing body of work has established that pre-trained embeddings codify the underlying biases of the text corpora they were trained on (Bolukbasi et al., 2016; Garg et al., 2018; Brunet et al., 2019). Hence, practitioners need tools to help select which set of embeddings to use for a particular project, detect potential need for debiasing and evaluate the debiased embeddings. Simplified visualisations of the latent semantic space provide an accessible way to achieve this.
Therefore we created whatlies, a toolkit offering a programmatic interface that supports vector arithmetic on a set of embeddings and visualises the space after any operations have been carried out. For example, Figure 1 shows how representations for queen, king, man, and woman can be projected along the axes v_queen−king and v_man|queen−king in order to derive a visualisation of the space along the projections.

Figure 1: Projections of w_king, w_queen, w_man, w_queen − w_king, and w_man projected away from w_queen − w_king. Both the vector arithmetic and the visualisation were done using whatlies. The support for arithmetic expressions is integral to whatlies because it leads to more meaningful visualisations and concise code.
Perhaps the most widely known tool for visualising embeddings is the tensorflow projector, which offers 3D visualisations of any input embeddings. These visualisations are useful for understanding the emergence of clusters, the neighbourhood of certain words, and the overall structure of the space. However, the projector is limited to dimensionality reduction as the sole preprocessing method. More recently, Molino et al. (2019) have introduced parallax, which allows explicit selection of the axes on which to project a representation. This creates an additional level of flexibility, as these axes can also be derived from arithmetic operations on the embeddings.
The major difference between the tensorflow projector, parallax and whatlies is that the first two provide a non-extensible browser-based interface, whereas whatlies provides a programmatic one. Therefore whatlies can more easily be extended to specific practical needs and individual use-cases. The goal of whatlies is to offer a set of tools that can be used from a Jupyter notebook with a range of visualisation capabilities that go beyond the commonly used static T-SNE (van der Maaten and Hinton, 2008) plots. whatlies can be installed via pip, the code is available from https://github.com/RasaHQ/whatlies and the documentation is hosted at https://rasahq.github.io/whatlies/.

What lies in whatlies - Usage and Examples
Embedding backends. The current version of whatlies supports word-level as well as sentence-level embeddings in any human language that is supported by the following libraries:

• BytePair embeddings (Sennrich et al., 2016) via the BPemb project (Heinzerling and Strube, 2018)
• fastText (Bojanowski et al., 2017)
• gensim (Řehůřek and Sojka, 2010)
• huggingface (Wolf et al., 2019)
• sense2vec (Trask et al., 2015), via spaCy

Retrieved embeddings are python objects that contain a vector and an associated name. Each comes with utility methods attached that allow for easy arithmetic and visualisation.
The library is capable of retrieving embeddings for sentences too. In order to retrieve a sentence representation from word-level embeddings such as fastText, whatlies returns the summed representation of the individual word vectors. For pre-trained encoders such as BERT (Devlin et al., 2019) or ConveRT (Henderson et al., 2019), whatlies uses the model's internal [CLS] token for representing a sentence.
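The sum-pooling step used for word-level backends can be pictured as plain vector addition. A minimal numpy sketch; the word vectors below are toy values chosen for illustration, not real fastText output:

```python
import numpy as np

# Toy word vectors standing in for a word-level backend such as fastText.
vectors = {
    "the": np.array([0.1, 0.3, -0.2]),
    "red": np.array([0.4, -0.1, 0.5]),
    "car": np.array([-0.3, 0.2, 0.1]),
}

def sentence_embedding(sentence):
    """Sum the word vectors, mirroring whatlies' pooling for word-level backends."""
    return np.sum([vectors[w] for w in sentence.split()], axis=0)

emb = sentence_embedding("the red car")
```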
This feature allows users to construct custom queries and use them, e.g., in combination with the similarity retrieval functionality. For example, we can validate the widely circulated analogy of Mikolov et al. (2013b): king − man + woman ≈ queen. Excluding the query word king, the analogy returns the anticipated result: queen.
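The arithmetic behind such an analogy query can be sketched with plain numpy. The vectors below are constructed by hand so that the analogy holds exactly; real embeddings only approximate this behaviour:

```python
import numpy as np

# Hand-built toy vectors: royalty and gender as separate, orthogonal directions.
royal = np.array([1.0, 0.0, 0.0])
male = np.array([0.0, 1.0, 0.0])
female = np.array([0.0, 0.0, 1.0])
vocab = {
    "king": royal + male,
    "queen": royal + female,
    "man": male,
    "woman": female,
}

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman, then rank the vocabulary by cosine similarity,
# excluding the query word itself.
query = vocab["king"] - vocab["man"] + vocab["woman"]
best = max((w for w in vocab if w != "king"), key=lambda w: cosine(vocab[w], query))
```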
The library not only allows the user to add and subtract embeddings, but also to project them onto other embeddings (via the > operator) or away from them (via the | operator). This gives the user a great deal of flexibility when retrieving embeddings.
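The two operators correspond to standard linear-algebra operations: projection onto a vector, and rejection (removing the component along a vector). A numpy sketch under those usual definitions:

```python
import numpy as np

def project_onto(a, b):
    """Component of a along b (what the > operator computes)."""
    return (a @ b) / (b @ b) * b

def project_away(a, b):
    """a with its component along b removed (what the | operator computes)."""
    return a - project_onto(a, b)

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])
onto = project_onto(a, b)    # the part of a that lies along b
away = project_away(a, b)    # the remainder, orthogonal to b
```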
Multilingual Support. whatlies supports any human language that is available from its current list of supported embedding backends. This allows us to check the royal analogy from above in languages other than English. The code snippet below shows the results for Spanish and Dutch, using pre-trained fastText embeddings.

from whatlies.language import FasttextLanguage

es = FasttextLanguage("cc.es.300.bin")
nl = FasttextLanguage("cc.nl.300.bin")
emb_es = es["rey"] - es["hombre"] + es["mujer"]
emb_nl = nl["koning"] - nl["man"] + nl["vrouw"]

While for Spanish the correct answer reina is only at rank 3 (excluding rey from the list), the second-ranked monarca (monarch) is getting close. For Dutch, the correct answer koningin is at rank 2, surpassed only by koningen (plural of king). Another interesting observation is that the cosine distances, even of the query words, vary wildly between the embeddings for the two languages.
The lang object in the code snippet below is currently driven by spaCy.

emb = lang[words]
It is often more useful to analyse a set of embeddings at once, rather than many individual ones. Therefore, any arithmetic operations that can be applied to single embeddings, can also be applied to all of the embeddings in a given set.
The emb variable in the previous code example represents an EmbeddingSet. These are collections of embeddings which can be simpler to analyse than many individual variables. Users can, for example, apply vector arithmetic to the entire EmbeddingSet. whatlies also offers interactive visualisations using Altair as a plotting backend:

emb.plot_interactive(x_axis="man",
                     y_axis="yellow",
                     show_axis_point=True)

The above code snippet projects every vector in the EmbeddingSet onto the vectors on the specified axes. This creates the values we can use for 2D visualisations. For example, given that man is on the x-axis, the value for yellow on that axis will be:

v(yellow → man) = (w_yellow · w_man) / (w_man · w_man)

which results in Figure 3. These plots are built on top of Altair (VanderPlas et al., 2018) and are fully interactive: it is possible to click and drag in order to navigate through the embedding space and to zoom in and out. The plots can be hosted on a website, but they can also be exported to png/svg for publication. It is furthermore possible to apply any vector arithmetic operations before plotting.

Transformations. whatlies also supports several techniques for dimensionality reduction of EmbeddingSets prior to plotting. This is demonstrated in Figure 5 below.

from whatlies.transformers import Pca
from whatlies.transformers import Umap

p1 = (emb
      .transform(Pca(2))
      .plot_interactive())
p2 = (emb
      .transform(Umap(2))
      .plot_interactive())
p1 | p2

Transformations in whatlies are slightly different from, for example, scikit-learn transformations: in addition to dimensionality reduction, the transformation can also add embeddings that represent each principal component to the EmbeddingSet object. As a result, these components can be referred to as axes when creating visualisations, as seen in Figure 5.
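The reduction step underlying a transformer like Pca can be sketched without the library at all. A minimal two-component PCA via the SVD, with random toy vectors standing in for an EmbeddingSet:

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    # The right singular vectors of the centered matrix are the principal axes,
    # ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 50))  # 20 toy 50-dimensional "word vectors"
coords = pca_2d(embeddings)             # 2D points ready for plotting
```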
Scikit-Learn Integration. To facilitate quick exploration of different word embeddings we have also made our library compatible with scikit-learn (Pedregosa et al., 2011). The whatlies library uses numpy (Harris et al., 2020) to represent the numerical vectors associated with the input text. This means that it is possible to use the whatlies embedding backends as feature extractors in scikit-learn pipelines, as the code snippet below shows:

from whatlies.language import BytePairLanguage
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("embed", BytePairLanguage("en")),
    ("model", LogisticRegression())
])

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]
y = np.array([1, 1, 1, 0, 0, 0])

Subsequently, the new EmbeddingSet can be visualised as a distance map as in Figure 6, revealing a number of spurious correlations that suggest a gender bias in the embedding space.
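Because the embedding backends follow the scikit-learn transformer interface, any transformer that maps a list of texts to a numeric matrix can take their place. A runnable sketch of the same pipeline, with CountVectorizer as a stand-in for a whatlies backend such as BytePairLanguage:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# CountVectorizer stands in for a whatlies embedding backend here:
# both turn a list of strings into a 2D feature matrix for the classifier.
pipe = Pipeline([
    ("embed", CountVectorizer()),
    ("model", LogisticRegression()),
])

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written",
]
y = np.array([1, 1, 1, 0, 0, 0])

pipe.fit(X, y)
preds = pipe.predict(X)
```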

emb_of_pairs.plot_distance(metric="cosine")
Visualising issues in the embedding space like this creates an effective way to communicate potential risks of using embeddings in production to non-technical stakeholders. It is possible to apply the debiasing technique introduced by Bolukbasi et al. (2016) in order to approximately remove the direction corresponding to gender; this can again be expressed with the arithmetic notation. It is important to note, though, that the above technique does not reliably remove all relevant bias in the embeddings, and that bias remains measurably present in the embedding space, as Gonen and Goldberg (2019) have shown. As the output shows, the neighbourhoods of maid in the biased and debiased space are almost equivalent, with e.g. mistress still appearing relatively high up the nearest-neighbours list.
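The core of the Bolukbasi et al. (2016) step is a projection away from the gender direction, which is easy to state in numpy. The vectors below are random stand-ins, not real embeddings:

```python
import numpy as np

def remove_direction(vec, direction):
    """Subtract the component of vec that lies along direction
    (the hard-debiasing projection)."""
    return vec - (vec @ direction) / (direction @ direction) * direction

rng = np.random.default_rng(1)
gender = rng.normal(size=50)  # stand-in for a gender direction, e.g. w_man - w_woman
word = rng.normal(size=50)    # stand-in for an occupation word vector
debiased = remove_direction(word, gender)
```

After the projection, the debiased vector is orthogonal to the gender direction; as the surrounding text stresses, this does not mean all gender information is gone from the space.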
Comparing Embedding Backends. Another use-case for whatlies is comparing different embedding backends. For example, we wanted to analyse two different encoders for their ability to capture the intent of user utterances in a task-based dialogue system. We compared spaCy and the Universal Sentence Encoder for their ability to embed sentences from the same intent class close together in space. Figure 8 shows that the utterances encoded with the Universal Sentence Encoder form more coherent clusters.

Figure 8: spaCy and the Universal Sentence Encoder embedding example sentences from 3 different intent classes. The Universal Sentence Encoder embeds the sentences into relatively tight and coherent clusters, whereas class boundaries are more difficult to see with spaCy.

Figure 9 highlights the same trend with a distance map: where for spaCy there is barely any similarity between the utterances, the coherent clusters from Figure 8 are well reflected in the distance map for the Universal Sentence Encoder.
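A distance map of this kind is just a pairwise cosine-distance matrix over the sentence embeddings. A numpy sketch, with random vectors standing in for encoded utterances:

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T  # 0 on the diagonal, larger = less similar

rng = np.random.default_rng(2)
utterances = rng.normal(size=(5, 16))  # 5 toy sentence embeddings
D = cosine_distance_matrix(utterances)
```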
The superiority of the Universal Sentence Encoder over spaCy in this example is expected, as it is aimed at sentences, but it is certainly useful to have a tool such as whatlies at one's disposal with which one can quickly validate this.

Roadmap
whatlies is in active development. While we cannot predict the contents of future community PRs, this is our current roadmap for future development:

• We want to make it easier for people to research bias in word embeddings. We will continue to investigate if there are visualisation techniques that can help spot issues, and we aim to make any robust debiasing techniques available in whatlies.
• We would like to curate labelled sets of word lists for attempting to quantify the amount of bias in a given embedding space. Properly labelled word lists can be useful for algorithmic bias research but it might also help understand clusters. We plan to make any evaluation resources available via this package.
• One limit of using Altair as a visualisation library is that we cannot offer interactive visualisations with many thousands of data points. We might explore other visualisation tools for this library as well.
• Since we're supporting dynamic backends like BERT at the sentence level, we are aiming to also support these encoders at the word level, which requires us to specify an API for retrieving contextualised word representations within whatlies. We are currently exploring various ways of exposing this feature and are working with a notation that uses square brackets to select an embedding from the context of the sentence it resides in. At the moment we only support spaCy backends with this notation, but we plan to explore it further with other embedding backends. Ideally we would also introduce the necessary notation for retrieving the contextualised embedding from a particular layer, e.g. lang['bank'][2] for obtaining the representation of bank from the second layer of the given language model.
• A related issue is that not every vocabulary-based backend uses the same method of pooling word embeddings to represent a sentence. Some take the sum, while others take the mean, and others introduce yet another standard. Our goal for vocabulary-based backends is to allow the user to control this manually for consistency.

Conclusion
We have introduced whatlies, a python library for inspecting word and sentence embeddings that gains its flexibility from offering a programmatic interface. We currently support a variety of embedding models, including fastText, spaCy, BERT, and the Universal Sentence Encoder. This paper has showcased its current use as well as plans for future development. The project is hosted at https://github.com/RasaHQ/whatlies and we are happy to receive community contributions that extend and improve the package.