LINSPECTOR WEB: A Multilingual Probing Suite for Word Representations

We present LINSPECTOR WEB, an open source multilingual inspector to analyze word representations. Our system provides researchers working in low-resource settings with an easily accessible web-based probing tool to gain quick insights into their word embeddings, especially outside of the English language. To do this, we employ 16 simple linguistic probing tasks such as gender, case marking, and tense for a diverse set of 28 languages. We support probing of static word embeddings along with pretrained AllenNLP models that are commonly used for NLP downstream tasks such as named entity recognition, natural language inference, and dependency parsing. The results are visualized in a polar chart and also provided as a table. LINSPECTOR WEB is available as an offline tool or at https://linspector.ukp.informatik.tu-darmstadt.de.


Introduction
Natural language processing (NLP) has seen great progress after the introduction of continuous, dense, low dimensional vectors to represent text. The field has witnessed the creation of various word embedding models such as monolingual (Mikolov et al., 2013), contextualized (Peters et al., 2018), multi-sense (Pilehvar et al., 2017) and dependency-based (Levy and Goldberg, 2014); as well as adaptation and design of neural network architectures for a wide range of NLP tasks. Despite their impressive performance, interpreting, analyzing and evaluating such black-box models has been shown to be challenging, which even led to a series of workshops (Linzen et al., 2018).
Early works for evaluating word representations (Faruqui and Dyer, 2014a,b; Nayak et al., 2016) have mostly focused on English and used either word similarity or a set of downstream tasks. However, datasets for either of those tasks do not exist for many languages, word similarity tests do not necessarily correlate well with downstream tasks, and evaluating embeddings on downstream tasks can be too computationally demanding for low-resource scenarios. To address some of these challenges, Shi et al. (2016); Adi et al. (2017); Veldhoen et al. (2016); Conneau et al. (2018) have introduced probing tasks, a.k.a. diagnostic classifiers, that take as input a representation generated by a fully trained neural model and output predictions for a linguistic feature of interest. Due to its simplicity and low computational cost, probing has been employed by many studies summarized by Belinkov and Glass (2019), mostly focusing on English. Unlike most studies, Köhn (2015) introduced a set of multilingual probing tasks; however, its scope has been limited to syntactic tests and 7 languages. More importantly, it is not accessible as a web application and the source code does not have support to probe pretrained downstream NLP models out of the box.
Given the above information, most of the lower-resource non-English academic NLP communities still suffer from (1) the amount of human and computational resources required to search for the right model configuration, and (2) the lack of diagnostic tools to analyze their models to gain more insights into what is captured. Recently, Şahin et al. (2019) proposed 16 multilingual probing tasks along with the corresponding datasets and showed that they correlate well with certain downstream task performances. In this paper, we employ these datasets to develop LINSPECTOR WEB, which is designed to help researchers with low resources working on non-English languages to (1) analyze, interpret, and visualize various layers of their pretrained AllenNLP (Gardner et al., 2018) models at different epochs and (2) measure the performance of static word embeddings for language-specific linguistic properties. To the best of our knowledge, this is the first web application that (a) performs online probing; (b) enables users to upload their pretrained downstream task models to automatically analyze different layers and epochs; and (c) has support for more than 20 languages.

Previous Systems
A now retired evaluation suite for word embeddings was wordvectors.org (Faruqui and Dyer, 2014a). The tool provided evaluation and visualization for antonyms, synonyms, and female-male similarity; and later it was updated to support German, French, and Spanish word embeddings (Faruqui and Dyer, 2014b). For a visualization, the user could enter multiple tokens and would receive a 2-dimensional chart visualizing the cosine distance between the tokens. It was therefore limited by the number of tokens a human could enter and analyze. VecEval (Nayak et al., 2016) is another web-based suite for static English word embeddings that performs evaluation on a set of downstream tasks, which may take several hours. The visualization is similar to LINSPECTOR WEB, reporting both charts and a table. Neither web application supports probing of intermediate layers of pretrained models or the addition of multiple epochs. Köhn (2015) introduced an offline, multilingual probing suite for static embeddings limited in terms of the languages and the probing tasks. A comparison of the system features of previous studies is given in Table 1.

LINSPECTOR WEB
Our system is targeted at multilingual researchers working in low-resource settings. It is designed as a web application to enable such users to probe their word representations with minimal effort and computational requirements by simply uploading a file. The users can either upload their pretrained static embeddings files (e.g. word2vec (Mikolov et al., 2013), fastText (Bojanowski et al., 2016), GloVe (Pennington et al., 2014));1 or their pretrained archived AllenNLP models.2 In this version, we only support AllenNLP, due to its high usage rate in the low-resource community and its being up-to-date, i.e., containing state-of-the-art models for many NLP tasks and being continuously maintained at the Allen Institute for Artificial Intelligence (Gardner et al., 2018).
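To make the static embeddings upload more concrete, the following is a minimal sketch of how a text-format embeddings file could be validated line by line. The backend description states that lines matching the embedding dimension are kept and malformed data is removed; everything else here is an assumption for illustration, including the function name `clean_static_embeddings`, the space-separated layout, and the toy sample rows. It is not the code used by LINSPECTOR WEB.

```python
import io


def clean_static_embeddings(lines, dim):
    """Yield (token, vector) pairs for well-formed rows and skip the rest."""
    for line in lines:
        parts = line.rstrip().split(" ")
        # A valid row is one token followed by exactly `dim` numbers.
        if len(parts) != dim + 1:
            continue  # e.g. the "<vocab size> <dim>" header of word2vec text files
        try:
            vector = [float(value) for value in parts[1:]]
        except ValueError:
            continue  # a non-numeric field marks the row as malformed
        yield parts[0], vector


# Toy input (hypothetical values): a header, two valid rows, two malformed rows.
sample = io.StringIO(
    "3 4\n"
    "kitap 0.1 0.2 0.3 0.4\n"
    "broken 0.1 0.2\n"
    "ev 0.5 oops 0.7 0.8\n"
    "göz 1.0 -1.0 0.0 0.5\n"
)
cleaned = list(clean_static_embeddings(sample, dim=4))
print([token for token, _ in cleaned])
```

Checking the field count first keeps the filter cheap for large files, since the float conversion is only attempted on rows that already have the right width.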

Scope of Probing
We support 28 languages from very diverse language families. The multilingual probing datasets (Şahin et al., 2019) used in this system are language-specific, i.e., languages with a gender system are probed for gender, whereas languages with a rich case-marking system are probed for case. The majority of the probing tasks probe for morpho-syntactic properties (e.g. case, mood, person), which have been shown to correlate well with syntactic and semantic parsing for a number of languages, whereas a small number of tasks probe for surface (e.g. word length) or semantic level properties (e.g. pseudoword). Finally, there are two morphological comparison tasks (Odd-/Shared Morphological Feature) aiming to find the unique distinct/shared morphological feature between two tokens, which have been shown to correlate well with the NLI task. The current probing tasks are type-level (i.e. do not contain ambiguous words) and are filtered to keep only the frequent words. These tasks (1) are domain independent and (2) contain valuable information encoded via subwords in many languages (e.g. the Turkish word gelemeyenlerden "he/she is one of the folks
1 Since our probing datasets are publicly available, fastText embeddings for unknown words in our dataset can be generated by the user locally via the provided functions in (Şahin et al., 2019).
2 Such archives are generated for free when using a serialization directory during training.

Features: Models, Layers and Epochs
We support the following classifier architectures implemented by AllenNLP: BiaffineDependencyParser (Dozat and Manning, 2016), CrfTagger (Sutton et al., 2007), SimpleTagger (Gardner et al., 2018), and ESIM (Chen et al., 2017). BiaffineDependencyParser and CrfTagger are highlighted as the default choices for dependency parsing and named entity recognition by Gardner et al. (2018), while ESIM was picked as one of two available natural language inference models, and SimpleTagger support was added as the entry-level AllenNLP classifier for tasks like part-of-speech tagging.
The users can choose the layers they want to probe. This allows the users to analyze what linguistic information is captured by different layers of the model (e.g., POS information in lower layers, semantic information in higher layers). It is possible to select any AllenNLP encoder layer for classifiers with token, sentence, or document based input and for models with dual input (e.g. ESIM: premise, hypothesis) that allow probing of selected layers depending on their internal architecture, as described in Sec. 4.2. Additionally, a user can specify up to 3 epochs for probing to inspect what their model learns and forgets during training. This is considered a crucial feature for diagnosing pretrained models. For instance, a user diagnosing a pretrained NLI model can probe for the tasks that have been shown to correlate well with NLI (Mood, Person, Polarity, and Tense) (Şahin et al., 2019) for additional epochs, and analyze how their performance evolves during training. After the diagnostic classifiers are trained and tested on the specified language, model, layer, and epochs, the users are provided with (1) accuracies of each task visualized in a polar chart, (2) a table containing accuracy and loss for each probing test, and (3) in case of additional epochs, accuracies for other epochs overlaid on the chart and columns added to the table for easy comparison, as shown in Fig. 2-Right.
The uploaded model files are deleted immediately after probing; however, the results can be saved or shared via a publicly accessible URL.
The project is open source and easily extendable to additional languages, probing tasks and AllenNLP models. New languages can be added simply by adding train, dev, and test data for selected probing tasks and adding one database entry. Similarly, new probing tasks can be defined. In case the new tasks differ by input type, a custom AllenNLP dataset reader and classifier should be added. The system can be extended to new AllenNLP models by adding the matching predictor to the supported list or by writing a custom predictor if the model requires dual input values (e.g. ESIM). Finally, other frameworks (e.g. ONNX format) can be supported by adding a method to extract embeddings from the model.

System Description
LINSPECTOR WEB is based on the Python Django framework, which manages everything related to performance, security, scalability, and database handling.

Frontend
First, the user selects the language of the model and a number of probing tests they want to perform. The probing test selection menu will vary with the selected language. Next, the user has to upload an archived AllenNLP model or a static embeddings file. The input pipeline is shown in Fig. 1. The upload is handled asynchronously using custom AJAX code to support large files, prevent timeouts, and give the user some progress feedback. The backend detects if an uploaded file is an archived AllenNLP model and, if that is the case, provides a list of layers as shown in Fig. 2-Left. Probing is handled asynchronously by the backend. A JSON API endpoint gives progress feedback to the frontend, which displays a progress bar and the currently executed probing test to the user. Finally, results are displayed in an interactive chart and a table. For the user interface, we use the Bootstrap framework, which provides us with modern, responsive, and mobile compatible HTML and CSS. The visualization is done using the Highcharts library.
Figure 3: Backend architecture

Backend
The structure of the backend system is shown in Fig. 3 and the main components are explained below.
Layers: To get a list of layers, an archived AllenNLP model is loaded using a standard AllenNLP API. Since every AllenNLP classifier inherits from the PyTorch (Paszke et al., 2017) class torch.nn.Module, we can get a list of submodules using the named_children API. First, we extract high-level AllenNLP modules, including all Seq2SeqEncoder, Seq2VecEncoder, and FeedForward modules, by testing each submodule for a get_input_dim() method. Then we extract low-level modules, which can be either AllenNLP modules (e.g. AugmentedLstm) or PyTorch modules (e.g. Linear), by testing for the attributes input_size or in_features. All those modules are then returned as available probing layers. Because we require the input dimension later and there is no standard API for retrieving it, we have to exclude some submodules. Also, some classifiers take dual input values. Since we only provide a single input value, we cannot probe layers after the input is combined. For example, we can only probe ESIM at the first encoder layer, before both inputs (premise and hypothesis) are concatenated.
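The layer discovery described above can be sketched as follows. This is an illustrative approximation, not the LINSPECTOR WEB source: the function name `list_probing_layers` and the recursive traversal are assumptions, and the small `Module`, `Linear`, `Lstm`, `Encoder`, `Dropout`, and `Tagger` classes are hypothetical stand-ins that only mimic `named_children` so the sketch runs without PyTorch or AllenNLP installed. The function relies only on `named_children()`, `get_input_dim()`, `input_size`, and `in_features`, so it should also accept a real torch.nn.Module.

```python
def list_probing_layers(model, prefix=""):
    """Collect (dotted name, input dimension) for submodules that expose one."""
    layers = []
    for name, child in model.named_children():
        full_name = prefix + name
        if hasattr(child, "get_input_dim"):
            # High-level AllenNLP-style module (encoders, feed-forward).
            layers.append((full_name, child.get_input_dim()))
        elif hasattr(child, "input_size"):
            # Recurrent modules such as torch.nn.LSTM expose input_size.
            layers.append((full_name, child.input_size))
        elif hasattr(child, "in_features"):
            # torch.nn.Linear exposes in_features.
            layers.append((full_name, child.in_features))
        # Modules without a known input dimension are skipped, but their
        # children are still inspected.
        layers.extend(list_probing_layers(child, full_name + "."))
    return layers


class Module:
    """Hypothetical stand-in for torch.nn.Module (named_children only)."""

    def named_children(self):
        return [(k, v) for k, v in vars(self).items() if isinstance(v, Module)]


class Linear(Module):
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features


class Lstm(Module):
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size


class Encoder(Module):
    def __init__(self, input_dim, hidden_dim):
        self.rnn = Lstm(input_dim, hidden_dim)

    def get_input_dim(self):
        return self.rnn.input_size


class Dropout(Module):
    pass


class Tagger(Module):
    def __init__(self):
        self.dropout = Dropout()          # no input dimension: excluded
        self.encoder = Encoder(50, 32)    # high-level module
        self.projection = Linear(32, 5)   # low-level module


print(list_probing_layers(Tagger()))
```

Returning the input dimension together with the name reflects the remark that the dimension is needed later, when the extracted vectors are loaded into the probing classifier.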
Getting Embeddings: PyTorch modules allow us to register forward hooks. A hook is a callback which receives the module, input, and output every time an input is passed through the module. For AllenNLP models, we register such a callback on the selected encoding layer. Then, each time a token is passed through the model, it passes through the encoder and the callback receives the input vector. The most reliable way to pass tokens through a model is using AllenNLP predictors. There is a matching predictor for every model, and these predictors are regularly tested and updated. We gather all tokens from our intrinsic probing data and pass them through the predictor. For every token, the forward hook is called in the background, which then provides us with the vector. The token and vector are then written to a temporary file. During the embedding extraction, the progress is reported back to the frontend periodically in 30 steps. For static embeddings, all lines that match the embedding dimension are written to a temporary file and malformed data is removed.
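As a minimal sketch of the forward hook mechanism, the example below registers a hook on one layer of a toy PyTorch model and records the input that layer receives. The toy model, the token ids, and the name `capture_input` are assumptions made for illustration; the actual system attaches the hook to a layer of an uploaded AllenNLP model and feeds tokens through an AllenNLP predictor, which is not reproduced here.

```python
import torch
import torch.nn as nn

captured = []  # input tensors seen by the hooked layer


def capture_input(module, inputs, output):
    # `inputs` is a tuple holding the positional arguments of forward().
    captured.append(inputs[0].detach().clone())


# Toy stand-in for a model: an embedding layer followed by an "encoder".
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 3))

# Register the hook on the layer selected for probing.
handle = model[1].register_forward_hook(capture_input)

with torch.no_grad():
    model(torch.tensor([1, 2, 3]))  # three hypothetical token ids

handle.remove()  # detach the hook once extraction is finished

print(captured[0].shape)
```

Each row of the captured tensor is the vector that the selected layer received for one token, which is the representation that would be written out for probing.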
Probing: Finally, the gathered embeddings are loaded as a pretrained non-trainable embedding layer for a custom-built AllenNLP classifier with a single linear layer. The classifier is trained and evaluated using the intrinsic probing data for the specified probing test. We use 20 epochs with early stopping, a patience of 5, and gradient clipping of 0.5. The evaluation accuracy and loss are then returned. For contrastive probing tasks (Odd-/Shared Morphological Feature), a similar linear classifier that takes concatenated tokens as input is used.
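The training regime above (frozen embeddings, a single linear layer, at most 20 epochs, patience of 5, gradient clipping at 0.5) can be illustrated with a framework-free sketch. This is a simplified binary logistic probe in plain Python, not the AllenNLP classifier used by the system. The learning rate, clipping by gradient norm, early stopping on the dev loss, and the toy tokens, vectors, and labels are all assumptions made for the example.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return grad


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(weights, data, embeddings):
    """Return (mean log loss, accuracy) on (token, label) pairs."""
    loss, correct = 0.0, 0
    for token, label in data:
        x = embeddings[token]
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])
        loss -= math.log(p if label == 1 else 1.0 - p)
        correct += int((p >= 0.5) == (label == 1))
    return loss / len(data), correct / len(data)


def train_probe(train, dev, embeddings, dim,
                max_epochs=20, patience=5, clip=0.5, lr=1.0):
    weights = [0.0] * (dim + 1)  # linear layer; the last entry is the bias
    best_loss, best_weights, bad_epochs = float("inf"), list(weights), 0
    for _ in range(max_epochs):
        grad = [0.0] * (dim + 1)
        for token, label in train:
            x = embeddings[token] + [1.0]  # embeddings stay fixed
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - label) * xi / len(train)
        grad = clip_by_norm(grad, clip)
        weights = [w - lr * g for w, g in zip(weights, grad)]
        dev_loss, _ = evaluate(weights, dev, embeddings)
        if dev_loss < best_loss:
            best_loss, best_weights, bad_epochs = dev_loss, list(weights), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # early stopping
    return best_weights


# Toy frozen embeddings and labels (hypothetical, linearly separable).
embeddings = {
    "a1": [1.0, 0.2], "a2": [2.0, -0.5], "a3": [1.5, 1.0], "a4": [1.0, -1.0],
    "b1": [-1.0, 0.3], "b2": [-2.0, 0.5], "b3": [-1.5, -1.0], "b4": [-1.0, 1.0],
}
train = [("a1", 1), ("a2", 1), ("a3", 1), ("b1", 0), ("b2", 0), ("b3", 0)]
dev = [("a4", 1), ("b4", 0)]

probe = train_probe(train, dev, embeddings, dim=2)
dev_loss, dev_accuracy = evaluate(probe, dev, embeddings)
print(dev_loss, dev_accuracy)
```

Only the weights of the linear layer are updated, so the reported accuracy reflects what is linearly recoverable from the fixed vectors rather than what a more expressive classifier could learn on top of them.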
Asynchronous probing is handled using the Python Celery framework, the RabbitMQ message broker, and the Eventlet execution pool. When the user starts probing, a new Celery task is created in the backend, which executes all probing tasks specified by the user asynchronously and reports the progress back to the frontend. Finally, the results are saved in a PostgreSQL or SQLite database using the Django Celery Results application.
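The pattern of a background job whose progress can be polled as JSON can be sketched with the Python standard library alone. The example below deliberately swaps Celery, RabbitMQ, and Eventlet for a thread pool so that it runs without a message broker. The class `ProbingJob`, its fields, and the `placeholder_probe` function are hypothetical and are not part of the LINSPECTOR WEB code base.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


class ProbingJob:
    """Runs probing tests one after another and exposes pollable progress."""

    def __init__(self, tests):
        self.tests = list(tests)
        self._lock = threading.Lock()
        self._state = {"done": 0, "total": len(self.tests),
                       "current": None, "results": {}}

    def run(self, probe_fn):
        for test in self.tests:
            with self._lock:
                self._state["current"] = test
            result = probe_fn(test)  # the expensive part runs outside the lock
            with self._lock:
                self._state["results"][test] = result
                self._state["done"] += 1
        with self._lock:
            self._state["current"] = None

    def progress_json(self):
        # What a progress endpoint could return to the frontend.
        with self._lock:
            return json.dumps(self._state)


def placeholder_probe(test):
    # Stand-in for embedding extraction, training, and evaluation.
    return {"status": "finished"}


job = ProbingJob(["Case", "Gender", "Tense"])
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(job.run, placeholder_probe)
    future.result()  # a web frontend would poll progress_json() instead

final = json.loads(job.progress_json())
print(final["done"], final["total"])
```

Keeping the shared state behind a lock and serializing it on demand lets the polling side read a consistent snapshot while the worker is still running.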

Tests
We have trained BiaffineDependencyParser, CrfTagger, SimpleTagger, and ESIM AllenNLP models for Arabic, Armenian, Czech, French, and Hungarian with varying dimensions. We have tested the intrinsic probing data, layer selection, consistency of metrics, contrastive and non-contrastive classifiers, and all probing tests for multiple combinations of languages, dimensions, and AllenNLP models. Static embeddings are tested using pretrained fastText files for the same languages. In addition, the file upload was tested with files up to 8 GB over a DSL 50k connection.

Training Times
The LINSPECTOR WEB server is hosted in a university data center with a state-of-the-art internet connection, which allows for fast upload speeds. Therefore, the overall upload speed mostly depends on the user's connection. For a single probing task, embedding extraction, training, and evaluation together take a few minutes.

Conclusion
Researchers working on non-English languages under low-resource settings have, to date, lacked a tool that would assist with model selection by providing linguistic insights. To address this, we presented LINSPECTOR WEB, an open source, web-based evaluation suite with 16 probing tasks for 28 languages, which can probe pretrained static word embeddings and various layers of a number of selected AllenNLP models. The tool can easily be extended for additional languages, probing tasks, and AllenNLP models. LINSPECTOR WEB is available at https://linspector.ukp.informatik.tu-darmstadt.de and the source code for the server is released at https://github.com/UKPLab/linspector-web along with the installation instructions for the server. Probing tasks and the system will be extended to support contextual probing in the near future.

Figure 2: Left: Layer selection example, Right: Polar chart result shown for different epochs for pretrained Arabic BiaffineDependencyParser.