Morph Call: Probing Morphosyntactic Content of Multilingual Transformers

The outstanding performance of transformer-based language models on a great variety of NLP and NLU tasks has stimulated interest in exploring their inner workings. Recent research has primarily focused on higher-level and complex linguistic phenomena such as syntax, semantics, world knowledge, and common sense. The majority of these studies are Anglocentric, and little is known about other languages, specifically their morphosyntactic properties. To this end, our work presents Morph Call, a suite of 46 probing tasks for four Indo-European languages of different morphology: Russian, French, English, and German. We propose a new type of probing task based on the detection of guided sentence perturbations. We use a combination of neuron-, layer-, and representation-level introspection techniques to analyze the morphosyntactic content of four multilingual transformers, including their understudied distilled versions. In addition, we examine how fine-tuning on a POS-tagging task affects the probing performance.


Introduction
In the last few years, transformer language models (Vaswani et al., 2017) have accelerated progress in the field of NLP. The models have established new state-of-the-art results in multiple languages and have even outperformed human solvers on NLU benchmarks (Raffel et al., 2020; Xue et al., 2020; He et al., 2020). Their distilled versions, or so-called student models, have shown competitive performance on many NLP tasks while having fewer parameters (Tsai et al., 2019). However, many questions remain about how these models work and what they know about language. Previous research focuses on what knowledge is learned during and after the pre-training phase (Chiang et al., 2020; Rogers et al., 2020a), and how it is affected by fine-tuning (Gauthier and Levy, 2019; Peters et al., 2019; Miaschi et al., 2020; Merchant et al., 2020). In addition, a wide variety of language phenomena has been investigated, including syntax (Hewitt and Manning, 2019a; Liu et al., 2019a), world knowledge (Petroni et al., 2019; Jiang et al., 2020), reasoning (van Aken et al., 2019), common-sense understanding (Klein and Nabi, 2019), and semantics (Ettinger, 2020).
Most of these studies involve probing, which measures how well linguistic knowledge can be inferred from the intermediate representations of the model. The methods range from individual neuron analysis (Dalvi et al., 2020; Durrani et al., 2020a), examination of attention mechanisms (Kovaleva et al., 2019; Vig and Belinkov, 2019), and correlation-based similarity measures (Wu et al., 2020), to probing tasks accompanied by linguistic supervision (Adi et al., 2016; Conneau et al., 2018).
Despite growing interest in interpreting the models, morphology has remained understudied, specifically for languages other than English. The majority of prior works on this subject are devoted to the introspection of machine translation models, word-level embedding models, or transformers fine-tuned for POS-tagging (see Section 2).
To this end, we introduce Morph Call, a probing suite for the exploration of morphosyntactic content in transformer language models. The contributions of this paper are summarized as follows. First, we propose 46 probing tasks in four Indo-European languages of different morphology: Russian, French, English, and German. Inspired by techniques for model acceptability judgments (Warstadt et al., 2019a) and adversarial training (Alzantot et al., 2018; Tan et al., 2020b,c), we present a new type of probing task based on the detection of guided sentence perturbations. Since the perturbations are generated automatically, the tasks can be adapted to other languages. Second, we use complementary probing methods to analyze four multilingual transformer encoders, including their distilled versions. We examine how fine-tuning for POS-tagging affects the probing performance and establish count-based and non-contextualized baselines for the tasks. Finally, we publicly release the tasks and code (https://github.com/morphology-probing/morph-call), hoping to fill the gaps in this less studied aspect of transformers.

Related Work
A large body of recent research is devoted to analyzing and interpreting the linguistic capacities of pre-trained contextualized encoders. The most common approach is to train a simple classifier to solve a probing task over the word- or sentence-level features produced by the models (Conneau et al., 2018; Liu et al., 2019a). The classifier's performance is used as a proxy to assess the model's knowledge of a particular linguistic property. However, the method has lately been critiqued: is the property truly learned by the model, or does the model merely encode it in a way the classifier can easily extract given the supervision? Moreover, the additional classifier parameters can make the results challenging to interpret (Hewitt and Liang, 2019; Hewitt and Manning, 2019b; Saphra and Lopez, 2019; Voita and Titov, 2020).
Nevertheless, probing classifiers are widely applied to model interpretation, including morphology. One of the first works on morphological content was carried out on machine translation models, where a classifier is trained to predict POS-tags in multiple languages (Belinkov et al., 2017, 2018). The latest studies of POS properties in transformers show that they are predominantly captured at the lower layers (Tenney et al., 2019b; Liu et al., 2019b; Rogers et al., 2020a), but can also be evenly distributed across all layers (Durrani et al., 2020b). Amnesic probing explores how removing information at a particular layer affects the probe performance at the final layer (Elazar et al., 2020). This allows measuring the importance of a layer with respect to a linguistic property. The results suggest that removing POS information affects the performance more at the higher layers than at the lower ones.
Another line of research is devoted to various linguistic phenomena at the juxtaposition of morphology, syntax, and semantics. LSTM-based models and transformers have been probed for capturing subject-verb agreement in different languages (Linzen et al., 2016; Giulianelli et al., 2018; Ravfogel et al., 2018; Goldberg, 2019). Recently, agreement has been at the core of inflectional perturbations for adversarial training (Tan et al., 2020a), and of linguistic acceptability judgments involving morphological, syntactic, and semantic violations (Warstadt et al., 2019b).
Our work is closely related to Edmiston (2020), who explores morphological properties and subject-verb agreement in the hidden representations and self-attention heads of transformer models. However, there are several differences. First, we investigate multilingual transformers and their distilled versions instead of monolingual ones. Second, we carry out the experiments on an extended set of tasks, such as detecting syntactic and inflectional perturbations (see Section 3.2). Third, we apply several probing methods to analyze the models from different perspectives. Finally, we study the impact of fine-tuning for POS-tagging on the probe performance. Despite the overlap, we consider the two studies complementary.
Finally, benchmarks such as LINSPECTOR (Şahin et al., 2020) and XTREME (Hu et al., 2020) provide means for evaluating multilingual embedding models and cross-lingual transfer methods with regard to multiple linguistic properties, specifically morphology.

Morphosyntactic Inventories
This paper investigates four Indo-European languages that fall under different morphological types: Russian, French, English, and German. Russian and French have fusional morphology, while English is an analytic language, and German exhibits peculiarities of fusional and agglutinative types. We consider the nominal morphosyntactic features of Number, Case, Person, and Gender. Even though the feature inventory is mostly shared across the languages, the latter differ significantly in their richness of morphology (Baerman, 2007).
The morphosyntactic inventories of the analyzed languages are outlined in Table 1.

Probing Tasks
Data We use sentences from the Universal Dependencies (UD) treebanks (Nivre et al., 2016) for all our probing tasks, keeping in mind possible inconsistencies between the treebanks (de Marneffe et al., 2017; Alzetta et al., 2017; Droganova et al., 2018) and the consequent differences in dataset sizes across languages. All sentences are filtered by a 5-to-25 token range, and each task is split into 80/10/10 train/val/test partitions with no sentence overlap. The partitions are balanced by the number of instances per target class. Notably, the availability of UD treebanks in many languages allows the method to be adapted to other languages. The treebanks used are listed in Appendix A, and brief statistics of the tasks are presented in Appendix B.
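The filtering-and-splitting step described above can be sketched as follows; this is a minimal illustration in which treebank loading and per-class balancing are omitted, and the function name is our own:

```python
import random

def make_splits(sentences, min_len=5, max_len=25, seed=0):
    """Filter sentences by a 5-to-25 token range and split them
    80/10/10 into train/val/test with no sentence overlap."""
    kept = [s for s in sentences if min_len <= len(s) <= max_len]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n = len(kept)
    return (kept[: int(0.8 * n)],
            kept[int(0.8 * n): int(0.9 * n)],
            kept[int(0.9 * n):])
```

Shuffling unique sentences before slicing guarantees the partitions are disjoint, matching the no-overlap constraint.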
Task Description We construct four groups of probing tasks framed as binary or multi-class classification tasks: Morphosyntactic Features, Masked Token, Morphosyntactic Values and Perturbations.
Morphosyntactic Features tasks probe the encoder for the occurrence of morphosyntactic properties. The goal is to detect whether a word exhibits a particular property based on its contextualized representation. Consider the Russian sentence 'The clock stopped in a month': the target words are marked in bold, and the labels denote whether they carry the category of Number.
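Gold labels for such tasks can be derived from the UD annotation; a minimal sketch that checks the CoNLL-U FEATS field (the function name and this exact labeling routine are illustrative, not the released code):

```python
def has_feature(feats: str, feature: str) -> int:
    """Binary label: 1 if the UD FEATS string mentions the feature.

    FEATS is a '|'-separated list of 'Feature=Value' pairs,
    with '_' marking an empty field, e.g. 'Case=Nom|Number=Sing'.
    """
    if feats == "_":
        return 0
    keys = {pair.split("=")[0] for pair in feats.split("|")}
    return int(feature in keys)
```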
Masked Token tasks are analogous to Morphosyntactic Features, except that the target word is replaced with a tokenizer-specific mask token. The tasks test whether the properties of the masked token can be recovered purely from the context. In the example above ('The clock stopped in a month.'), the target words are masked, and the labels denote the occurrence of the Number feature at the position of the mask.

Morphosyntactic Values is a group of k-way classification tasks for each feature, where k is the number of values the feature can take (see Table 1). For instance, the goal is to identify whether the word girl in 'The girl has either pink or brown.' is in the singular or plural form.

Perturbations tasks test the encoder's sensitivity to various sentence perturbations. Removing words from a text has recently been used to craft adversarial attacks (Liang et al., 2017; Li et al., 2018), whereas inflectional perturbations have been applied for adversarial training of transformers (Tan et al., 2020b,c). In contrast, we extend the perturbations to probe the encoders for linguistic knowledge. To this end, we construct eight tasks that involve syntactic perturbations and inflectional perturbations in the subject-predicate agreement and deictic words. Note that we apply a set of language-specific rules to control the quality of the error-generation procedure. To obtain the inflectional candidates, we use pymorphy2 for Russian (Korobov, 2015), lemminflect for English, and word paradigm tables from Wiktionary for French and German.
Stop-words Removal involves corruption of a syntax tree by removing stop-words. We use the stop-word lists provided by the NLTK library (Loper and Bird, 2002). Consider the French sentence 'Les Irakiens ont tout détruit à le Koweit' ('The Iraqis destroyed everything in Kuwait'), where the bolded words correspond to the removed stop-words.
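This perturbation amounts to filtering the token sequence against a stop-word list. A self-contained sketch, where a tiny inline list stands in for NLTK's `stopwords.words('french')` and the choice of listed words is purely illustrative:

```python
def remove_stopwords(tokens, stopwords):
    """Drop stop-word tokens while preserving the original order."""
    stop = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in stop]

# Illustrative stop-word list; in practice it would come from NLTK.
FR_STOPS = ["les", "ont", "tout", "à", "le"]
```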
Article Removal is a special case of the previous task, revealing whether the encoders are sensitive to discarded articles. This task is constructed only for French, English, and German. Note that such a perturbation may also distort the semantics of the sentence: 'It's on loan, by the way'.
Subject Number includes inflectional perturbations of the subject in the main clause with respect to Number: 'The girls has either pink or brown.'
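Generating such a perturbed form can be sketched with a toy paradigm table; in the actual pipeline, lemminflect, pymorphy2, or Wiktionary tables would supply the forms, so the table and function below are purely illustrative:

```python
# Toy singular -> plural table standing in for a real inflection lexicon.
NUMBER_TABLE = {"girl": "girls", "clock": "clocks"}

def perturb_subject_number(tokens, subject_idx):
    """Flip the Number of the subject token; return None if no
    alternative form is known for it."""
    inverse = {pl: sg for sg, pl in NUMBER_TABLE.items()}
    word = tokens[subject_idx]
    new_form = NUMBER_TABLE.get(word) or inverse.get(word)
    if new_form is None:
        return None
    perturbed = list(tokens)
    perturbed[subject_idx] = new_form
    return perturbed
```

Returning None when no form is known mirrors the need for language-specific rules to control generation quality: candidates without a reliable alternative form are simply skipped.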
Predicate Person comprises perturbations of the Person form of the predicate in the main clause. For instance, the Russian sentence Ya poedu v Moskvu 'I will go to Moscow' is perturbed so that the predicate appears in the second Person instead of the first.

Deictic Word Number involves perturbations generated by inflecting demonstrative pronouns (only in English and German). For example, the singular pronoun dieser 'this' is changed to the plural form diesen 'these' in the sentence Siehe zu dieser Technik auch 'See also this technique', yielding Siehe zu diesen Technik auch.

Models Each model under investigation has two instances for each language: 1. Fine-tuned model is a transformer model fine-tuned for POS-tagging. We use the UD treebanks and the HuggingFace library for fine-tuning. The data is randomly split into 80/10/10 train/val/test sets.
2. Pre-trained model is a non-tuned transformer model with frozen weights.

Probing Methods
Probing Classifiers We use Logistic Regression from the scikit-learn library (Pedregosa et al., 2011) as the probing classifier. The classifier is trained over the hidden representations produced by the encoders, with the L2 regularization parameter ∈ {0.25, 0.5, 1, 2, 4} tuned on the validation set. The performance is evaluated with the ROC-AUC score.
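Under this setup, the probe for one layer might look as follows; this is a sketch for binary tasks, and the solver settings, the function name, and the use of sklearn's `C` as the regularization knob are assumptions rather than the released code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def run_probe(X_train, y_train, X_val, y_val, X_test, y_test):
    """Fit one logistic-regression probe per regularization value,
    pick the best on the validation set, and report test ROC-AUC."""
    best_auc, best_clf = -1.0, None
    for c in [0.25, 0.5, 1, 2, 4]:  # grid from the paper
        clf = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_auc, best_clf = auc, clf
    return roc_auc_score(y_test, best_clf.predict_proba(X_test)[:, 1])
```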

Neuron Analysis
The neuron-level analysis retrieves the group of individual neurons that are most relevant for predicting a linguistic property (Durrani et al., 2020a). Similarly, a linear classifier is trained over concatenated mean-pooled word/sentence representations using Elastic-net regularization (Zou and Hastie, 2005), with the L1 and L2 weights λ ∈ {0.1, …, 1e-5} tuned on the validation set. The weights of the classifier are used to measure the relevance of each neuron.
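A minimal version of this weight-based ranking can be sketched as follows; the λ grid search is replaced by fixed values, and the function name is our own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def top_neurons(X, y, k=10, l1_ratio=0.5, C=1.0):
    """Train an elastic-net probe and rank neurons (features) by the
    maximum absolute weight they receive across classes."""
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=l1_ratio, C=C,
                             max_iter=5000).fit(X, y)
    relevance = np.abs(clf.coef_).max(axis=0)
    return np.argsort(relevance)[::-1][:k]
```

The L1 component drives irrelevant weights toward zero, so the surviving large weights single out the neurons that carry the property.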
Correlation Analysis Canonical correlation analysis (ckasim) is a representation-level similarity measure that identifies pairs of layers with similar behavior (Wu et al., 2020). We use [CLS]-pooled intermediate representations to analyze the encoders. The measure is computed with the publicly available code.
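For intuition, a linear CKA score between two layers' [CLS] representations can be computed in a few lines; this is a common variant of the measure, and the released ckasim code may differ in its details:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two sets of layer representations,
    each of shape (n_samples, hidden_dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

The score is 1 for identical representations and is invariant to orthogonal transformations, which makes it suitable for comparing layers of different models.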

Baselines
We train Logistic Regression over the following count-based and distributive baseline features (see Section 4.2). We use an N-gram range of [1, 4] for each count-based baseline. Count-based features include Char Number (the length of a word/sentence in characters), TF-IDF over character N-grams, TF-IDF over BPE tokens (BertTokenizer), and TF-IDF over SentencePiece tokens (XLMRobertaTokenizer). We use the multilingual tokenizers from the HuggingFace library to split words/sentences into sub-word tokens. The distributive baseline is mean-pooled monolingual
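The character-level count-based baseline above can be sketched as a short pipeline; the pipeline shape is an assumption, and the tokenizer-based variants would substitute pre-tokenized BPE/SentencePiece strings for raw text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_ngram_baseline(ngram_range=(1, 4)):
    """Count-based baseline: logistic regression over TF-IDF
    character n-grams in the [1, 4] range."""
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=ngram_range),
        LogisticRegression(max_iter=1000),
    )
```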

Masked Token
Probing Classifiers The results of the probing classifier on Masked Token tasks are presented in Tables 8 (pre-trained models) and 9 (fine-tuned models) (see Appendix D). These tasks prove more challenging than Morphosyntactic Features (see Section 5.1). An interesting observation in this setting is that the performance of the models predominantly drops or becomes unstable after fine-tuning. For instance, BERT may lose almost 10% on the tasks for Russian, and D-BERT may drop 5% on the tasks for French. The probing curves tend to show rapid increases and decreases across the layers. An exception to this pattern is XLM-R, which is less affected by fine-tuning and exhibits more stable probing behavior. Nevertheless, the models demonstrate their ability to infer the properties from the context: XLM-R makes correct predictions in almost 70% of cases, the performance of M-BERT and D-BERT is slightly worse, and MiniLM struggles the most.

Figure 1 outlines the results on the Case task for Russian, the best-solved among the tasks. The middle-to-higher layers account for more correct predictions in both model instances. However, the higher layers [10-12] of the 12-layer models and layer [6] of D-BERT show lower performance. A possible explanation is that these layers are shaped by the training objectives, i.e., Masked Language Modeling (pre-trained) or POS-tagging (fine-tuned). We find that the contextualized representations of a masked token produced by the final layers of pre-trained models may store the morphosyntactic properties. The probing curves demonstrate that the distribution of the properties may be affected by fine-tuning, or that the knowledge can be partially lost, as shown by the performance drops.

Morphosyntactic Values
Property-wise Neuron Analysis We apply property-wise neuron analysis to investigate the top-neurons for each morphosyntactic property (see Section 4.2). We find that some models require a larger group of neurons to learn a morphosyntactic property, and that the number of these neurons may change after fine-tuning. We provide the results for each language in Appendix E.

Perturbations
Probing Classifiers The results of the probing classifier on Perturbations tasks are presented in Table 10 (pre-trained models) and Table 11 (fine-tuned models) (see Appendix D). We find that the models perform on par with one another on the majority of the tasks. Notably, XLM-R is generally the most sensitive to the perturbations in each language. The syntactic perturbations (Article Removal, Stop-words Removal) are solved better than the inflectional ones. Similarly, the count-based baselines achieve their best performance on the syntactic perturbations, since the latter are produced over a limited set of words (see Table 5, Appendix C). In contrast, their performance on the inflectional perturbations is typically close to or only slightly above random (see Table 5, Appendix C). We briefly describe the results in Appendix D for the sake of space.
Layer-wise Neuron Analysis Individual neuron analysis helps to observe how top-neurons are spread across the entire model, and to identify the relevance of each layer by the number of its top-neurons (see Section 4.2).

Correlation Analysis We compare the layers of each model pair under ckasim over three combinations: (i) (pre-trained M, pre-trained M), (ii) (pre-trained M, fine-tuned M), and (iii) (fine-tuned M, fine-tuned M). Figure 3 shows the most typical pattern observed in the tasks. The biggest difference is observed for combination (ii), where the perturbations are best captured at the lower-to-middle layers [1-6] (XLM-R, MiniLM) or across all layers (M-BERT, DistilBERT). The middle-to-higher layers [7-12] tend to become more similar for combinations (i) and (iii), which may mean that they are able to restore the semantics of the perturbed sentences, being more robust to the perturbations than the lower layers.

Discussion
Morphosyntactic content across languages The probing curves under layer-wise probing demonstrate that the multilingual transformers learn the morphosyntactic content in a remarkably similar manner despite the language differences (see Section 5.1). The properties are predominantly distributed across the middle-to-higher layers [5-12] for each language. In contrast, Masked Token tasks represent a challenge for the models, causing rapid increases and decreases in performance across the layers (see Section 5.2). The overall pattern for each language is that a masked token's properties are best inferred at the middle-to-higher layers. A possible reason is that the task requires incorporating syntactic and semantic information from the context, since the target word remains unseen. The models demonstrate their sensitivity to Perturbations (see Section 5.4). While the syntactic perturbations are predominantly captured at the lower-to-middle layers [3-8], the inflectional ones are stored at the middle-to-higher layers [5-12]. In contrast to the other languages, the perturbation properties for English may be distributed across all layers of the models. The results are supported by the individual neuron analysis, an example of which is provided in Appendix E.
Same properties require different numbers of neurons Property-wise neuron analysis shows that Person and Case are learned with more neurons than Number and Gender across the languages. Notably, the number of neurons required to learn a property may depend on the language. For example, D-BERT requires about 1000 neurons to learn Case in German and less than 1500 neurons to learn the same property in Russian.
Are students good learners? A common way to compare pre-trained models and their distilled versions is their performance on downstream tasks (Tsai et al., 2019) or NLU benchmarks (Wang et al., 2018, 2019). Still, little is known about which language properties are preserved after knowledge distillation. We find that D-BERT and MiniLM mimic the behavior of their teachers under layer-wise probing (see Section 5.1) and display a similar perturbation sensitivity under ckasim (see Section 5.4). However, MiniLM tends to exhibit less stable behavior than its teacher (see Sections 5.2, 5.4).

Effect of fine-tuning
The results show that the effect of fine-tuning for POS-tagging varies within each group of tasks. First, fine-tuned models can improve probing performance by 2-4% on Morphosyntactic Features tasks (see Section 5.1). Second, fine-tuning affects the way the properties are distributed or causes significant performance drops on Masked Token tasks, specifically at the higher layers (see Section 5.2). The impact on the property distribution is also demonstrated on Perturbations tasks under the neuron-level probe (see Section 5.4). Moreover, the analysis of top-neurons suggests that fine-tuning may affect localization (MiniLM, XLM-R), which is in line with (Wu et al., 2020). Finally, the number of neurons required to predict a property may increase (e.g., Russian: Case; French: Person), decrease (e.g., English: Number), or remain unchanged (German). An interesting line for future work is to analyze the correlation between the number of neurons and the probe performance after fine-tuning. For instance, the results on Perturbations tasks indicate that some models may achieve a better probing performance with fewer (XLM-R) or more neurons (D-BERT, M-BERT) (see Section 5.4). An exploration of fine-tuning for morphosyntactic analysis, specifically over UniMorph (Kirov et al., 2018), may be a fruitful avenue for future work.
Distribution of knowledge may depend on language morphology The analysis of the models under layer-wise and neuron-wise probing suggests that the behavior may depend on how morphologically rich a language is (see Sections 5.1, 5.4). The knowledge for English tends to be distributed across all layers of the models in contrast to the more morphologically rich languages that capture the properties at the middle-to-higher layers. The finding is in line with a few recent studies (Edmiston, 2020;Durrani et al., 2020b;Elazar et al., 2020) which contradict the common understanding that morphology is stored at the lower layers (Tenney et al., 2019a;Rogers et al., 2020b). We also find that the distribution of the properties varies based on the complexity of a probing task (see Sections 5.1, 5.2). An exciting direction for future work is to test this hypothesis on a more diverse set of morphologically contrasting languages. Besides, perturbing one aspect of a sentence can cause ambiguity elsewhere which is an interesting line for future exploration of the interdependence of the perturbations.

Conclusion
This paper proposes Morph Call, a suite of 46 probing tasks in four Indo-European languages that differ significantly in their richness of morphology: Russian, French, English, and German. The suite includes a new type of probing task based on the detection of syntactic and inflectional sentence perturbations. We apply a combination of three introspection methods based on neuron-, layer-, and representation-level analysis to probe four multilingual transformer models, including their less explored distilled versions. Our analysis of this understudied aspect of transformers challenges common findings on how morphology is represented in the models. We find that the knowledge for English is predominantly distributed across all layers of the models, in contrast to the more morphologically rich languages (German, Russian, French), which house the properties at the middle-to-higher layers. The models demonstrate their sensitivity to the perturbations, and XLM-R tends to be the most robust among them. We observe that distilled models inherit their teachers' knowledge, showing comparable performance and exhibiting similar property distributions on several probing tasks. Another finding is that fine-tuning for POS-tagging can affect the model knowledge in various ways, ranging from improving or decreasing the probing classifier performance to changing the localization of the information. We believe there is still room for exploring the models' morphosyntactic content and the effect of fine-tuning, specifically across a more diverse set of languages and model architectures.

B Dataset Statistics
Tables 1-3 provide brief statistics on the partition sizes for each probing task.

C Baseline Performance

Table 4 summarizes the results of the baseline models on Morphosyntactic Features tasks, and Table 5 presents the performance of the baseline models on Perturbations tasks.

D Probing Classifiers
Morphosyntactic Features Tables 6-7 summarize the results of the probing classifier on Morphosyntactic Features tasks for pre-trained and fine-tuned models. Figure 1 shows a few examples of the model behavior on the tasks. While Gender in German appears to be the most challenging property for both pre-trained and fine-tuned models, Case in Russian is inferred by the models with great confidence.

Masked Token Tables 8-9 outline the performance of the probing classifier on Masked Token tasks.

Perturbations Tables 10-11 present the results of the probing classifier on Perturbations tasks for pre-trained and fine-tuned models. Figures 2-3 are graphical representations of the probing classifier performance on the Article Removal task for German and the Predicate Number task for French.
The overall pattern for the syntactic perturbations is that the sensitivity is captured at the lower-to-middle layers [3-8] of pre-trained models. In turn, the inflectional properties are predominantly distributed at the middle-to-higher layers [5-12] of both pre-trained and fine-tuned models. However, fine-tuned versions may exhibit unpredictable behavior, an example of which we describe below. Figure 2 demonstrates the results on the Article Removal task for German. While the probing curves of pre-trained models tend to decay after reaching their peak at the middle layers, they increase steadily toward the output layer after the fine-tuning phase. In contrast, a different behavior is observed on the Predicate Number task for French (see Figure 3). The layers of many fine-tuned models lose their knowledge (MiniLM:

F POS-Tagging Performance
Tables 12-15 describe the results of fine-tuning on the POS-tagging task for each language.