Linguistic Profiling of a Neural Language Model

In this paper we investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after a fine-tuning process and how this knowledge affects its predictions during several classification problems. We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks. We also find that BERT’s capacity to encode different kinds of linguistic properties has a positive influence on its predictions: the more readable linguistic information about a sentence it stores, the better it predicts the expected label assigned to that sentence.


Introduction
Neural Language Models (NLMs) have become a central component in NLP systems over the last few years, showing outstanding performance and improving the state-of-the-art on many tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). However, the introduction of such systems has come at the cost of interpretability and, consequently, at the cost of obtaining meaningful explanations when automated decisions take place.
Recent work has begun to study these models in order to understand whether they encode linguistic phenomena even without being explicitly designed to learn such properties (Marvin and Linzen, 2018; Goldberg, 2019; Warstadt et al., 2019). Much of this work focused on the analysis and interpretation of attention mechanisms (Tang et al., 2018; Jain and Wallace, 2019; Clark et al., 2019) and on the definition of probing models trained to predict simple linguistic properties from unsupervised representations.
Probing models trained on different contextual representations provided evidence that such models are able to capture a wide range of linguistic phenomena (Adi et al., 2016; Perone et al., 2018; Tenney et al., 2019b) and even to organize this information in a hierarchical manner (Belinkov et al., 2017; Lin et al., 2019; Jawahar et al., 2019). However, the way in which this knowledge affects the decisions they make when solving specific downstream tasks has been less studied.
In this paper, we extended prior work by studying the linguistic properties encoded by one of the most prominent NLMs, BERT (Devlin et al., 2019), and how these properties affect its predictions when solving a specific downstream task. We defined three research questions aimed at understanding: (i) what kind of linguistic properties are already encoded in a pre-trained version of BERT and where across its 12 layers; (ii) how the knowledge of these properties is modified after a fine-tuning process; (iii) whether this implicit knowledge affects the ability of the model to solve a specific downstream task, i.e. Native Language Identification (NLI). To tackle the first two questions, we adopted an approach inspired by the 'linguistic profiling' methodology put forth by van Halteren (2004), which assumes that wide counts of linguistic features automatically extracted from parsed corpora allow modeling a specific language variety and detecting how it changes with respect to other varieties, e.g. complex vs simple language, female vs male-authored texts, texts written in the same L2 language by authors with different L1 languages. Particularly relevant for our study is that multi-level linguistic features have been shown to have a highly predictive role in tracking the evolution of learners' linguistic competence across time and developmental levels, both in first and second language acquisition scenarios (Lubetich and Sagae, 2014; Miaschi et al., 2020).
Given the strong informative power of these features to encode a variety of language phenomena across stages of acquisition, we assume that they can also be helpful to dig into the issues of interpretability of NLMs. In particular, we would like to investigate whether features successfully exploited to model the evolution of language competence can be similarly helpful in profiling how the implicit linguistic knowledge of an NLM changes across layers and before and after tuning on a specific downstream task. We chose the NLI task, i.e. the task of automatically classifying the L1 of a writer based on his/her language production in a learned language (Malmasi et al., 2017). As shown by Cimino et al. (2018), linguistic features play a very important role when NLI is tackled as a sentence-classification task rather than as a traditional document-classification task. This is the reason why we considered the sentence-level NLI classification as a task particularly suitable for probing the NLM linguistic knowledge. Finally, we investigated whether and which linguistic information encoded by BERT is involved in discriminating the sentences correctly or incorrectly classified by the fine-tuned models. To this end, we tried to understand if the linguistic knowledge that the model has of a sentence affects the ability to solve a specific downstream task involving that sentence.
Contributions In this paper: (i) we carried out an in-depth linguistic profiling of BERT's internal representations; (ii) we showed that contextualized representations tend to lose their precision in encoding a wide range of linguistic properties after a fine-tuning process; (iii) we showed that the linguistic knowledge stored in the contextualized representations of BERT positively affects its ability to solve NLI downstream tasks: the more information about these features BERT stores, the better it predicts the correct label.

Related Work
In the last few years, several methods have been devised to obtain meaningful explanations regarding the linguistic information encoded in NLMs (Belinkov and Glass, 2019). They range from techniques to examine the activations of individual neurons (Karpathy et al., 2015; Li et al., 2016; Kádár et al., 2017) to more domain specific approaches, such as interpreting attention mechanisms (Raganato and Tiedemann, 2018; Kovaleva et al., 2019; Vig and Belinkov, 2019), studying correlations between representations (Saphra and Lopez, 2019) or designing specific probing tasks that a model can solve only if it captures a precise linguistic phenomenon using the contextual word/sentence embeddings of a pre-trained model as training features (Conneau et al., 2018; Zhang and Bowman, 2018; Hewitt and Liang, 2019; Miaschi and Dell'Orletta, 2020). These latter studies demonstrated that NLMs are able to encode a variety of language properties in a hierarchical manner (Belinkov et al., 2017; Blevins et al., 2018; Tenney et al., 2019b) and even to support the extraction of dependency parse trees (Hewitt and Manning, 2019). Jawahar et al. (2019) investigated the representations learned at different layers of BERT, showing that lower layer representations are usually better for capturing surface features, while embeddings from higher layers are better for syntactic and semantic properties. Using a suite of probing tasks, Tenney et al. (2019a) found that the linguistic knowledge encoded by BERT through its 12/24 layers follows the traditional NLP pipeline: POS tagging, parsing, NER, semantic roles and then coreference. Liu et al. (2019), instead, quantified differences in the transferability of individual layers between different models, showing that higher layers of RNNs (ELMo) are more task-specific (less general), while transformer layers (BERT) do not exhibit this increase in task-specificity.

Our Approach
To probe the linguistic knowledge encoded by BERT and understand how it affects its predictions in several classification problems, we relied on a suite of 68 probing tasks, each of which corresponds to a distinct feature capturing lexical, morpho-syntactic and syntactic properties of a sentence. Specifically, we defined three sets of experiments. The first consisted in probing the linguistic information learned by a pre-trained version of BERT (BERT-base, cased) using gold sentences annotated according to the Universal Dependencies (UD) framework (Nivre et al., 2016). In particular, we defined a probing model that uses BERT contextual representations for each sentence of the dataset and predicts the actual value of a given linguistic feature across the internal layers. The second set of experiments consisted in investigating variations in the encoded linguistic information between the pre-trained model and 10 different fine-tuned ones obtained by training BERT on as many Native Language Identification (NLI) binary tasks.
To do so, we performed all probing tasks again using the 10 fine-tuned models. For the last set of experiments, we investigated how the linguistic competence contained in the models affects the ability of BERT to solve the NLI downstream tasks.

Data
We used two datasets: (i) the UD English treebank (version 2.4) for probing the linguistic information learned before and after a fine-tuning process; (ii) a dataset used for the NLI task, which is exploited both for fine-tuning BERT on the downstream task and for reproducing the probing tasks in the third set of experiments.
UD dataset It includes three UD English treebanks: UD English-ParTUT, a conversion of a multilingual parallel treebank consisting of a variety of text genres, including talks, legal texts and Wikipedia articles (Sanguinetti and Bosco, 2015); the Universal Dependencies version annotation from the GUM corpus (Zeldes, 2017); the English Web Treebank (EWT), a gold standard universal dependencies corpus for English (Silveira et al., 2014). Overall, the final dataset consists of 23,943 sentences.
NLI dataset We used the 2017 NLI shared task dataset, i.e. the TOEFL11 corpus (Blanchard et al., 2013). It contains test responses from 13,200 test takers (one essay and one spoken response transcription per test taker) and includes 11 native languages (L1s) with 1,200 test takers per L1. We selected only written essays and we created pairwise subsets of essays written by Italian L1 native speakers and essays for all the other languages. At the end of this process, we obtained 10 datasets of 2,400 documents (33,756 sentences on average): 1,200 for the Italian L1 speakers and 1,200 for each of the other L1s included in the TOEFL11 corpus.

Probing Tasks and Linguistic Features
Our experiments are based on the probing tasks approach defined in Conneau et al. (2018), which aims to capture linguistic information from the representations learned by an NLM. In our study, each probing task consists in predicting the value of a specific linguistic feature automatically extracted from the parsed sentences in the NLI and UD datasets. The set of features is based on the ones described in Brunato et al. (2020), which are acquired from raw, morpho-syntactic and syntactic levels of annotation and can be categorised in 9 groups corresponding to different linguistic phenomena. As shown in Table 1, these features model linguistic phenomena ranging from raw text ones, to morpho-syntactic information and inflectional properties of verbs, to more complex aspects of sentence structure modeling global and local properties of the whole parsed tree and of specific subtrees, such as the order of subjects and objects with respect to the verb, the distribution of UD syntactic relations, also including features referring to the use of subordination and to the structure of verbal predicates.
Table 2: BERT ρ scores (average between layers) for all the linguistic features (All) and for the 9 groups corresponding to different linguistic phenomena. Baseline scores are also reported.

Models
NLM We relied on the pre-trained English version of BERT (BERT-base cased, 12 layers, 768 hidden units) for both the extraction of contextual embeddings and the fine-tuning process for the NLI downstream task. To obtain the embedding representations for our sentence-level tasks we used, for each of its 12 layers, the activation of the first input token ([CLS]), which somehow summarizes the information from the actual tokens, as suggested in Jawahar et al. (2019).
Probing model As mentioned above, each of our probing tasks consists in predicting the actual value of a given linguistic feature given the inner sentence representations learned by an NLM for each of its layers. Therefore, we used a linear Support Vector Regression (LinearSVR) as probing model.

Profiling BERT
Our first experiments investigated what kind of linguistic phenomena are encoded in a pre-trained version of BERT. To this end, for each of the 12 layers of the model (from input layer -12 to output layer -1), we first represented each sentence in the UD dataset using the corresponding sentence embeddings according to the criterion defined in Sec. 3.3. We then performed for each sentence representation our set of 68 probing tasks using the LinearSVR model. Since most of our probing features are strongly correlated with sentence length, we compared the probing model results with the ones obtained with a baseline computed by measuring the Spearman's rank correlation coefficient (ρ) between the length of the UD dataset sentences and the corresponding probing values. The evaluation is performed with a 5-fold cross validation and using Spearman correlation (ρ) between predicted and gold labels as evaluation metric. As a first analysis, we probed BERT's linguistic competence with respect to the 9 groups of probing features. Table 2 reports BERT (average between layers) and baseline scores for all the linguistic features and for the 9 groups corresponding to different linguistic phenomena. As a general remark, we can notice that the scores obtained by BERT's internal representations always outperform the ones obtained with the correlation baseline. For both BERT and the baseline, the best results are obtained for groups including features highly sensitive to sentence length. For instance, this is the case of syntactic features capturing global aspects of sentence structure (Tree structure). However, differently from the baseline, the abstract representations of BERT are also very good at predicting features related to other linguistic information such as morpho-syntactic (POS, Verb inflection) and syntactic ones, e.g. the structure of verbal predicate and the order of nuclear sentence elements (Order).
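The probing and baseline computations described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names `probe_feature` and `length_baseline` are hypothetical, the embeddings are synthetic stand-ins for the [CLS] activations of one layer, and computing ρ over pooled out-of-fold predictions (rather than averaging per-fold scores) is an assumption, since the text does not specify it.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVR


def probe_feature(embeddings, gold_values, n_splits=5, seed=0):
    """Spearman's rho between gold values and out-of-fold LinearSVR predictions."""
    predictions = np.zeros(len(gold_values), dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(embeddings):
        # Hyperparameters are illustrative assumptions, not reported settings.
        model = LinearSVR(max_iter=10000, random_state=seed)
        model.fit(embeddings[train_idx], gold_values[train_idx])
        predictions[test_idx] = model.predict(embeddings[test_idx])
    rho, _ = spearmanr(gold_values, predictions)
    return rho


def length_baseline(sentence_lengths, gold_values):
    """Baseline: rho between sentence length and the gold feature value."""
    rho, _ = spearmanr(sentence_lengths, gold_values)
    return rho


# Synthetic data standing in for one layer of sentence embeddings.
rng = np.random.default_rng(0)
n_sentences, hidden_size = 300, 16
embeddings = rng.normal(size=(n_sentences, hidden_size))
weights = rng.normal(size=hidden_size)
gold_values = embeddings @ weights + rng.normal(scale=0.1, size=n_sentences)
sentence_lengths = rng.integers(5, 40, size=n_sentences)

probe_rho = probe_feature(embeddings, gold_values)
baseline_rho = length_baseline(sentence_lengths, gold_values)
print(f"probe rho = {probe_rho:.2f}, length baseline rho = {baseline_rho:.2f}")
```

In the actual setting, this procedure would be repeated for each of the 12 layers and each of the 68 features, yielding the layerwise ρ scores discussed in this section.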
We then focused on how BERT's linguistic competence changes across layers. These results are reported in Figure 1, where we see that the average layerwise ρ scores are lower in the last layers both for all distinct groups and for all features together. As suggested by Liu et al. (2019), this could be due to the fact that the representations that are better-suited for language modeling (output layer) are also those that exhibit worse probing task performance, indicating that Transformer layers trade off between encoding general and probed features. However, there are differences between the considered groups: knowledge of raw text features (RawText) and of the distribution of POS is lost in the very first layers (by layer -10), while the knowledge about the order of subject/object with respect to the verb, the use of subordination, as well as features related to verbal predicate structure is acquired in the middle layers.
Table 3: NLI classification results in terms of accuracy. We used the Zero Rule algorithm as baseline. Note that, for each task, sentences of the 10 languages are paired with the Italian ones (e.g. KOR = KOR-ITA).
Interestingly, if we consider how the knowledge of each feature changes across layers (Figure 2), we observe that not all features belonging to the same group have a homogeneous behaviour. This is for example the case of the two features included in the RawText group: word length (char per tok) achieves considerably lower scores across all layers than the sent length feature. Similarly, the knowledge about POS differs when we consider more granular distinctions. For instance, within the broad categories of verbs and nouns, worse predictions are obtained by sub-specific classes of verbs based on tense, person and mood features (see especially past participle, xpos dist VBN), and by inflected nouns both singular and plural (NN, NNS). Within the broad set of features extracted from syntactic annotation, we also see that different scores are reported for features referring e.g. to types of dependency relations: those linking a functional POS to its head (e.g. dep dist case, dep dist cc, dep dist conj, dep dist det) are better predicted than other relations, such as dep dist amod, advcl. Besides, within the VerbPredicate group, lower ρ scores are obtained by features encoding sub-categorization information about verbal predicates, such as the distribution of verbs by arity (verbal arity 2,3,4), which also remains almost stable across layers. Since we observed these non-homogeneous scores within the groups we defined a priori, we investigated how BERT hierarchically encodes all the features across layers. To this end, we clustered the 68 linguistic characteristics according to layerwise probing results: specifically, we performed hierarchical clustering using Euclidean distance as distance metric and Ward variance minimization as clustering method. Interestingly enough, Figure 3 shows that the traditional division of features with respect to the linguistic annotation levels has not been maintained. On the contrary, BERT puts together features from all linguistic groups into clusters of different size. In addition, these clusters gather features that are differently ranked according to the baseline scores (ranking positions are bolded in the figure). For example, the first cluster includes features with similar ρ scores, and both highly and lower ranked by the baseline. All these features model aspects of global sentence structure, e.g. sent length, functional POSs (e.g. upos dist DET, ADP, CCONJ), parsed tree structures (e.g. parse depth, verbal heads dist, avg links len), nuclear elements of the sentence such as subjects (dep dist nsubj), verbs (VERBS), pronouns (PRON).
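The clustering step can be illustrated with a small sketch. It assumes each feature is represented by a 12-dimensional vector of layerwise ρ scores; the profiles below are synthetic, and the choice of cutting the tree into two clusters is made only for the example (the text does not state how the dendrogram was cut).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n_layers = 12

# Two synthetic families of layerwise rho profiles (hypothetical values):
# one decreasing across layers, one flat and low.
decreasing = np.linspace(0.8, 0.4, n_layers)
flat = np.full(n_layers, 0.2)
profiles = np.vstack(
    [decreasing + rng.normal(scale=0.01, size=n_layers) for _ in range(4)]
    + [flat + rng.normal(scale=0.01, size=n_layers) for _ in range(4)]
)
feature_names = [f"feature_{i}" for i in range(len(profiles))]

# Ward variance minimization over Euclidean distances between profiles.
tree = linkage(profiles, method="ward", metric="euclidean")

# Cut the tree into two flat clusters for inspection.
cluster_ids = fcluster(tree, t=2, criterion="maxclust")
for name, cluster_id in zip(feature_names, cluster_ids):
    print(name, cluster_id)
```

Because features are grouped purely by the shape of their layerwise scores, features from different annotation levels can end up in the same cluster, which is the effect observed in Figure 3.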

The Impact of Fine-Tuning on Linguistic Knowledge
Once we had probed the linguistic knowledge encoded by BERT across its layers, we investigated how it changes after a fine-tuning process. To do so, we started with the same pre-trained version of the model used in the previous experiment and performed a fine-tuning process for each of the 10 subsets built from the original NLI corpus (Sec. 3.1). We decided to use 50% of each NLI subset for fine-tuning (40% as training set and 10% as development set) and the remaining 50% for testing the accuracy of the newly generated models.
Table 3 reports the results for the 10 binary NLI tasks. As we can notice, BERT achieves good results for all downstream tasks, meaning that it is able to discriminate the L1 of a writer at the sentence level regardless of the L1 pairs taken into account. The best performance is achieved by the model that was fine-tuned on the Korean and Italian pairwise subset, while the lowest scores are obtained with the model trained on the subset consisting of essays written by Spanish and Italian L1 speakers (SPA-ITA). Interestingly, these results seem to reflect typological distances among L1 pairs, with higher scores for languages that are more distant from Italian (Korean, Telugu or Hindi) and lower scores for L1s belonging to the same language family (FRE-ITA or SPA-ITA).
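The 40/10/50 partition of each NLI subset can be sketched as below. The helper `split_subset` is hypothetical, and two details are assumptions not stated in the text: that the split is random and that it operates on a flat list of items (whether documents or sentences were split is not specified).

```python
import random


def split_subset(items, seed=0):
    """Shuffle items and split them 40% train, 10% development, 50% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_items = len(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = n_items * 40 // 100
    n_dev = n_items * 10 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# One pairwise subset contains 2,400 documents (1,200 per L1).
documents = [f"doc_{i}" for i in range(2400)]
train, dev, test = split_subset(documents)
print(len(train), len(dev), len(test))  # 960 240 1200
```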
After fine-tuning the model on NLI, we performed again the suite of probing tasks on the UD dataset using the 10 newly generated models and following the same approach discussed in Section 4. Figure 4 reports layerwise mean ρ correlation values for all probing tasks obtained with BERT-base and the other fine-tuned models. It can be noticed that the representations learned by the NLM tend to lose their precision in encoding our set of linguistic features after the fine-tuning process. This is particularly noticeable at higher layers and it possibly suggests that the model is storing task-specific information at the expense of its ability to encode general knowledge about the language. Again, this is particularly evident for the models fine-tuned on the classification of language pairs belonging to the same family, SPA-ITA above all. To study which phenomena are mainly involved in this loss, we computed the differences between the probing tasks results obtained before and after the fine-tuning process. We focused in particular on the scores obtained on the output layer representations (layer -1), since it is the most task-specific (Kovaleva et al., 2019). For each subset, Figure 5 reports the difference between the score of each linguistic feature obtained with the pre-trained model and the fine-tuned one. Not surprisingly, the loss of linguistic knowledge reflects the typological trend observed for overall classification performance. In fact, when the task is to distinguish Italian vs German, French and Spanish L1, BERT loses much of its encoded knowledge for almost all the considered features. This is particularly evident for the morpho-syntactic features (i.e. distribution of upos dist and xpos dist) and for features related to lexical variety (i.e. ttr form, ttr lemma). It seems that for typologically similar languages BERT needs more task-specific knowledge mostly encoded at the level of morpho-syntactic information rather than the structural level. On the contrary, the drop is less pronounced and in most cases not significant for models fine-tuned on the classification of more distant languages (e.g. models fine-tuned on KOR-ITA or TUR-ITA). In this case, the quite stable performance on the probing tasks may suggest that those features were still useful to perform the downstream task. Interestingly, the class of features that decreases significantly in all models is the one encoding the knowledge about the tense of verbs. This is particularly the case of the third-person singular verbs in the present tense (xpos dist VBZ) and of verbs in the past tense (xpos dist VBD). A possible explanation could be related to the prompts of essays, which are the same across the NLI dataset. Thus, the textual genre could have favored a quite homogeneous use of verbal morphology features by students of all L1s. This makes this class of features less useful for the identification of native languages.
Figure 6: % of probing features for which the MSE of the sentences correctly classified by BERT-base (Pre-train) and the fine-tuned models (Fine-tune) is lower than that of the incorrectly classified ones. Results are reported for layers -12, -7 and -1.
Are Linguistic Features useful for BERT's predictions?
As a last research question we investigated whether the implicit linguistic knowledge affects BERT's predictions when solving the NLI downstream task. To answer this question we split each NLI subset into two groups, i.e. sentences correctly classified according to the L1 and those incorrectly classified. For the two groups of each NLI subset, we performed the probing tasks using the pre-trained BERT-base and the specific NLI fine-tuned model. For each sentence of the two groups, we calculated the variation between the actual and predicted feature value, obtaining two lists of absolute errors. We used the Wilcoxon rank-sum test to verify whether the two lists were drawn from the same distribution. As a general remark, we observed that well over half of the features vary in a statistically significant way between correctly and incorrectly classified sentences. This suggests that BERT's linguistic competence on the two groups of sentences is very different. To deepen the analysis of this difference, we calculated the accuracy achieved by BERT in terms of Mean Square Error (MSE) only for the set of features varying in a significant way.
Figure 6 reports the percentage of features for which the MSE of the sentences correctly classified (MSE Pos) is lower than that of the incorrectly classified ones (MSE Neg). This percentage is significantly higher, thus showing that BERT's capacity to encode different kinds of linguistic information could have an influence on its predictions: the more readable linguistic information BERT stores in the representations it creates, the better it predicts the correct L1. Moreover, we noticed that this is true also (and especially) using the pre-trained model. In other words, this result suggests that the evaluation of the linguistic knowledge encoded in a pre-trained version of BERT on a specific input sequence could be an insightful indicator of its ability in analyzing that sentence with respect to a downstream task.
Interestingly, if we analyze the average length of correctly and incorrectly classified sentences, the correct ones are much longer than the others for all tasks (from 3 tokens more for SPA-ITA to 9 for TEL-ITA). This is quite expected for the NLI task, since a higher number of linguistic events possibly occurring in longer sentences are needed to classify the L1 of a sentence (Dell'Orletta et al., 2014). At the same time, longer sentences make the probing tasks more complex because the output space is larger for almost all of them. This is additional evidence that BERT's linguistic knowledge is not strictly related to sentence complexity, but rather to the model's ability to solve a specific downstream task. To confirm this hypothesis and verify that such tendency does not only depend on sentence length, we trained another LinearSVR that takes as input the sentence length and predicts our probing tasks separately for correctly and incorrectly classified NLI sentences. Table 4 reports the average Spearman's correlation coefficients between gold and predicted probing features for the two classes of sentences. Results showed that, for all the considered language pairs, the LinearSVR achieved higher scores for the probing tasks computed with respect to the incorrectly classified NLI sentences. This is additional evidence that deeper linguistic knowledge is needed for BERT to correctly classify the L1 of a sentence.
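The comparison between correctly and incorrectly classified sentences can be sketched as follows. The functions `compare_groups` and `share_better_on_correct` are hypothetical names, the data are synthetic, and the significance threshold `alpha=0.05` is an assumption, since the text does not report the level used.

```python
import numpy as np
from scipy.stats import ranksums


def compare_groups(gold_pos, pred_pos, gold_neg, pred_neg, alpha=0.05):
    """Test whether probing errors differ between the two groups of sentences."""
    abs_err_pos = np.abs(gold_pos - pred_pos)  # correctly classified sentences
    abs_err_neg = np.abs(gold_neg - pred_neg)  # incorrectly classified sentences
    _, p_value = ranksums(abs_err_pos, abs_err_neg)
    return {
        "significant": bool(p_value < alpha),
        "mse_pos": float(np.mean((gold_pos - pred_pos) ** 2)),
        "mse_neg": float(np.mean((gold_neg - pred_neg) ** 2)),
    }


def share_better_on_correct(results):
    """Percentage of significant features with MSE Pos lower than MSE Neg."""
    significant = [r for r in results if r["significant"]]
    if not significant:
        return 0.0
    better = sum(r["mse_pos"] < r["mse_neg"] for r in significant)
    return 100.0 * better / len(significant)


# Synthetic probing outputs for five hypothetical features: predictions are
# deliberately noisier on the incorrectly classified group.
rng = np.random.default_rng(0)
results = []
for _ in range(5):
    gold_pos = rng.normal(size=200)
    gold_neg = rng.normal(size=200)
    pred_pos = gold_pos + rng.normal(scale=0.1, size=200)
    pred_neg = gold_neg + rng.normal(scale=1.0, size=200)
    results.append(compare_groups(gold_pos, pred_pos, gold_neg, pred_neg))

share = share_better_on_correct(results)
print(f"{share:.1f}% of significant features have MSE Pos < MSE Neg")
```

Restricting the percentage to features that pass the rank-sum test mirrors the procedure described above, where the MSE is computed only for features varying in a significant way.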

Conclusion
In this paper we studied what kind of linguistic properties are stored in the internal representations learned by BERT before and after a fine-tuning process and how this implicit knowledge correlates with the model predictions when it is trained on a specific downstream task. Using a suite of 68 probing tasks, we showed that the pre-trained version of BERT encodes a wide range of linguistic phenomena across its 12 layers, but the order in which probing features are stored in the internal representations does not necessarily reflect the traditional division with respect to the linguistic annotation levels. We also found that BERT tends to lose its precision in encoding our set of probing features after the fine-tuning process, probably because it is storing more task-related information for solving NLI. Finally, we showed that the implicit linguistic knowledge encoded by BERT positively affects its ability to solve the tested downstream tasks.

Figure 2: Layerwise ρ scores for the 68 linguistic features.Absolute baseline scores are reported in column B.

Figure 3: Hierarchical clustering of the 68 probing tasks based on layerwise ρ values.Bold numbers correspond to the ranking of each probing feature based on the correlation with sentence length.

Figure 4: Layerwise mean ρ scores for the pre-trained and fine-tuned models.
Table 4: Average ρ scores for sentences correctly and incorrectly classified using only sentence length as input feature.

Table 1: Linguistic Features used in the probing tasks.