Similarity Analysis of Contextual Word Representation Models

This paper investigates contextual word representation models through the lens of similarity analysis. Given a collection of trained models, we measure the similarity of their internal representations and attention. Critically, these models come from vastly different architectures. We use existing and novel similarity measures that aim to gauge the level of localization of information in the deep models, and facilitate the investigation of which design factors affect model similarity, without requiring any external linguistic annotation. The analysis reveals that models within the same family are more similar to one another, as may be expected. Surprisingly, different architectures have rather similar representations, but different individual neurons. We also observed differences in information localization in lower and higher layers and found that higher layers are more affected by fine-tuning on downstream tasks.


Introduction
Contextual word representations such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) have led to impressive improvements in a variety of tasks. With this progress in breaking the state of the art, interest in the community has expanded to analyzing such models in an effort to illuminate their inner workings. A number of studies have analyzed the internal representations in such models and attempted to assess what linguistic properties they capture. A prominent methodology for this is to train supervised classifiers based on the models' learned representations, and predict various linguistic properties. For instance, Liu et al. (2019a) train such classifiers on 16 linguistic tasks, including part-of-speech tagging, chunking, named entity recognition, and others. Such an approach may reveal how well representations from different models, and model layers, capture different properties. This approach, known as analysis by probing classifiers, has been used in numerous other studies (Belinkov and Glass, 2019).
The code is available at https://github.com/johnmwu/contextual-corr-analysis.
While the above approach yields compelling insights, its applicability is constrained by the availability of linguistic annotations. In addition, comparisons of different models are indirect, via the probing accuracy, making it difficult to comment on the similarities and differences of different models. In this paper, we develop complementary methods for analyzing contextual word representations based on their inter- and intra-similarity. While this similarity analysis does not tell us absolute facts about a model, it allows comparing representations without subscribing to one type of information. We consider several kinds of similarity measures based on different levels of localization/distributivity of information: from neuron-level pairwise comparisons of individual neurons to representation-level comparisons of full word representations. We also explore similarity measures based on models' attention weights, in the case of Transformer models (Vaswani et al., 2017). This approach enables us to ask questions such as: Do different models behave similarly on the same inputs? Which design choices determine whether models behave similarly or differently? Are certain model components more similar than others across architectures? Is the information in a given model more or less localized (encoded in individual components) compared to other models?
We choose a collection of pre-trained models that aim to capture diverse aspects of modeling choices, including the building blocks (Recurrent Networks, Transformers), language modeling objective (unidirectional, bidirectional, masked, permutation-based), and model depth (from 3 to 24 layers). More specifically, we experiment with variants of ELMo, BERT, GPT (Radford et al., 2018), GPT2 (Radford et al., 2019), and XLNet (Yang et al., 2019). Notably, we use the same methods to investigate the effect that fine-tuning on downstream tasks has on the model similarities.
Our analysis yields the following insights:
• Different architectures may have similar representations, but different individual neurons. Models within the same family are more similar to one another in terms of both their neurons and full representations.
• Lower layers are more similar than higher layers across architectures.
• Higher layers have more localized representations than lower layers.
• Higher layers are more affected by fine-tuning than lower layers, in terms of their representations and attentions, and thus are less similar to the higher layers of pre-trained models.
• Fine-tuning affects the localization of information, causing higher layers to be less localized.
Finally, we show how the similarity analysis can motivate a simple technique for efficient fine-tuning, where freezing the bottom layers of models still maintains comparable performance to fine-tuning the full network, while reducing the fine-tuning time.

Related Work
The most common approach for analyzing neural network models in general, and contextual word representations in particular, is by probing classifiers (Ettinger et al., 2016; Belinkov et al., 2017; Adi et al., 2017; Conneau et al., 2018; Hupkes et al., 2018), where a classifier is trained on a corpus of linguistic annotations using representations from the model under investigation. For example, Liu et al. (2019a) used this methodology for investigating contextual word representations on 16 linguistic tasks. One limitation of this approach is that it requires specifying linguistic tasks of interest and obtaining suitable annotations. This potentially limits the applicability of the approach.
An orthogonal analysis method relies on similarities between model representations. Bau et al. (2019) used this approach to analyze the role of individual neurons in neural machine translation. They found that individual neurons are important and interpretable. However, their work was limited to a certain kind of architecture (specifically, a recurrent one). In contrast, we compare models of various architectures and objective functions.
Other work used similarity measures to study learning dynamics in language models by comparing checkpoints of recurrent language models (Morcos et al., 2018), or a language model and a part-of-speech tagger (Saphra and Lopez, 2019). Our work adopts a similar approach, but explores a range of similarity measures over different contextual word representation models.
Questions of localization and distributivity of information have been under investigation for a long time in the connectionist cognitive science literature (Page, 2000; Bowers, 2002; Gayler and Levy, 2011). While neural language representations are thought to be densely distributed, several recent studies have pointed out the importance of individual neurons (Qian et al., 2016; Shi et al., 2016; Radford et al., 2017; Lakretz et al., 2019; Bau et al., 2019; Dalvi et al., 2019; Baan et al., 2019). Our study contributes to this line of work by designing measures of localization and distributivity of information in a collection of models. Such measures may facilitate incorporating neuron interactions in new training objectives (Li et al., 2020).

Similarity Measures
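We present five groups of similarity measures, each capturing a different similarity notion. Consider a collection of $M$ models $\{f^{(m)}\}_{m=1}^{M}$, yielding word representations $h_l^{(m)}$ and potentially attention weights $\alpha_l^{(m)}$ at each layer $l$. Let $k$ index neurons $h_l^{(m)}[k]$ or attention heads $\alpha_l^{(m)}[k]$. $h_l^{(m)}[k]$, $\alpha_l^{(m)}[k]$ are real (resp. matrix) valued, ranging over words (resp. sentences) in a corpus. Our similarity measures are of the form $\mathrm{sim}(h_l^{(m)}, h_{l'}^{(m')})$ or $\mathrm{sim}(\alpha_l^{(m)}, \alpha_{l'}^{(m')})$, that is, they find similarities between layers. We present the full mathematical details in Appendix A.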

Neuron-level similarity
A neuron-level similarity measure captures similarity between pairs of individual neurons. We consider one such measure, neuronsim, following Bau et al. (2019). For every neuron k in layer l, neuronsim finds the maximum correlation between it and another neuron in another layer l'. Then, it averages over neurons in layer l. This measure aims to capture localization of information. It is high when two layers have pairs of neurons with similar behavior. This is far more likely when the models have local, rather than distributed representations, because for distributed representations to have similar pairs of neurons the information must be distributed similarly.
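The procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `neuronsim`, the array layout (rows are words, columns are neurons), and the use of the absolute Pearson correlation are assumptions made here, and constant neurons are assumed to be absent.

```python
import numpy as np

def neuronsim(X, Y):
    """Mean over neurons in X of the best-matching correlation in Y.

    X: (n_words, d1) activations of layer l.
    Y: (n_words, d2) activations of layer l', computed on the same words.
    Assumption: "correlation" is taken as the absolute Pearson correlation.
    """
    # Center and L2-normalize every neuron (column), so that dot products
    # between columns are Pearson correlations.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xn = Xc / np.linalg.norm(Xc, axis=0)
    Yn = Yc / np.linalg.norm(Yc, axis=0)
    corr = Xn.T @ Yn                      # (d1, d2) correlation matrix
    # For each neuron in X take its best match in Y, then average.
    return float(np.abs(corr).max(axis=1).mean())

# Toy usage with random activations (hypothetical data).
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(100, 8))
layer_b = rng.normal(size=(100, 6))
score = neuronsim(layer_a, layer_b)
```

Because each neuron is matched independently, the measure is not symmetric in general: `neuronsim(X, Y)` and `neuronsim(Y, X)` can differ.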

Mixed neuron-representation similarity
A mixed neuron-representation similarity measure captures the similarity between a neuron in one model and a layer in another. We consider one such measure, mixedsim: for every neuron k in layer l, regress to it from all neurons in layer l' and measure the quality of fit. Then, average over neurons in l. It is possible that some information is localized in one layer but distributed in another layer. mixedsim captures such a phenomenon.
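A minimal sketch of this regression-based measure follows. It assumes ordinary least squares and uses the coefficient of determination (R^2) as the quality of fit; both choices, along with the function name `mixedsim` and the array layout, are assumptions for illustration and may differ from the definition in Appendix A.

```python
import numpy as np

def mixedsim(X, Y):
    """Average fit quality when predicting each neuron of X from all of Y.

    X: (n_words, d1) activations of layer l (regression targets).
    Y: (n_words, d2) activations of layer l' (predictors).
    Assumption: quality of fit is measured by R^2 of a least-squares fit.
    """
    # Centering both sides plays the role of an intercept term.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # One least-squares problem per target neuron, solved jointly.
    coef, *_ = np.linalg.lstsq(Yc, Xc, rcond=None)   # (d2, d1)
    residual = Xc - Yc @ coef
    ss_res = (residual ** 2).sum(axis=0)
    ss_tot = (Xc ** 2).sum(axis=0)
    r_squared = 1.0 - ss_res / ss_tot
    return float(r_squared.mean())

# Toy usage: neurons in `mixed` are linear combinations of neurons in `base`,
# so the information is distributed in `base` but recoverable by regression.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 5))
mixed = base @ rng.normal(size=(5, 3))
fit = mixedsim(mixed, base)
```

In this toy case the fit is essentially perfect even though no single neuron in `base` needs to correlate strongly with a neuron in `mixed`, which is the situation the paragraph above describes.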

Representation-level similarity
A representation-level measure finds correlations between full models (or layers) simultaneously. We consider three such measures: two based on canonical correlation analysis (CCA), namely singular vector CCA (svsim; Raghu et al. 2017) and projection weighted CCA (pwsim; Morcos et al. 2018), in addition to linear centered kernel alignment (ckasim; Kornblith et al. 2019). These measures emphasize distributivity of information: if two layers behave similarly over all of their neurons, the similarity will be higher, even if no individual neuron has a similar matching pair or is represented well by all neurons in the other layer.
Other representation-level similarity measures may be useful, such as representation similarity analysis (RSA; Kriegeskorte et al. 2008), which has been used to analyze neural network representations (Bouchacourt and Baroni, 2018; Chrupała and Alishahi, 2019; Chrupała, 2019), or other variants of CCA, such as deep CCA. We leave the exploration of such measures to future work.
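Of the three representation-level measures, linear CKA is the simplest to write down. The sketch below follows the standard linear CKA formula from Kornblith et al. (2019); the function name `linear_cka` and the array layout are assumptions made here for illustration.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two layers.

    X: (n_words, d1), Y: (n_words, d2), computed on the same words.
    Returns a value in [0, 1]; 1 means the representations agree up to
    an orthogonal transformation and isotropic scaling.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    norm_x = np.linalg.norm(Xc.T @ Xc, "fro")
    norm_y = np.linalg.norm(Yc.T @ Yc, "fro")
    return float(cross / (norm_x * norm_y))

# Toy usage: rotate and rescale a random representation (hypothetical data).
rng = np.random.default_rng(0)
reps = rng.normal(size=(100, 6))
rotation, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # random orthogonal matrix
rotated = 3.0 * reps @ rotation
cka_value = linear_cka(reps, rotated)
```

The rotated copy shares no individual neuron with the original, yet the two are identical as full representations, so a distributed measure such as this one treats them as maximally similar.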

Attention-level similarity
Previous work analyzing network similarity has mostly focused on representation-based similarities (Morcos et al., 2018; Saphra and Lopez, 2019; Voita et al., 2019a). Here we consider similarity based on attention weights in Transformer models. Analogous to a neuron-level similarity measure, an attention-level similarity measure finds the most "correlated" other attention head. We consider three methods to correlate heads, based on the norm of two attention matrices $\alpha_l^{(m)}[k]$ and $\alpha_{l'}^{(m')}[k']$, their Pearson correlation, and their Jensen-Shannon divergence. We then average over heads k in layer l, as before. These measures are similar to neuronsim in that they emphasize localization of information: if two layers have pairs of heads that are very similar in their behavior, the similarity will be higher.
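As an illustration, the Pearson-correlation variant can be sketched as follows. The helper names, the flattening of each head's attention matrices into one long vector over the corpus, and the input layout are assumptions made for this sketch; heads with constant attention values are assumed not to occur.

```python
import numpy as np

def flatten_heads(attn_per_sentence):
    """Stack per-sentence attention into one vector per head.

    attn_per_sentence: list of arrays shaped (n_heads, n_words, n_words),
    one per sentence. Returns an array shaped (n_heads, n_word_pairs).
    """
    return np.concatenate(
        [a.reshape(a.shape[0], -1) for a in attn_per_sentence], axis=1
    )

def attnsim_pearson(A, B):
    """Mean over heads in A of the maximum Pearson correlation with a head in B.

    A: (n_heads_a, n_word_pairs), B: (n_heads_b, n_word_pairs),
    both computed on the same sentences.
    """
    Ac = A - A.mean(axis=1, keepdims=True)
    Bc = B - B.mean(axis=1, keepdims=True)
    An = Ac / np.linalg.norm(Ac, axis=1, keepdims=True)
    Bn = Bc / np.linalg.norm(Bc, axis=1, keepdims=True)
    corr = An @ Bn.T                      # (n_heads_a, n_heads_b)
    return float(corr.max(axis=1).mean())

# Toy usage: two sentences (3 and 4 words), two heads, random attention.
rng = np.random.default_rng(0)

def random_attention(n_heads, n_words):
    logits = rng.normal(size=(n_heads, n_words, n_words))
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

layer_attention = flatten_heads([random_attention(2, 3), random_attention(2, 4)])
self_similarity = attnsim_pearson(layer_attention, layer_attention)
```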

Distributed attention-level similarity
We consider parallels of the representation-level similarity. To compare the entire attention heads in two layers, we concatenate all weights from all heads in one layer to get an attention representation. That is, we obtain attention representations $\alpha_l^{(m)}[h]$, a random variable ranging over pairs of words in the same sentence, such that $\alpha_{l,(i,j)}^{(m)}[h]$ is a scalar value. It is a matrix where the first axis is indexed by word pairs, and the second by heads. We flatten these matrices and use svsim, pwsim, and ckasim as above for comparing these attention representations. These measures should be high when the entire set of heads in one layer is similar to the set of heads in another layer.
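The construction of the attention representation can be sketched as below, paired with a compact linear CKA for the comparison step. The function names and the input layout (one array of shape heads x words x words per sentence) are assumptions made for illustration.

```python
import numpy as np

def attention_representation(attn_per_sentence):
    """Build a (n_word_pairs, n_heads) matrix for one layer.

    attn_per_sentence: list of arrays shaped (n_heads, n_words, n_words).
    Rows are indexed by word pairs (i, j) across all sentences, columns by heads.
    """
    return np.concatenate(
        [a.reshape(a.shape[0], -1).T for a in attn_per_sentence], axis=0
    )

def linear_cka(X, Y):
    """Compact linear CKA between two (n_rows, n_features) matrices."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    return float(
        np.linalg.norm(Yc.T @ Xc) ** 2
        / (np.linalg.norm(Xc.T @ Xc) * np.linalg.norm(Yc.T @ Yc))
    )

# Toy usage: random attention for two sentences and three heads.
rng = np.random.default_rng(0)
sentences = []
for n_words in (3, 5):
    logits = rng.normal(size=(3, n_words, n_words))
    weights = np.exp(logits)
    sentences.append(weights / weights.sum(axis=-1, keepdims=True))

layer_rep = attention_representation(sentences)
# Reordering the heads does not change the layer as a whole.
reordered = layer_rep[:, ::-1]
whole_layer_similarity = linear_cka(layer_rep, reordered)
```

Unlike the head-matching measures, this comparison looks at the set of heads jointly, so it is unaffected by how heads are ordered within a layer.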

Experimental Setup
Models
We choose a collection of pre-trained models that aim to capture diverse aspects of modeling choices, including the building blocks (RNNs, Transformers), language modeling objective (unidirectional, bidirectional, masked, permutation-based), and model depth (from 3 to 24 layers). BERT is also trained with a next sentence prediction objective, although this may be redundant (Liu et al., 2019b).

Data
For analyzing the models, we run them on the Penn Treebank development set (Marcus et al., 1993), following the setup taken by Liu et al. (2019a) in their probing classifier experiments. As suggested by a reviewer, we verified that the results are consistent when using another dataset (Appendix B.1). We collect representations and attention weights from each layer in each model for computing the similarity measures. We obtain representations for models used in Liu et al. (2019a) from their implementation and use the transformers library (Wolf et al., 2019) to extract other representations. We aggregate sub-word representations by taking the representation of the last sub-word, following Liu et al. (2019a), and sub-word attentions by summing up attention to sub-words and averaging attention from sub-words, following Clark et al. (2019), which guarantees that the attention from each word sums to one.
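The two sub-word aggregation rules can be sketched as follows. The function names and the `word_ids` input (the index of the word each sub-word belongs to, in order) are hypothetical conveniences introduced for this sketch, not part of the original pipeline.

```python
import numpy as np

def aggregate_representations(H, word_ids):
    """Keep the representation of the last sub-word of every word.

    H: (n_subwords, d) sub-word representations.
    word_ids: length n_subwords, word index of each sub-word (0-based).
    """
    word_ids = np.asarray(word_ids)
    n_words = int(word_ids.max()) + 1
    last = [np.flatnonzero(word_ids == w)[-1] for w in range(n_words)]
    return H[last]

def aggregate_attention(A, word_ids):
    """Convert sub-word attention to word attention.

    A: (n_subwords, n_subwords), each row sums to one.
    Attention *to* a word is the sum over its sub-words; attention *from*
    a word is the average over its sub-words.
    """
    word_ids = np.asarray(word_ids)
    n_sub = A.shape[0]
    n_words = int(word_ids.max()) + 1
    membership = np.zeros((n_sub, n_words))
    membership[np.arange(n_sub), word_ids] = 1.0   # sub-word -> word indicator
    to_words = A @ membership                       # sum over target sub-words
    counts = membership.sum(axis=0)
    return (membership.T @ to_words) / counts[:, None]   # average over sources

# Toy usage: four sub-words forming three words (the second word has two pieces).
word_ids = [0, 1, 1, 2]
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4))
subword_attention = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
word_attention = aggregate_attention(subword_attention, word_ids)
word_reps = aggregate_representations(np.arange(8.0).reshape(4, 2), word_ids)
```

Summing over targets and averaging over sources keeps every row a probability distribution, which is the guarantee mentioned in the text.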

5 Similarity of Pre-trained Models

5.1 Neuron and representation levels
Figure 1 shows heatmaps of similarities between layers of different models, according to neuronsim and ckasim. Heatmaps for the other measures are provided in Appendix B. The heatmaps reveal the following insights.

Different architectures may have similar representations, but different individual neurons
Comparing the heatmaps, the most striking distinction is that neuronsim induces a distinctly block-diagonal heatmap, reflecting high intra-model similarities and low inter-model similarities. As neuronsim is computed by finding pairs of very similar neurons, this means that within a model, different layers have similar individual neurons, but across models, neurons are very different. In contrast, ckasim shows fairly significant similarities across models (high values off the main diagonal), indicating that different models generate similar representations. The highest cross-model similarities are found by mixedsim (Figure 8d in Appendix B), which suggests that individual neurons in one model may be well represented by a linear combination of neurons in another layer. The other representation-level similarities (ckasim, svsim, and pwsim) also show cross-model similarities, albeit to a lesser extent.

Models within the same family are more similar
The heatmaps show greater similarity within a model than across models (bright diagonal). Different models sharing the same architecture and objective function, but different depths, also exhibit substantial representation-level similarities; for instance, compare BERT-base and BERT-large or ELMo-original and ELMo-4-layers, under ckasim (Figure 1b). The Transformer-ELMo presents an instructive case, as it shares ELMo's bidirectional objective function but with Transformers rather than RNNs. Its layers are mostly similar to themselves and the other ELMo models, but also to GPT, more so than to BERT or XLNet, which use masked and permutation language modeling objectives, respectively. Thus it seems that the objective has a considerable impact on representation similarity. The fact that models within the same family are more similar to each other supports the choice of Saphra and Lopez (2019) to use models of similar architecture when probing models via similarity measures across tasks.
A possible confounder is that models within the same family are trained on the same data, but cross-family models are trained on different data. It is difficult to control for this given the computational demands of training such models and the current practice in the community of training models on ever increasing sizes of data, rather than a standard fixed dataset. However, Figure 2 shows similarity heatmaps of layers from pre-trained and randomly initialized models using ckasim, exhibiting high intra-model similarities, as before. Interestingly, models within the same family (either GPT2 or XLNet) are more similar than across families, even with random models, indicating that intrinsic aspects of models in a given family make them similar, regardless of the training data or process. As may be expected, in most cases, the similarity between random and pre-trained models is small.
One exception is the vertical bands in the lower triangle, which indicate that the bottom layers of trained models are similar to many layers of random models. This may be due to random models merely transferring information from bottom to top, without meaningful processing. Still, it may explain why random models sometimes generate useful features (Wieting and Kiela, 2019). Meanwhile, as pointed out by a reviewer, lower layers converge faster, leaving them closer to their initial random state (Raghu et al., 2017; Shwartz-Ziv and Tishby, 2017).

Lower layers are more similar across architectures
The representation-level heatmaps (Figure 1) all exhibit horizontal stripes at lower layers, especially with ckasim, indicating that lower layers are more similar than higher layers when comparing across models. This pattern can be explained by lower layers being closer to the input, which is always the same words. A similar observation has been made for vision networks (Raghu et al., 2017). Voita et al. (2019a) found a similar pattern comparing Transformer models with different objective functions.

Adjacent layers are more similar
All heatmaps in Figure 1 exhibit a very bright diagonal and bright lines slightly off the main diagonal, indicating that adjacent layers are more similar. This is even true when comparing layers of different models (notice the diagonal nature of BERT-base vs. BERT-large in Figure 1b), indicating that layers at the same relative depth are more similar than layers at different relative depths. A similar pattern was found in vision networks (Kornblith et al., 2019).
Some patterns are unexpected. For instance, comparing XLNet with the BERT models, it appears that lower layers of XLNet are more similar to higher layers of BERT. We speculate that this is an artifact of the permutation-based objective in XLNet.
We found corroborating evidence for this observation in ongoing parallel work, where we compare BERT and XLNet at different layers through word- (Liu et al., 2019a) and sentence-level tasks (Wang et al., 2019): while BERT requires mostly features from higher layers to achieve state-of-the-art results, in XLNet lower and middle layers suffice.

Higher layers are more localized than lower ones
The different similarity measures capture different levels of localization vs. distributivity of information. neuronsim captures cases of localized information, where pairs of neurons in different layers behave similarly. svsim captures cases of distributed information, where the full layer representation is similar. To quantify these differences, we compute the average similarity according to each measure when comparing each layer to all other layers. In effect, we take the column-wise mean of each heatmap. We do this separately for svsim as the distributed measure and neuronsim as the localized measure, and we subtract the svsim means from the neuronsim means. This results in a measure of localization per layer. Figure 3 shows the results.
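The localization score described above reduces to a difference of column-wise means. The sketch below assumes both heatmaps are square arrays over the same ordered set of layers and that the diagonal (self-similarity) is included in the mean; the function name and the toy values are hypothetical.

```python
import numpy as np

def localization_scores(neuronsim_heatmap, svsim_heatmap):
    """Per-layer localization: mean neuronsim minus mean svsim.

    Both inputs are (n_layers, n_layers) similarity heatmaps indexed by the
    same layers. Each column is averaged over all rows.
    """
    localized = neuronsim_heatmap.mean(axis=0)     # column-wise mean
    distributed = svsim_heatmap.mean(axis=0)
    return localized - distributed

# Toy usage with made-up similarity values for two layers.
neuron_map = np.array([[1.0, 0.4],
                       [0.2, 1.0]])
sv_map = np.array([[1.0, 0.6],
                   [0.6, 1.0]])
scores = localization_scores(neuron_map, sv_map)
```

A higher score means that a layer is matched relatively better by individual neurons than by full representations, so the second layer in this toy example counts as more localized than the first.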
In all models, the localization score mostly increases with layers, indicating that information tends to become more localized at higher layers. (Recurrent models are more monotonic than Transformers, echoing results by Liu et al. (2019a) on language modeling perplexity in different layers.) This pattern is quite consistent, but may be surprising given prior observations on lower layers capturing phenomena that operate at a local context (Tenney et al., 2019), which presumably require fewer neurons. However, this pattern is in line with observations made by Ethayarajh (2019), who reported that upper layers of pre-trained models produce more context-specific representations. There appears to be a correspondence between our localization score and Ethayarajh's context-specificity score, which is based on the cosine similarity of representations of the same word in different contexts. Thus, more localized representations are also more context-specific. A direct comparison between context-specificity and localization may be a fruitful avenue for future work.
Some models seem less localized than others, especially the ELMo variants, although this may be confounded by their being shallower models. BERT and XLNet models first decrease in localization and then increase. Interestingly, XLNet's localization score decreases towards the end, suggesting that its top layer representations are less context-specific.

Attention level
Figure 4 shows similarity heatmaps using two of the attention-level similarity measures, Jensen-Shannon and ckasim, for layers from 6 models: BERT-base/large, GPT2-small/medium, and XLNet-base/large. Layers within the same model or model family exhibit higher similarities (bright block diagonal), in line with results from the representation-level analysis. In particular, under both measures, GPT2 layers are all very similar to each other, except for the bottom ones. Comparing the two heatmaps, the localized Jensen-Shannon similarity (Figure 4a) shows higher similarities off the main diagonal than the distributed ckasim measure (Figure 4b), indicating that different models have pairs of attention heads that behave similarly, although the collection of heads from two different models is different in the aggregate. Heatmaps for the other measures are provided in Appendix C, following primarily the same patterns.
It is difficult to identify patterns within a given model family. However, under the attention-based svsim (Figure 10d in Appendix C), and to a lesser extent pwsim (Figure 10e), we see bright diagonals when comparing different GPT2 (and to a lesser extent XLNet and BERT) models, such that layers at the same relative depth are similar in their attention patterns. We have seen such a result also in the representation-based similarities. Adjacent layers seem more similar in some cases, but these patterns are often swamped by the large intra-model similarity. This result differs from our results for representational similarity.
GPT2 models, at all layers, are similar to the bottom layers of BERT-large, expressed in bright vertical bands. In contrast, GPT2 models do not seem to be especially similar to XLNet. Comparing XLNet and BERT, we find that lower layers of XLNet are quite similar to higher layers of BERT-base and middle layers of BERT-large. This parallels the findings from comparing representations of XLNet and BERT, which we conjecture is the result of the permutation-based objective in XLNet.
In general, we find the attention-based similarities to be mostly in line with the neuron- and representation-level similarities. Nevertheless, they appear to be harder to interpret, as fine-grained patterns are less noticeable. One might mention in this context concerns regarding the reliability of attention weights for interpreting the importance of input words in a model (Jain and Wallace, 2019; Serrano and Smith, 2019; Brunner et al., 2020). However, characterizing the effect of such concerns on our attention-based similarity measures is beyond the current scope.

Similarity of Fine-tuned Models
How does fine-tuning on downstream tasks affect model similarity? In this section, we compare pre-trained models and their fine-tuned versions. We use four of the GLUE tasks (Wang et al., 2019):
• MNLI: A multi-genre natural language inference dataset (Williams et al., 2018), where the task is to predict whether a premise entails a hypothesis.
• QNLI: A conversion of the Stanford question answering dataset (Rajpurkar et al., 2016), where the task is to determine whether a sentence contains the answer to a question.
• QQP: A collection of question pairs from the Quora website, where the task is to determine whether two questions are semantically equivalent.

Results
Top layers are more affected by fine-tuning
Figure 5 shows representation-level ckasim similarity heatmaps of pre-trained (not fine-tuned) and fine-tuned versions of BERT and XLNet. The most striking pattern is that the top layers are more affected by fine-tuning than the bottom layers, as evidenced by the low similarity of high layers of the pre-trained models with their fine-tuned counterparts. Hao et al. (2019) also observed that lower layers of BERT are less affected by fine-tuning than top layers, by visualizing the training loss surfaces. In Appendix D, we demonstrate that this observation can motivate a more efficient fine-tuning process, where some of the layers are frozen while others are fine-tuned.
There are some task-specific differences. In BERT, the top layers of the SST-2-fine-tuned model are affected more than other layers. This may be because SST-2 is a sentence classification task, while the other tasks are sentence-pair classification. A potential implication of this is that non-SST-2 tasks can contribute to one another in a multi-task fine-tuning setup. In contrast, in XLNet, fine-tuning on any task leads to top layers being very different from all layers of models fine-tuned on other tasks. This suggests that XLNet representations become very task-specific, and thus multi-task fine-tuning may be less effective with XLNet than with BERT.
Observing the attnsim similarity based on Jensen-Shannon divergence for base and fine-tuned models (Figure 6), we again see that top layers have lower similarities, implying that they undergo greater changes during fine-tuning. Other attention-based measures behaved similarly (not shown). Kovaleva et al. (2019) made a similar observation by comparing the cosine similarity of attention matrices in BERT, although they did not perform cross-task comparisons. In fact, the diagonals within each block indicate that bottom layers remain similar to one another even when fine-tuning on different tasks, while top layers diverge after fine-tuning. The vertical bands at layer 0 mean that many higher layers have a head that is very similar to a head from the first layer, that is, a form of redundancy, which can explain why many heads can be pruned (Michel et al., 2019; Voita et al., 2019b; Kovaleva et al., 2019). Comparing BERT and XLNet, the vertical bands at the top layers of BERT (especially in MNLI, QQP, and SST-2) suggest that some top layers are very similar to any other layer. In XLNet, top MNLI layers are quite

Fine-tuning affects localization
Figure 7 shows localization scores for different layers in pre-trained and fine-tuned models. In contrast to the pre-trained models, the fine-tuned ones decrease in localization at the top layers. This decrease may be the result of top layers learning high-level tasks, which require multiple neurons to capture properly.

Conclusion
In this work, we analyzed various prominent contextual word representations from the perspective of similarity analysis. We compared different layers of pre-trained models using both localized and distributed measures of similarity, at neuron, representation, and attention levels. We found that different architectures often have similar internal representations, but differ at the level of individual neurons. We also observed that higher layers are more localized than lower ones. Comparing fine-tuned and pre-trained models, we found that higher layers are more affected by fine-tuning in their representations and attention weights, and become less localized. These findings motivated experimenting with layer-selective fine-tuning, where we were able to obtain good performance while freezing the lower layers and only fine-tuning the top ones. Our approach is complementary to the linguistic analysis of models via probing classifiers. An exciting direction for future work is to combine the two approaches in order to identify which linguistic properties are captured in model components that are similar to one another, or explicate how localization of information contributes to the learnability of particular properties. It may be insightful to compare the results of our analysis to the loss surfaces of the same models, especially before and after fine-tuning (Hao et al., 2019). One could also study whether a high similarity entails that two models converged to a similar solution.
Our localization score can also be compared to other aspects of neural representations, such as gradient distributions and their relation to memorization/generalization (Arpit et al., 2017). Finally, the similarity analysis may also help improve model efficiency, for instance by pointing to components that do not change much during fine-tuning and can thus be pruned.
where KL is the KL divergence and β is the average of the two attention distributions. We then average over words in the corpus.
As before, this gives rise to aggregate measures at the layer level by averaging over heads h.
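For reference, the standard definition of the Jensen-Shannon divergence that the description above relies on can be written as follows. The symbols p and q are placeholders introduced here for the two attention distributions being compared (for example, the attention of one word under two different heads); they are not the paper's own notation.

```latex
\[
\mathrm{JS}(p, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, \beta) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, \beta),
\qquad
\beta = \tfrac{1}{2}\,(p + q),
\qquad
\mathrm{KL}(p \,\|\, \beta) = \sum_{j} p_j \log \frac{p_j}{\beta_j}.
\]
```

Because β is a mixture of p and q, both KL terms are finite even when one distribution assigns zero weight to a word that the other attends to.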

B Additional Representation-level Similarity Heatmaps
Figure 8 shows additional representation-level similarity heatmaps.

B.1 Effect of Data Used for Similarity Measures
The majority of the experiments reported in the paper use the Penn Treebank for calculating the similarity measures. Here we show that the results are consistent when using a different dataset, namely the Universal Dependencies English Web Treebank (Silveira et al., 2014). We repeat the experiment reported in Section 5.1. The resulting heatmaps, shown in Figure 9, are highly similar to those generated using the Penn Treebank, shown in Figure

D Efficient Fine-tuning
The analysis results showed that lower layers of the models go through limited changes during fine-tuning compared to higher layers. We use this insight to improve the efficiency of the fine-tuning process. In standard fine-tuning, back-propagation is done on the full network. We hypothesize that we can reduce the number of these operations by freezing the lower layers of the model since they are the least affected during the fine-tuning process. We experiment with freezing top and bottom layers of the network during the fine-tuning process. Different from prior work (Raghu et al., 2017; Felbo et al., 2017; Howard and Ruder, 2018), we freeze the selected layers for the complete fine-tuning process in contrast to freezing various layers for a fraction of the training time. We use the default parameter settings provided in the transformers library (Wolf et al., 2019): batch size = 8, learning rate = 5e-5, Adam optimizer with epsilon = 1e-8, and number of epochs = 3. Table 1 presents the results on BERT and XLNet. On all of the tasks except QQP, freezing the bottom layers resulted in better performance than freezing the top layers. One interesting observation is that as we increase the number of frozen bottom layers to six, the performance degrades only marginally while saving a lot more computation. Surprisingly, on SST-2 and QNLI, freezing the bottom six layers resulted in performance better than or equal to that of not freezing any layers, for both models. With the bottom six layers frozen, one can save more than 50% of the back-propagation computation.
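The freezing step itself is small. The sketch below assumes PyTorch-style layers, that is, objects with a `parameters()` method whose items carry a `requires_grad` flag. The `Param` and `Layer` classes are hypothetical stand-ins that only make the sketch runnable without a deep learning library, and the trainable fraction it reports is a rough proxy, not a measurement of back-propagation cost.

```python
class Param:
    """Hypothetical stand-in for a trainable tensor."""
    def __init__(self):
        self.requires_grad = True

class Layer:
    """Hypothetical stand-in for one encoder layer with a few parameters."""
    def __init__(self, n_params=2):
        self._params = [Param() for _ in range(n_params)]

    def parameters(self):
        return iter(self._params)

def freeze_bottom_layers(layers, n_frozen):
    """Exclude the first n_frozen layers from gradient updates."""
    for layer in layers[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False

def trainable_fraction(layers):
    """Share of parameters that still receive gradient updates."""
    params = [p for layer in layers for p in layer.parameters()]
    return sum(p.requires_grad for p in params) / len(params)

# Toy usage: a 12-layer encoder with the bottom six layers frozen for the
# whole fine-tuning run, mirroring the setting discussed above.
encoder_layers = [Layer() for _ in range(12)]
freeze_bottom_layers(encoder_layers, n_frozen=6)
remaining = trainable_fraction(encoder_layers)
```

With real PyTorch modules the same loop would be applied to the model's list of encoder layers before the optimizer is created, so that frozen parameters are never updated.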