Higher-order Comparisons of Sentence Encoder Representations

Representational Similarity Analysis (RSA) is a technique developed by neuroscientists for comparing activity patterns of different measurement modalities (e.g., fMRI, electrophysiology, behavior). As a framework, RSA has several advantages over existing approaches to interpretation of language encoders based on probing or diagnostic classification: namely, it does not require large training samples, is not prone to overfitting, and it enables a more transparent comparison between the representational geometries of different models and modalities. We demonstrate the utility of RSA by establishing a previously unknown correspondence between widely-employed pretrained language encoders and human processing difficulty via eye-tracking data, showcasing its potential in the interpretability toolbox for neural models.


Introduction
Examining the parallels between human and machine learning is a natural way for us to better understand the former and track our progress in the latter. The "black box" aspect of neural networks has recently inspired a large body of work related to interpretability, i.e. understanding of the representations that such models learn. In NLP, this push has been largely motivated by linguistic questions, such as: what linguistic properties are captured by neural networks? and to what extent do decisions made by neural models reflect established linguistic theories? Given the relative recency of such questions, much work in the domain so far has focused on models in isolation (e.g. what does model X learn about linguistic phenomenon Y?). In order to more broadly understand models' representational tendencies, however, it is vital that such questions be formed not only with other models in mind, but also other representational methods and modalities (e.g. behavioral data, fMRI measurements, etc.). With respect to the latter concern, the present-day interpretability toolkit does not yet offer a practical way of making such comparisons.
In this work, we employ Representational Similarity Analysis (RSA) as a simple method of interpreting neural models' representational spaces as they relate to other models and modalities. In particular, we conduct an experiment wherein we investigate the correspondence between human processing difficulty (as reflected by gaze fixation measurements) and the representations induced by popular pretrained language models. In our experiments, we hypothesize that there exists an overlap between the sentences which are difficult for humans to process and those for which per-layer encoder representations are least correlated.
Our intuition is that such sentences may exhibit factors such as low-frequency vocabulary, lexical ambiguity, and syntactic complexity (e.g. multiple embedded clauses) that are uncommon in both standard language and, relatedly, the corpora employed in training large-scale language models. In the case of a human reader, encountering such a sentence may result in a number of processing delays, e.g. longer aggregate gaze duration. In the case of a sentence encoder, an uncommon sentence may lead to a degradation of representations in the encoder's layers, wherein a lower layer might learn to encode vastly different information than a higher one. Similarly, different models' representations may emphasize different aspects of these more complex sentences and therefore diverge from each other. With this in mind, our hypothesis is that sentences which are difficult for humans to process are likely to have divergent representations within models' internal layers and between different models' layers.
While probing and diagnostic classification approaches provide valuable insights into how neural networks process a large variety of phenomena, they rely on decoding accuracy as a probe for encoded linguistic information. This means that, if properly biased, they can detect whether or not information is encoded in a representation. However, they do not allow for a direct comparison of representational structure between models. Consider a toy dataset of five sentences of interest and three encodings derived from quite different processing models: a hidden state of a trained neural language model, a tf-idf weighted bag-of-words representation, and measurements of fixation duration from an eye-tracking device. Probing methods do not allow us to quantify or visualise, for each of these encoding strategies, how the encoder's responses to the five sentences relate to each other. Moreover, probing methods would not directly reveal whether the fixations from the eye-tracking device align more closely with the tf-idf representation or with the states of the neural language model. In short, while probing classifier methods can establish whether phenomena are separable based on the provided representations, they do not tell us about the overall geometry of the representational spaces. RSA, on the other hand, provides a basis for higher-order comparisons between spaces of representations, and a way to visualise and quantify the extent to which they are isomorphic.
Indeed, RSA has seen a modest introduction within interpretable NLP in recent years. For example, Chrupała et al. (2017) employed RSA as a means of correlating encoder representations of speech, text, and images in a post-hoc analysis of a multi-task neural pipeline. Similarly, Bouchacourt and Baroni (2018) used the framework to measure the similarity between input image embeddings and the representations of the same image by an agent in a language game setting. More recently, Chrupała and Alishahi (2019) correlated activation patterns of sentence encoders with symbolic representations, such as syntax trees. Lastly, similar to our work here, Abnar et al. (2019) proposed an extension to RSA that enables the comparison of a single model in the face of isolated, changing parameters, and employed this metric along with RSA to correlate NLP models' and human brains' respective representations of language. We hope to position our work among this brief survey and further demonstrate the flexibility of RSA across several levels of abstraction.

Representational Similarity Analysis
RSA was proposed by Kriegeskorte et al. (2008) as a method of relating the different representational modalities employed in neuroscientific studies. Due to the lack of correspondence between the activity patterns of disparate measurement modalities (e.g. brain activity via fMRI, behavioural responses), RSA aims to abstract away from the activity patterns themselves and instead compute representational dissimilarity matrices (RDMs), which characterize the information carried by a given representation method through dissimilarity structure.
Given a set of representational methods (e.g., pretrained encoders) $M$ and a set of experimental conditions (sentences) $N$, we can construct RDMs for each method in $M$. Each cell in an RDM corresponds to the dissimilarity between the activity patterns associated with a pair of experimental conditions $n_i, n_j \in N$, say, a pair of sentences. When $n_i = n_j$, the dissimilarity between an experimental condition and itself is intuitively 0, thus making the $N \times N$ RDM symmetric about a diagonal of zeros (Kriegeskorte et al., 2008).
The RDMs of the different representational methods in $M$ can then be directly compared in a Representational Similarity Matrix (RSM). This comparison of RDMs is known as second-order analysis, which is broadly based on the idea of a second-order isomorphism (Shepard and Chipman, 1970). In such an analysis, the principal point of comparison is the match between the dissimilarity structure of the different representational methods. Intuitively, this can be expressed through the notion of distance between distances, and is thus related to Earth Mover's Distance (Rubner et al., 2000; see footnote 1). Figure 1 shows an illustration of the first- and second-order analyses for pretrained language encoders. Note that RSA is meaningfully different from, and complementary to, methods that employ saturating functions of representation distances (e.g. decoding accuracy, mutual information), which suffer from (a) a ceiling effect: being able to distinguish experimental phenomenon A from B with an accuracy of 100% and experimental phenomenon C from D with an accuracy of 100% does not mean that the distance between A and B is the same as that between C and D; and (b) discretization (Nili et al., 2014).
We follow Kriegeskorte et al. (2008) in using the correlation distance of experimental condition pairs $n_i, n_j \in N$ as a dissimilarity measure, where $\bar{n}_i$ is the mean of $n_i$'s elements, $\cdot$ is the dot product, and $\lVert \cdot \rVert_2$ is the $\ell_2$ norm:

$$\mathrm{corr}(n_i, n_j) = 1 - \frac{(n_i - \bar{n}_i) \cdot (n_j - \bar{n}_j)}{\lVert n_i - \bar{n}_i \rVert_2 \, \lVert n_j - \bar{n}_j \rVert_2}$$

Compared to other measures, correlation distance is preferable as it normalizes both the mean and variance of activity patterns over experimental conditions. Other popular measures include the Euclidean distance and the Mahalanobis distance (Kriegeskorte et al., 2006).
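As an illustration of the first-order analysis, the following is a minimal sketch of how an RDM could be built from correlation distance. It is written in plain Python for readability rather than speed, and the function names (`correlation_distance`, `build_rdm`) and the toy vectors are our own assumptions, not the code or data used in the experiments.

```python
import math


def correlation_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length activity patterns.

    Assumes neither pattern is constant (otherwise the norm is zero).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cx = [v - mean_x for v in x]  # mean-centred pattern n_i - mean(n_i)
    cy = [v - mean_y for v in y]  # mean-centred pattern n_j - mean(n_j)
    dot = sum(a * b for a, b in zip(cx, cy))
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    return 1.0 - dot / (norm_x * norm_y)


def build_rdm(patterns):
    """Return the symmetric RDM (list of lists) with a diagonal of zeros."""
    n = len(patterns)
    rdm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = correlation_distance(patterns[i], patterns[j])
            rdm[i][j] = rdm[j][i] = d
    return rdm


# Toy, made-up "sentence vectors" (hypothetical, for illustration only).
toy_patterns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first pattern
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first pattern
]
toy_rdm = build_rdm(toy_patterns)
```

In this toy case the first two patterns differ only in scale, so their correlation distance is approximately 0, whereas the first and third are at the maximum distance of approximately 2. This reflects the normalization of mean and variance described above.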

Fixation Duration and Encoder Disagreement
Gaze fixation patterns have been shown to strongly reflect the online cognitive processing demands of human readers (Raney et al., 2014; Ashby et al., 2005) and to be dependent upon a number of linguistic factors (Van Gompel, 2007). Specifically, it has been demonstrated that word frequency, syntactic complexity, and lexical ambiguity play a strong part in determining which sentences are difficult for humans to process (Rayner and Duffy, 1986; Duffy et al., 1988; Levy, 2008).
Footnote 1: More precisely, our measure of dissimilarity between experimental conditions is analogous to ground distance, and dissimilarity between RDMs to earth mover's distance.
Using the RSA framework, we aim to explore how gaze fixation patterns and the linguistic factors associated with sentence processing difficulty relate to the representational spaces of popular language encoders. Namely, we hypothesize that, for a given sentence, disagreement between hidden layers corresponds to processing difficulty. Because layer disagreement for a sentence measures the extent to which two layers (e.g. within BERT) disagree with each other about the pairwise similarity of the sentence (with other sentences in the corpus), a sentence with high layer disagreement will have unstable similarity relationships to other sentences in the corpus. This indicates that it has a degraded encoder representation. Going further, we also hypothesize that models' representations of said sentences may be confounded, in part, by factors that are known to influence humans.
Eye-tracking data. For our experiments, we make use of the Dundee eye-tracking corpus (Kennedy et al., 2003), the English part of which consists of eye-movement data recorded as 10 native participants read 2,368 sentences from 20 newspaper articles. We consider the following fixation features: TOTAL FIXATION DURATION and FIRST PASS DURATION. For each of the features, we first take the average of the measurements recorded for all 10 participants per word, then obtain sentence-level annotations by summing the measurements of all words in a sentence and dividing by its length. The result of this is two vectors $V_{totfix}$ and $V_{firstpass}$ of length 2,368, where each cell in the vector corresponds to a sentence's average total fixation and average first pass duration, respectively.
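The aggregation described above can be sketched as follows. The nested-list input layout, the function name `sentence_level_feature`, and the toy durations are assumptions made for illustration. How skipped words or missing measurements are treated is not specified here, so the sketch simply assumes one value per participant per word.

```python
def sentence_level_feature(word_measurements):
    """Aggregate a fixation feature to the sentence level.

    word_measurements holds one list per word, each containing one
    measurement per participant (hypothetical layout).
    """
    # Step 1: average over participants for each word.
    per_word_means = [sum(vals) / len(vals) for vals in word_measurements]
    # Step 2: sum over the words and divide by sentence length.
    return sum(per_word_means) / len(per_word_means)


# Made-up durations in milliseconds: two sentences, two participants.
toy_corpus = [
    [[200.0, 220.0], [180.0, 200.0], [300.0, 340.0]],  # three-word sentence
    [[150.0, 170.0], [250.0, 230.0]],                  # two-word sentence
]
toy_v_totfix = [sentence_level_feature(sentence) for sentence in toy_corpus]
```

Applied to the full corpus, the same procedure would yield one value per sentence, i.e. a vector of length 2,368 for each fixation feature.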
Syntactic complexity, word frequency, and lexical ambiguity. We also consider the following three linguistic features, which affect processing difficulty. For each of them, the result is again a vector of length 2,368 in which each cell corresponds to a sentence:
a. the average word log frequency per sentence, extracted from the British National Corpus (Leech, 1992), $V_{logFreq}$;
b. the average number of senses per word per sentence, extracted from WordNet (Miller, 1995), $V_{wordSense}$;
c. Yngve scores, a standard measure of syntactic complexity based on cognitive load (Yngve, 1960), $V_{Yngve}$.

Pretrained encoders
We conduct our analysis on pretrained BERT-large (Devlin et al., 2018) and ELMo (Peters et al., 2018), two widely employed contextual sentence encoders. To obtain a representation of a sentence from a given layer L, we perform mean-pooling over the time-steps which correspond to the words of the sentence, obtaining a single vector representation. Mean-pooling is a common approach for obtaining vector representations of sentences for downstream tasks (Peters et al., 2018; Conneau et al., 2017b). We refer to ELMo's lowest layer as E1, BERT's 11th layer as B11, etc.
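Mean-pooling itself is simple enough to sketch directly. The snippet below assumes that a layer's output for one sentence is already available as a list of per-word vectors. How such vectors are extracted from BERT or ELMo (including any handling of subword pieces) is not covered, and the name `mean_pool` and the toy values are hypothetical.

```python
def mean_pool(token_vectors):
    """Average per-token vectors (all of equal dimension) into one vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


# Made-up layer output for a three-word sentence with 2-dimensional states.
toy_layer_output = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
toy_sentence_vector = mean_pool(toy_layer_output)
```

The pooled vector has the same dimensionality as the layer's hidden states regardless of sentence length, which is what allows sentences of different lengths to be compared in a single RDM.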
RDMs. We construct an RDM (see §2) for each of the contextual encoders' layers. Each RDM is a 2,368 × 2,368 matrix which represents the dissimilarity structure of the layer (i.e., each row vector in the matrix contains the dissimilarity of a given sentence to every other sentence). We then compute the correlations between pairs of RDMs. To evaluate how well the representational geometry of one layer correlates with that of another, we employ Kendall's $\tau_A$ as suggested in Nili et al. (2014), computing the correlation between each pair of corresponding rows in the two RDMs. This second-order analysis gives us a pairwise relational similarity vector $V_{Corr_{L_i-L_j}}$ of length 2,368, which contains, for each sentence, the correlation between the RDMs of two layers $L_i$ and $L_j$.
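The row-wise second-order comparison can be sketched as below. Kendall's $\tau_A$ is written out explicitly as (concordant pairs minus discordant pairs) divided by the total number of pairs, with no correction for ties. Excluding each sentence's zero self-dissimilarity from its row is our assumption, as are the function names and the toy RDMs.

```python
def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / (n * (n - 1) / 2)."""
    n = len(x)
    score = 0
    for a in range(n):
        for b in range(a + 1, n):
            prod = (x[a] - x[b]) * (y[a] - y[b])
            if prod > 0:
                score += 1  # concordant pair
            elif prod < 0:
                score -= 1  # discordant pair (tied pairs contribute 0)
    return score / (n * (n - 1) / 2)


def rowwise_rdm_correlation(rdm_a, rdm_b):
    """Correlate corresponding rows of two RDMs, one value per sentence."""
    n = len(rdm_a)
    result = []
    for i in range(n):
        # Assumption: drop the diagonal entry (self-dissimilarity of 0).
        row_a = [rdm_a[i][j] for j in range(n) if j != i]
        row_b = [rdm_b[i][j] for j in range(n) if j != i]
        result.append(kendall_tau_a(row_a, row_b))
    return result


# Made-up 4 x 4 RDMs for two hypothetical layers.
toy_rdm_a = [[0.0, 0.1, 0.5, 0.9], [0.1, 0.0, 0.4, 0.8],
             [0.5, 0.4, 0.0, 0.3], [0.9, 0.8, 0.3, 0.0]]
toy_rdm_b = [[0.0, 0.2, 0.6, 0.7], [0.2, 0.0, 0.9, 0.5],
             [0.6, 0.9, 0.0, 0.1], [0.7, 0.5, 0.1, 0.0]]
toy_v_corr = rowwise_rdm_correlation(toy_rdm_a, toy_rdm_b)
toy_disagreement = [1.0 - c for c in toy_v_corr]
```

In this toy example the two layers rank the neighbours of the first and last sentences identically, while they partly disagree about the second and third, which therefore receive lower correlation values and higher disagreement.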
Third-order analysis. The final part of our analysis involves computing correlations (Spearman's ρ) of {$V_{Corr_{L_i-L_j}}$, $V_{logFreq}$, $V_{Yngve}$, $V_{wordSense}$} with each of $V_{totfix}$ and $V_{firstpass}$. The results from this are shown in Table 1.
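For completeness, the third-order step can be sketched as a Spearman correlation between two sentence-level vectors, computed here as the Pearson correlation of average ranks. Significance testing and the Bonferroni correction are not included. The helper names and the toy values are hypothetical and do not reproduce any reported result.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cx = [v - mean_x for v in x]
    cy = [v - mean_y for v in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: layer correlation falls as fixation time rises.
toy_layer_corr = [0.9, 0.7, 0.4, 0.2]
toy_fixation = [180.0, 210.0, 260.0, 300.0]
toy_rho = spearman_rho(toy_layer_corr, toy_fixation)
```

Because the toy vectors are perfectly inversely ordered, the resulting coefficient is at the negative extreme. Real data would of course give intermediate values.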

Discussion
Our results show highly significant negative correlations between $V_{Corr_{L_i-L_j}}$ and sentence gaze fixation times. These findings confirm the hypothesis that the sentences which are most challenging for humans to process are the sentences that (a) the layers of BERT disagree on most among themselves; and (b) ELMo and BERT disagree on most, indicating that there may be common factors which affect human processing difficulty and result in disagreement between layers. By layer disagreement we refer to the expression $1 - V_{Corr_{L_i-L_j}}$. It is important to note that these encoders are trained with a language modelling objective, unlike models where reading behaviour is explicitly modelled (Hahn and Keller, 2016) or predicted (Matthies and Søgaard, 2013). Indeed, the similarities here emerge naturally as a function of the task being performed. This can be seen as analogous to the case of similarities observed between neural networks trained to perform object recognition and spatio-temporal cortical dynamics (Cichy et al., 2016).
Table 1: Spearman's ρ between $V_{Corr_{L_i-L_j}}$, $V_{logFreq}$, $V_{wordSense}$, $V_{Yngve}$ and each of $V_{totfix}$ and $V_{firstpass}$. All correlations are significant with p < 0.0001 after Bonferroni correction unless marked with *.

Syntactic complexity
Figure 2 shows that, for all combinations of BERT layers, total fixation time and Yngve scores have strong negative and positive correlations (respectively) with layer disagreement. Furthermore, we observe that disagreement between middle layers seems to show the strongest correlation with Yngve scores. To confirm this, we split the correlations into four groups: "low" ($i, j \in [1, 8]$), "middle" ($i, j \in [9, 16]$), "high" ($i, j \in [17, 24]$), and "out" ($|i - j| > 7$), with the latter representing out-of-group correlations (e.g. $Corr_{L_1-L_{24}}$). To account for the possibility that correlations between the disagreement of adjacent layers (e.g. $|i - j| = 1$) and Yngve scores are higher (a possible confounding factor), we also distinguish layer pairs as either "adjacent" or "non-adjacent". Considering these two factors as four- and two-level independent variables, respectively, we conduct a two-way analysis of variance. The analysis reveals that the effect of group is significant at F(3, 275) = 78.47, p < 0.0001, with "low" (µ = 0.65, σ = 0.08), "middle" (µ = 0.84, σ = 0.03), "high" (µ = 0.80, σ = 0.05), and "out" (µ = 0.80, σ = 0.05). Neither the effect of adjacency nor its interaction with group proved to be significant. This can be seen as (modest) support for the findings of previous work (Blevins et al., 2018; Tenney et al., 2019): namely, that the intermediate layers of neural language models encode the most syntax, and are therefore possibly more sensitive to syntactic complexity. A very similar pattern is observed for total fixation time. When considered together with the correlation between $V_{Yngve}$ and fixation times, this indicates a tripartite affinity between layer disagreement, syntactic complexity, and fixation.
Lexical Ambiguity and Word Frequency. Finally, we observe that $V_{logFreq}$ has a moderate correlation with both fixation time and layer disagreement and that $V_{wordSense}$ is nearly uncorrelated with both. Detailed plots of the latter can be found in Appendix A.

Conclusion
We presented RSA as a framework for analyzing neural network representations, which allowed us to relate human sentence processing data to language encoder representations. In experiments conducted on two widely used encoders, our findings show that sentences which are difficult for humans to process have more divergent representations both within an encoder and between different encoders. Furthermore, we lend modest support to the intuition that a model's middle layers encode comparatively more syntax. Our framework offers insight that is complementary to decoding or probing approaches, and is particularly useful for comparing representations across modalities.