Representation and Pre-Activation of Lexical-Semantic Knowledge in Neural Language Models

In this paper, we perform a systematic analysis of how closely the intermediate layers of LSTM and transformer language models correspond to human semantic knowledge. Furthermore, in order to make more meaningful comparisons with theories of human language comprehension in psycholinguistics, we focus on two key stages where the meaning of a particular target word may arise: immediately before the word’s presentation to the model (comparable to forward inferencing), and immediately after the word token has been input into the network. Our results indicate that the transformer models are better at capturing semantic knowledge relating to lexical concepts, both during word prediction and when retention is required.


Introduction
A wide variety of Natural Language Processing (NLP) tasks have been improved dramatically by the introduction of LSTM (Hochreiter and Schmidhuber, 1997) and transformer-based (Vaswani et al., 2017) neural language models, which can encode the meanings of sentences in such a way that facilitates a range of language tasks (Bengio et al., 2003;Peters et al., 2018;Radford et al., 2018;Dai et al., 2019). Furthermore, both recurrent and transformer networks have been shown to capture a broad range of semantic phenomena and syntactic structure (Dyer et al., 2016;Linzen et al., 2016;Bernardy and Lappin, 2017;Gulordava et al., 2018;Marvin and Linzen, 2018;Lin et al., 2019;Liu et al., 2019;Hewitt and Manning, 2019;Tenney et al., 2019a). Although such models clearly learn aspects of lexical semantics, it remains unclear whether and how these networks capture semantic features associated with conceptual meaning. Some work has demonstrated that word embeddings do reflect conceptual knowledge captured by property norming studies (Rubinstein et al., 2015;Collell and Moens, 2016;Lucy and Gauthier, 2017;Derby et al., 2018), in which human participants produce verbalisable properties for concepts, such as is green or is an amphibian for concepts such as FROG (McRae et al., 2005;Devereux et al., 2014). Such features correspond to stereotypic tacit assumptions (Prince, 1978); common-sense knowledge we have about the real world. There is some evidence that language models implicitly encode such knowledge (Da and Kusai, 2019;Weir et al., 2020); however, coverage of different types of knowledge may be inconsistent, with evidence to suggest that these models fail to capture some types of semantic knowledge such as visual perceptual information (Sommerauer and Fokkens, 2018;Sommerauer, 2020), as well as questions about the completeness of such empirical studies (Fagarasan et al., 2015;Bulat et al., 2016;Silberer, 2017;Derby et al., 2019). 
In general, there has been only limited work that attempts to investigate whether these neural language models activate lexico-semantic knowledge similarly to humans, further restricted by the fact that such knowledge probing is only performed on latent representations that have received the target concept, ignoring theories of language comprehension and acquisition that emphasise the importance of prediction (Graesser et al., 1994;Dell and Chang, 2014;Kuperberg and Jaeger, 2016).
In this paper, we contribute to the analysis of neural language models by evaluating latent semantic knowledge present in the activation patterns extracted from their intermediate layers. By performing a layer-by-layer analysis, we can uncover how the network composes such meaning as the information propagates through the network, eventually emerging as a rich representation of semantic features that facilitates conditional next word prediction, which is directly dependent on the past knowledge. We perform our layer probing analysis at two points in time. That is, we investigate the hidden layer activations of the language models both before the concept word occurs (which facilitates next word prediction), and after the concept word has been explicitly given to the model. In this way, we determine how richly these latent representations capture real-world perceptual and encyclopaedic knowledge commonly associated with human conceptual meaning.

Related Work
The recent popularity of interpretability in NLP has resulted in strong progress on understanding both recurrent (Alishahi et al., 2019) and transformer-based networks (Rogers et al., 2020). A number of these studies rely on probing techniques, where supervised models are trained to predict specific linguistic phenomena from model activations (Adi et al., 2016;Wallace et al., 2019;Tenney et al., 2019b;Hewitt and Liang, 2019).
There exists some work that analyses semantic knowledge in such networks, though to date this has been more limited than investigations of syntax. Koppula et al. (2018) focused on the recurrent layers of LSTM and GRU networks and attempted to interpret their semantic content by using a set of decoders to predict the previous network inputs. Ettinger (2020) devised a set of psycholinguistic diagnostic tasks to evaluate language understanding in BERT, demonstrating that some phenomena such as semantic role labelling and event knowledge are well-inferred, though others such as negation are less so. Similar to our work, Ethayarajh (2019) mined sentences with words in context to demonstrate that context representations are highly anisotropic, while Bommasani et al. (2020) built static word embeddings from contextual representations using pooling methods, analysing their performance on semantic similarity benchmarks.
Language models have also been successfully employed for predicting activation patterns in the brain during human language comprehension (Jain and Huth, 2018;Toneva and Wehbe, 2019). Such work is particularly relevant from the perspective of predictive coding theories of human language comprehension (Kuperberg and Jaeger, 2016), which posit that high-level representations of an unfolding utterance facilitate active prediction of subsequent lexical content in the sentence. Neurolinguistic studies provide evidence that such predictions can be of wordform identity (DeLong et al., 2005), or of the semantic features that are expected for the upcoming word (for example, whether the upcoming word is animate or not; Wang et al., 2020).

Neural Language Models
For reasons of compatibility, we limit our investigation to left-to-right language models that are trained to perform conditional next word prediction, as other state-of-the-art models such as BERT (Devlin et al., 2018) do not satisfy this criterion, which is what allows comparison with similar predictive mechanisms in human language comprehension. For the LSTM-based network, we make use of a very large-scale and influential neural language model developed by Jozefowicz et al. (2016), which we refer to as JLM. The model's architecture consists of character-level embeddings with CNNs, followed by a two-layer LSTM with projection layers to reduce dimensionality and a final linear layer with softmax activation. The vocabulary of the output layer consists of 800,000 words, and the model is trained on the One Billion Word corpus (Chelba et al., 2013). For the transformer-based model, we make use of the GPT-2 (345M) model (Radford et al., 2019), which consists of 24 multi-head attention layers.

De-Contextualising Representations
There are several problems that emerge when looking to compare concrete conceptual representations of meaning with these neural layer activations. The first is that representations from these latent layers are highly contextualised, which may make it difficult to recover semantic information about a particular concept. The second problem is that recovering a pre-target representation is challenging since it requires contextual information to be supplied to the network before the target word occurs. For our work, we follow a similar approach to Bommasani et al. (2020), and mine a number of sentences from a corpus of text where each target word occurs and then extract representations from each layer of the network before and after the words are presented. For this, we choose a predefined set of target words which are based on the overlap of words in the JLM vocabulary and several intrinsic evaluation benchmarks which are employed in the analyses below. We then sample the training corpus for up to 500 sentences for each target, selecting sentences in which the target word occurred in any position except the start of the sentence. By analysing how the representations perform on the semantic benchmarks, we can infer how these language models compose meaning over the layers of the network.

Feature Pooling
To construct these decontextualized representations, we first compute a hidden state from each of our sentences, and then aggregate them into a single static vector, both at the position of the target word and immediately before. More formally, for each word $w \in W$, where $W$ is our lexicon, we retrieve a set of $K$ sentences $\{S_1, S_2, \ldots, S_K\}$ from the corpus with corresponding timepoints $T = \{t_1, t_2, \ldots, t_K\}$ denoting the position of the word $w$ in the sentences, such that $S_i[t_i] = w$ for $1 \le i \le K$. Let $f_L$ be the function that maps each sentence fragment to a contextual representation from the model $f$ for each layer $L$ in the network. We construct our word-level representation before and after the word $w$ occurs at layer $L$ by averaging over the $K$ sentences:
$$\mathbf{w}^{L}_{\text{before}} = \frac{1}{K} \sum_{i=1}^{K} f_L(S_i[1 : t_i - 1]), \qquad \mathbf{w}^{L}_{\text{after}} = \frac{1}{K} \sum_{i=1}^{K} f_L(S_i[1 : t_i]),$$
where $S_i[1 : t]$ denotes the fragment of sentence $S_i$ up to and including position $t$. This gives us two sets of word embedding vectors for each layer in each network, one set built from activations immediately before the target words and one built from activations immediately after the target words. Since the context differs depending on the sentence, the aggregation performed in the calculations above should preserve only the information associated with the target word. As the model is tasked with predicting the word $w$, the vectors from the before timestep should contain some semantic information relevant to the target word, even if the word has not been explicitly given to the network.
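The pooling step can be illustrated with a minimal Python sketch. It assumes mean pooling over sentences and uses a hypothetical `encode` callable as a stand-in for the layer function f_L; the names `mean_pool`, `pooled_representations`, and `toy_encode` are illustrative only and do not come from the original implementation.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def pooled_representations(
    sentences: Sequence[Sequence[str]],
    positions: Sequence[int],
    encode: Callable[[Sequence[str]], Vector],
) -> Tuple[Vector, Vector]:
    """Return (before, after) static vectors for one target word and one layer.

    positions[i] is the 0-based index of the target word in sentences[i] and
    must be at least 1, because sentence-initial targets are excluded.
    encode(prefix) is a hypothetical stand-in for f_L: it returns the hidden
    state at the last token of the given sentence fragment.
    """
    # Fragment that stops just before the target word.
    before_states = [encode(s[:t]) for s, t in zip(sentences, positions)]
    # Fragment that ends with the target word itself.
    after_states = [encode(s[: t + 1]) for s, t in zip(sentences, positions)]
    return mean_pool(before_states), mean_pool(after_states)


def toy_encode(prefix: Sequence[str]) -> Vector:
    """Toy encoder for demonstration only: [fragment length, count of 'frog']."""
    return [float(len(prefix)), float(list(prefix).count("frog"))]


sentences = [["the", "frog", "jumped"], ["a", "green", "frog"]]
before, after = pooled_representations(sentences, [1, 2], toy_encode)
```

With a real model, `encode` would run the network on the fragment and read out the activations of one layer; the toy encoder is there only to keep the sketch self-contained.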
In the case of GPT-2, input tokens are determined using byte pair encodings, so a given word may correspond to several input units. For target words that are split into several such units, we average the representation over all of these positions for the after representations. For the before representations, we take the token immediately before the target word. In the results that follow, we refer to the two sets of embedding vectors for language model M and layer L using the naming convention M[L]-before and M[L]-after. For example, for GPT-2, the word vectors for the fifth multi-head attention layer just before the target word is presented to the network would be GPT2[5]-before.
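As a rough illustration of the sub-word handling, the sketch below takes the per-token hidden states of one sentence at one layer, together with the span of sub-word positions that make up the target word. The function name and the `start`/`end` span convention are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def subword_before_after(
    hidden_states: Sequence[Vector], start: int, end: int
) -> Tuple[Vector, Vector]:
    """Return (before, after) vectors for a target word spanning [start, end).

    hidden_states[i] is the layer activation at sub-word position i (0-based).
    The before vector is the state of the token just before the target word;
    the after vector is the mean over all sub-word units of the target word.
    """
    if start < 1 or end <= start or end > len(hidden_states):
        raise ValueError("target span must be non-empty and not sentence-initial")
    before = list(hidden_states[start - 1])
    span = hidden_states[start:end]
    after = [sum(column) / len(span) for column in zip(*span)]
    return before, after


# Four sub-word positions; the target word occupies positions 2 and 3.
states = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
before_vec, after_vec = subword_before_after(states, start=2, end=4)
```

The resulting per-sentence vectors would then be averaged across sentences in the same way as for single-token words.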
Note that while LSTMs accumulate a representation of the unfolding utterance at each timestep, this is not entirely true for transformers, which directly combine information from all previous words in the sequence at every layer of the network, guided by attention. In our work, we only care about how the semantic information of the network evolves when it must predict the target word and immediately after.

Evaluation Tasks
For our empirical analysis, we first evaluate these layers on classic intrinsic benchmarks that measure their ability to explain human semantic judgement scores on word association, in order to determine how well these networks capture the semantic content of the word. We then probe these layers to determine whether they capture a rich set of semantic features related to upcoming concepts and whether such representations are retained by the network for functional use on the prediction task.

Semantic Similarity Benchmarks
Semantic similarity benchmarks, where a set of word pairs are scored by human annotators based on how similar they are, can be used to determine how correlated word pair distances from a set of embedding vectors are with human judgements of similarity for the same words. For the embedding vectors (from each network and network layer), cosine similarity can be used to determine how similar the word vectors are, and these cosine similarities can then be compared with the human judgements using Spearman correlation. Of course, the notion of similarity that informs human judgements is highly dependent on a number of factors such as context, the stimulus set of word pairs, and the instructions given to the human raters (Batchkarov et al., 2016). For this reason, we make use of a number of benchmarks which can be partitioned into two types of relationships, known as semantic similarity and semantic relatedness. For semantic relatedness, we use WordSim353-rel (Agirre et al., 2009) and MEN (Bruni et al., 2012), where a high score between word pairs indicates a greater chance of occurring in the same sentence with some syntactic relation (for example "coffee" and "cup"). For semantic similarity, we use WordSim353-sim (Agirre et al., 2009) and SimLex999 (Hill et al., 2015), where a high score between word pairs indicates a high overlap in semantic attributes or replaceability in a sentence (for example "coffee" and "tea"). Though it does not clearly fall into either the similarity or relatedness categories, we also include the original version of the WordSim judgements, WordSim353 (Finkelstein et al., 2001). Evaluations were performed using the Vecto-ai Python package (Rogers et al., 2018).
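To make the evaluation concrete, the following sketch computes the Spearman correlation between cosine similarities and human scores using only the standard library. It is not the Vecto-ai implementation; the function names, the toy embeddings, and the choice to skip pairs with a missing word are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def evaluate_benchmark(
    embeddings: Dict[str, Sequence[float]],
    benchmark: Sequence[Tuple[str, str, float]],
) -> float:
    """Spearman correlation between model cosine similarities and human scores.

    Assumption: pairs containing a word without an embedding are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in benchmark:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(model_scores), ranks(human_scores))


# Toy data, invented for this example.
toy_embeddings = {"coffee": [1.0, 0.0], "tea": [0.9, 0.1], "cup": [0.5, 0.5], "car": [0.0, 1.0]}
toy_benchmark = [("coffee", "tea", 9.0), ("coffee", "cup", 6.0), ("coffee", "car", 1.0)]
rho = evaluate_benchmark(toy_embeddings, toy_benchmark)
```

In the toy data the model ordering of the three pairs matches the human ordering, so the rank correlation is at its maximum.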

Neural Activation Similarity
As an extension to these results, we also evaluate how reliable the vector representations from each layer of the networks are in terms of their ability to predict brain imaging data gathered from participants viewing a set of concept words. In this analysis, we use BrainBench (Xu et al., 2016), a semantic evaluation platform that includes fMRI and MEG neuroimaging data from humans for 60 concept words. This benchmark evaluates how well the semantic models can make predictions about the patterns of neural activations observed in the human participants. For a set of words $V$, we calculate two pairwise word correlation matrices $M_D, M_B \in \mathbb{R}^{|V| \times |V|}$ for a distributional semantic model ($D$) and the brain imaging data ($B$). We then perform a 2 vs. 2 test between $M_D$ and $M_B$, where, for all pairs of words $w_1, w_2 \in V$, we count how often the similarity structure observed for $D$ agrees with $B$, i.e. how often
$$r(M_D(w_1), M_B(w_1)) + r(M_D(w_2), M_B(w_2)) > r(M_D(w_1), M_B(w_2)) + r(M_D(w_2), M_B(w_1)),$$
where $r$ is Pearson's correlation and $M(w_1)$ and $M(w_2)$ denote the rows of values corresponding to the concepts $w_1$ and $w_2$, omitting the columns that correspond to the correlation between $w_1$ and $w_2$. The final score is the proportion of positive cases across all word pairs, with 0.5 indicating chance. Intuitively, this is a measure of how well the similarity profile of the semantic model matches the similarity profile of the brain data.
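A small sketch of the 2 vs. 2 test follows. It assumes the usual form of the test, in which a word pair counts as correct when the summed correlation of the matched rows (model row for w1 with brain row for w1, and likewise for w2) exceeds that of the swapped rows. This is not the BrainBench code; the function names and the toy matrix are invented for illustration.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def two_vs_two(m_d: Sequence[Sequence[float]], m_b: Sequence[Sequence[float]]) -> float:
    """Proportion of word pairs for which the matched rows correlate better
    than the swapped rows. Needs at least five words so that each row keeps
    three or more values once the two paired columns are dropped."""
    n = len(m_d)
    correct = 0
    total = 0
    for i, j in combinations(range(n), 2):
        # Omit the columns belonging to the two words under comparison.
        keep = [k for k in range(n) if k not in (i, j)]
        d_i = [m_d[i][k] for k in keep]
        d_j = [m_d[j][k] for k in keep]
        b_i = [m_b[i][k] for k in keep]
        b_j = [m_b[j][k] for k in keep]
        matched = pearson(d_i, b_i) + pearson(d_j, b_j)
        swapped = pearson(d_i, b_j) + pearson(d_j, b_i)
        correct += int(matched > swapped)
        total += 1
    return correct / total


# Toy symmetric similarity matrix for five words (values invented).
toy = [
    [1.0, 0.8, 0.1, 0.3, 0.5],
    [0.8, 1.0, 0.2, 0.7, 0.1],
    [0.1, 0.2, 1.0, 0.6, 0.9],
    [0.3, 0.7, 0.6, 1.0, 0.4],
    [0.5, 0.1, 0.9, 0.4, 1.0],
]
score_identical = two_vs_two(toy, toy)
```

Comparing a matrix with itself gives the ceiling of the measure, which is a convenient sanity check before running the test on real model and brain data.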

Human Property Knowledge
Next, we determine how well the embedding vectors for each network and layer capture commonsense aspects of meaning reflected in conceptual models from cognitive psychology. We achieve this by using probes to determine whether explicit lexico-semantic knowledge from human-derived property norms can be reliably decoded from these embeddings. For example, for the concept APPLE, can we predict from the embedding vector whether human-elicited properties of that concept such as is-round or grows-on-trees are true? For this analysis, we make use of a dataset of human-elicited property knowledge (the CSLB norms; Devereux et al., 2014), which lists semantic properties for 638 concept words. These semantic properties are partitioned into five distinct categories, which characterise the different types of information they represent: visual (e.g. is-green; is-round), functional (e.g. is-eaten; used-for-cutting), taxonomic (e.g. is-a-fruit; is-a-tool), encyclopedic (e.g. has-vitamins; uses-fuel), and other-perceptual (e.g. is-tasty; is-loud). While property norming studies provide an insight into the types of information characterised by human conceptual representations, supported by human agreement on feature attributes, it should be noted that they are not a literal description of human lexical-semantic representation (Barsalou, 2003).

Probing methodology
For the probing analysis, we fit a number of L2-regularised logistic regression models, in order to predict whether or not a semantic feature is decodable from our embedding vectors, largely following previous work (Collell and Moens, 2016;Lucy and Gauthier, 2017;Derby et al., 2018). Due to the small sample size, each model uses class weight balancing and decodability is scored using the F1 score over 5 cross-validation folds. More specifically, we preprocess the CSLB dataset to exclude features occurring for fewer than five words. For each feature, we then partition the concepts into five folds using stratified sampling and perform 5-fold cross-validation on each feature.
Due to the high likelihood of overfitting, we also regularise each logistic regression by adding λ times the L2 norm of the coefficient weights to the loss, where λ is a scaling parameter. Since we want to predict each individual property, we determine what value of λ to use by first performing 5-fold cross-validation for each property over a range of potential values, and choosing the best for each feature.
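One way to write such an objective is given below, as a sketch rather than the exact loss used in this work. It assumes the common squared form of the L2 penalty and per-class weights $c_0, c_1$ for the class balancing; the symbols $N$, $\mathbf{x}_i$, $y_i$, $\boldsymbol{\beta}$, $b$ and $c$ are introduced here for illustration only.

```latex
\mathcal{L}(\boldsymbol{\beta}, b)
  = -\sum_{i=1}^{N} c_{y_i} \Big[ y_i \log \sigma(z_i)
      + (1 - y_i) \log\big(1 - \sigma(z_i)\big) \Big]
    + \lambda \lVert \boldsymbol{\beta} \rVert_2^2,
\qquad z_i = \boldsymbol{\beta}^{\top} \mathbf{x}_i + b
```

Here $\mathbf{x}_i$ is the embedding vector of concept $i$, $y_i \in \{0, 1\}$ indicates whether the property holds for that concept, and $\sigma$ is the logistic function. Larger values of $\lambda$ shrink the coefficients more strongly.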
To calculate a decodability score for each feature, we run 5-fold cross-validation using the best λ value for each feature, for which we obtain the final F1 score on the predictions from the test folds. Furthermore, we repeat this cross-validation process three times and take the average score over each run. We note that just because a linear model does not predict the presence of a property does not mean that it is not encoded in the representation (Collell and Moens, 2016). Nevertheless, linear read-out from model activation patterns (and brain activation patterns) remains a useful tool for determining the presence of high-level information such as linguistic structure in those representations (Hewitt and Liang, 2019).
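The scoring procedure can be sketched in plain Python as follows. The classifier is passed in as a hypothetical `fit_predict` callable (for example, a wrapper around an L2-regularised logistic regression with a fixed λ), so the sketch covers only the stratified folds, the F1 computation over the held-out predictions, and the averaging over repeated runs. Computing a single F1 over the pooled test-fold predictions is an assumption about how the fold scores are combined.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
FitPredict = Callable[[List[Vector], List[int], List[Vector]], List[int]]


def stratified_folds(labels: Sequence[int], n_folds: int = 5, seed: int = 0) -> List[int]:
    """Assign a fold index to every item, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for rank, i in enumerate(members):
            folds[i] = rank % n_folds
    return folds


def f1_score(truth: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_validated_f1(
    vectors: Sequence[Vector],
    labels: Sequence[int],
    fit_predict: FitPredict,
    n_folds: int = 5,
    repeats: int = 3,
) -> float:
    """Mean F1 over repeated stratified cross-validation runs for one feature."""
    scores = []
    for repeat in range(repeats):
        folds = stratified_folds(labels, n_folds, seed=repeat)
        pred = [0] * len(labels)
        for k in range(n_folds):
            train = [i for i, f in enumerate(folds) if f != k]
            test = [i for i, f in enumerate(folds) if f == k]
            out = fit_predict(
                [vectors[i] for i in train],
                [labels[i] for i in train],
                [vectors[i] for i in test],
            )
            for i, p in zip(test, out):
                pred[i] = p
        # F1 over the held-out predictions collected from all folds.
        scores.append(f1_score(labels, pred))
    return sum(scores) / len(scores)
```

Selecting λ would wrap this routine in an outer loop over candidate values, keeping the value with the best cross-validated score for each feature.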

Semantic Similarity Benchmarks
The similarity benchmark results are displayed in Table 1 and Figure 1. For both JLM and GPT-2, the word vector representations computed after the target word has been presented as an input token to the model perform better in comparison to when the network must predict the target word (the before representations). This result is not surprising, since in the after scenario the models have access to the target word itself. Nevertheless, we still see high correlations for the before representations for most models and layers, indicating that the representational state of the language models immediately before the target word reflects semantic content of the to-be-predicted word. GPT-2 produces the strongest correlations with human similarity judgements overall (particularly in earlier layers of the after representations; Fig. 1B). Interestingly, JLM outperforms GPT-2 in how accurately it predicts the brain data, perhaps due to a more cognitively plausible neural architecture that incrementally integrates information over the course of a sentence.
Figure 2: Results (accuracy %) on BrainBench for the MEG and fMRI data for each before (on the left) and after (on the right) word embedding model for each of the 24 layers of GPT-2. Scores are measured using accuracy from a 2 vs. 2 test, with a score of 0.5 indicating random chance (see text).
Focusing on the before representations, we see that the JLM-before semantic representations tend to perform better than the GPT2-before representations. This is likely because the LSTM is directly trained on the sampled sentences, which produces a lower perplexity measure than the transformer network, and thus it yields more accurate predictions about the target word. Comparing the before representations from different layers in each model, we see that JLM better represents semantic information in the second of its two layers, while for GPT-2 the results are more complex, though later layers are generally better, with the second last layer (23) being best for most evaluations. For both models, then, the upper layers tend to have the best overall semantic representations of the upcoming target word, which follows from the fact that the upper layers directly feed into predictions about the upcoming word in the language modelling task, with the models reflecting the predicted semantic content of that word.
When the target word is available to the model (the after representations), we would expect the network to represent meaningful information about the concept, which is why this approach is the most common method for building contextual representations. Our results support this notion, since the after representations consistently outperform the before representations, on both the word similarity and brain imaging data (see Fig. 2 for the GPT-2 BrainBench results). Notably, JLM[1]-after outperforms JLM[2]-after, since the activation patterns from the second layer should aim to predict the next word in the sequence (i.e. the word following the target word). Similarly, the GPT2-after representations retain semantic information of the word quite well for all but the final layer, with early layers performing well in the semantic similarity evaluations (Fig. 1B and Fig. 2B). GPT2[24]-after experiences a dramatic loss in performance, similar to what is observed for the JLM[2]-after representations.
Overall, this pattern of results supports the hypothesis that later layers of the language models best reflect semantic information about the to-be-predicted word, whilst earlier layers best reflect semantic information about the just-presented word, though all layers in both models reflect this information to some extent. In the next section, we investigate in more detail the specific kinds of semantic knowledge that are available in different layers of the models.

Semantic Feature Decoding
The results on the property decoding task are presented in Table 2 and Figure 3. Overall, we see that the GPT-2 layers encode more information about common sense property knowledge than the JLM layers, particularly in the after representations.
Focusing on the before representations, we see that GPT-2 tends to capture more knowledge about conceptual properties than JLM. Most notably, compared to JLM, the GPT-2 model does better at encoding knowledge related to attributive properties (i.e. non-taxonomic properties), which tend to be much more difficult to capture (Rubinstein et al., 2015). Both models show better property decoding performance in the later before layers. As these properties are related to conceptual knowledge plausibly associated with the upcoming word, it makes sense that the embedding vectors converge on some particular space related to the semantic restrictions on the upcoming word, which is particularly reflected in the case of taxonomic properties.
Figure 3: Average cross-validation F1 scores (×100) for the before (on the left) and after (on the right) transformer-based representations from each layer of GPT-2.

Turning to the after representations, we see that property knowledge seems to be best reflected in the upper layers of both language models. This is a particularly interesting result, as previous work has demonstrated that the lower layers contain more explicit information relating to the target word such as part-of-speech (Peters et al., 2018) and word association (see Section 5.1). Furthermore, while the JLM-after and GPT2-after representations perform similarly when predicting taxonomic features, GPT-2 does much better at capturing perceptual, functional, and encyclopedic knowledge. The results indicate that the GPT-2 representations appear to narrow the gap between taxonomic and attributive properties, which distributional models have historically struggled to accomplish. Finally, the network seems to retain and improve performance as we move through the layers.

Last Layer Performance
First, we wish to discuss why there is a consistent loss in performance from the representations constructed from the final layer of the network, which is notable given the widespread use of the final layer for transfer learning. To better understand the results for the GPT2-before embedding vectors on our evaluation tasks, consider the work of Ethayarajh (2019), who demonstrated that the layers of GPT-2 become more context-specific as we move through the network, more so than LSTM-based networks such as ELMo. In particular, Ethayarajh (2019) investigated intra-sentence similarity, which measures the average cosine similarity between the individual word representations and the sentence representation. In that work, sentence representations were constructed by averaging over the hidden states from all time steps in the sentence, which is similar to the before representations (averaging the vectors across sentences given the target word's position). The results showed that, when adjusted for anisotropy, the intra-sentence similarity of GPT-2 tends to decrease until layer 4, before uniformly increasing again through the rest of the layers. Hence, word representations from different time steps tend to be highly dissimilar from one another by the nature of the network, which demonstrates one limitation of feature pooling. Despite this limitation, we note that the approach works well in general for building static word embeddings, as supported by previous work (Bommasani et al., 2020).

Semantic Knowledge
From our initial results on the human judgement benchmarks, we can infer at which layers of the network semantic information about the concept is best represented. When the network must perform next word prediction on the concept, we see that the final layer is most representative, whilst after the word has been given to the network, we see that the semantic information about the concept decreases as we move through the network. Such a result is not surprising as the network must gradually accumulate information that may be related to the next possible word, focusing less on the previous concept. Generally, the transformer outperforms the LSTM model after the network has received the concept in the lower layers, though the LSTM contained more representative information about the concept during next word prediction.
When probing for human conceptual knowledge, we see that the transformers perform better than the LSTMs, with the transformers performing quite well at predicting attributive features in comparison to taxonomic properties, for which there has historically been a large gap in performance (Rubinstein et al., 2015). These results may indicate that context, for which transformers produce highly contextualised representations (Ethayarajh, 2019), plays an important role in representing conceptual knowledge such as that reflected in semantic property norms. The most interesting result from our investigation is that the semantic knowledge is not forgotten in the later layers of both LSTM and transformer-based networks after receiving the concept, unlike the previous results. These findings may indicate that these networks gradually accumulate such knowledge as the sentence is processed in order to facilitate anticipation of the future. Such ideas have recently been proposed by Ferreira and Chantavarin (2018) who suggested that, in order to reconcile the differences between earlier models of integration (building associations between new concepts and previous information (Kintsch and Van Dijk, 1978;Gernsbacher, 1991)) with more recent theories of prediction, we should replace the notion of Prediction with Preparedness. Instead of considering direct prediction of future lexical items, which is usually rare (Luke and Christianson, 2016), the authors suggest that given some new information which is processed along with the past information with appropriate background knowledge, a new rich semantic representation is produced containing informative semantic features that facilitate anticipation. Our results indicate that these language models may similarly build and retain rich semantic representations that aid the network in its learning objective (conditional next word prediction).

Conclusion
In this paper, we present a novel approach to gaining a better understanding of the kinds of semantic information encoded within the layers of large-scale language models. Our analysis allows us to peer inside the hidden state representations of neural language models, and examine how semantically relevant information is encoded in each layer of the networks. We examine the language models on their ability to capture semantic meaning from two perspectives: when the network is predicting the target word, and when the target word is the most recent input. The results demonstrate that the transformer model is much better at capturing attributive features than the LSTM model, whilst both models are able to retain rich semantic representations of the concept after the concept has been given to the network.