Analysing Word Representation from the Input and Output Embeddings in Neural Network Language Models

Researchers have recently demonstrated that tying the neural weights between the input look-up table and the output classification layer can improve training and lower perplexity on sequence learning tasks such as language modelling. Such a procedure is possible due to the design of the softmax classification layer, which previous work has shown to comprise a viable set of semantic representations for the model vocabulary; these output embeddings are known to perform well on word similarity benchmarks. In this paper, we make meaningful comparisons between the input and output embeddings and other SOTA distributional models to gain a better understanding of the types of information they represent. We also construct a new set of word embeddings using the output embeddings to create locally-optimal approximations for the intermediate representations from the language model. These locally-optimal embeddings demonstrate excellent performance across all our evaluations.


Introduction
Neural Language Modelling has recently gained popularity in NLP. A Neural Network Language Model (NNLM) is tasked with learning a conditional probability distribution over the occurrences of words in text (Mikolov et al., 2011). This language modelling objective requires a neural network with sufficient capacity to learn meaningful linguistic information such as semantic knowledge and syntactic structure. Due to their ability to learn these important linguistic phenomena, NNLMs have been successfully employed as an effective method for generative pretraining (Dai and Le, 2015) and transfer learning to other natural language tasks (Peters et al., 2018a; Howard and Ruder, 2018; Radford et al., 2018). As previously suggested by Bengio et al. (2003), Mnih and Hinton (2007) and Mnih and Teh (2012), the weights of the final fully-connected output layer, or output embeddings, which compute the conditional probability distribution over the lexicon, also constitute a legitimate set of embedding vectors representing word meaning, as is the case for the input embeddings. This commonality between the input and output layers of the NNLM has motivated researchers to tie these representations together during training, improving performance on language modelling tasks (Inan et al., 2016; Press and Wolf, 2017). Furthermore, such a procedure is intuitive, since both the input and output embeddings of the network would appear to be performing a similar task of encoding information about lexical content. As described by Inan et al. (2016), they clearly live in an identical semantic space in language models, unlike other machine learning models where the input and output embeddings have no direct link.
On the other hand, it would also be reasonable to assume that the output representations require highly task-specific features (Peters et al., 2018a,b; Devlin et al., 2019). Despite their utility in language modelling, in-depth analysis of these input and output vector representations remains limited. The goal of this work is to gain a deeper understanding of the aspects of language captured in these contrasting representations. Our two main contributions (code available at https://github.com/stevend94/CoNLL2020) are as follows:
1. We perform an investigation to uncover both the broad types of semantic knowledge and fine-grained linguistic phenomena encoded within each set of word representations.
2. We propose a simple method for constructing locally-optimal approximations that we use to extend our analysis to the intermediate representations from the network.
Though generally considered task-agnostic, by making extensive comparisons between these neural representations we may reason about the type of information most salient in each semantic space. Our results demonstrate that the input and output embeddings share little in common with respect to their strengths and weaknesses, while the locally-optimal embeddings generally perform the best on most downstream tasks.

Related Work
Recent trends in NLP have seen a focus on building generative pretraining models, which have achieved state-of-the-art performance on downstream tasks (Peters et al., 2018a; Radford et al., 2018; Devlin et al., 2019; Lan et al., 2019; Liu et al., 2019; Yang et al., 2019). These sequence-based autoencoder models have almost universally adopted the convention of weight tying in their input and output layers, which has been shown to improve training and decrease perplexity scores on language modelling tasks (Inan et al., 2016; Press and Wolf, 2017). Motivated by these results, researchers have proposed a number of modifications to these networks in relation to the output classification layers. For example, Gulordava et al. (2018a) combine weight tying with a linear projection layer in the penultimate stage of the network to both decouple hidden state representations from the output embeddings and control the size of the embedding vectors. Takase et al. (2017) suggest modifying the architecture of the network by adding a gating mechanism between the input layer and the final classification layer of NNLMs. Focusing solely on the final classification layer, Yang et al. (2017) propose using a number of weighted softmax distributions, called a Mixture of Softmaxes, to overcome the bottleneck formed by their limited capacity. Takase et al. (2018) extend this approach by adding what they call a Direct Output Connection, which computes the probability distribution at all layers of the NNLM. Other work has focused on weight tying itself, such as the structure-aware output layer (Pappas et al., 2018; Pappas and Henderson, 2019). Despite their importance, there is limited work that attempts to further analyse these output embeddings beyond the work of Press and Wolf (2017), who show that these representations outperform the input embeddings on word similarity benchmarks. In recent years, such analyses have gained popularity in the NLP community as researchers have shifted their focus towards interpretability in neural networks (Alishahi et al., 2019). Examples include probing tasks, which are supervised machine learning problems that look to decode salient linguistic features from embedding vectors (Adi et al., 2016; Wallace et al., 2019; Tenney et al., 2019). Other work has focused on determining whether more cognitive aspects of meaning are adequately encoded within these representations, through probing (Collell and Moens, 2016; Li and Gauthier, 2017; Derby et al., 2020) or using cross-modal mappings (Rubinstein et al., 2015; Fagarasan et al., 2015; Bulat et al., 2016; Derby et al., 2019; Li and Summers-Stay, 2019). Moving beyond basic linguistic phenomena, researchers have also investigated more complex aspects of language such as syntactic structure using probing methods (Linzen et al., 2016; Bernardy and Lappin, 2017; Gulordava et al., 2018b; Marvin and Linzen, 2018).

Research Context and Motivation
In this section, we first provide some background on the input and output embeddings in NNLMs. We then briefly discuss how to compute new representations that are locally-optimal with respect to the prediction step of the fully-connected softmax layer of the NNLM, using stochastic gradient descent.

Neural Network Language Model
Consider a sequence of text $(y_1, y_2, \ldots, y_N)$ represented as a list of one-hot token vectors. The goal of a neural network language model is to maximize the conditional probability of the next word based on the previous context. For a vocabulary $V$, at time step $t-1$ the network computes the probability distribution $y^*_t$ over possible target words as follows:

$$h_t = f(h_{t-1}, e_t), \qquad y^*_t = \mathrm{softmax}(W h_t + b),$$

where $f$ consists of one or more temporally compatible layers, such as LSTMs (Hochreiter and Schmidhuber, 1997) or masked transformers (Vaswani et al., 2017). The function $f$ takes in a previous state $h_{t-1} \in \mathbb{R}^{d_f}$ as contextual information and an embedding $e_t$ from the look-up table $E \in \mathbb{R}^{d_e \times |V|}$, and produces a new hidden state $h_t$, which the fully-connected output layer (with weights $W$ and bias $b$) uses to compute the probability distribution $y^*_t$. We then compute the cross-entropy loss $L(y_t, y^*_t)$ between the predicted distribution and the actual distribution, and minimize the loss with gradient descent.
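As a concrete illustration of this computation, the following is a minimal PyTorch sketch of a single language-model step; the layer sizes, variable names and single-LSTM layout are placeholder assumptions for exposition only, not the model used in our experiments.

import torch
import torch.nn as nn

vocab_size, d_e, d_f = 10000, 256, 256          # placeholder dimensions

embedding = nn.Embedding(vocab_size, d_e)       # look-up table E
f = nn.LSTM(d_e, d_f, batch_first=True)         # temporally compatible layer(s) f
output_layer = nn.Linear(d_f, vocab_size)       # output embeddings W and bias b

def lm_step(tokens, state=None):
    # tokens: (batch, seq_len) integer word indices
    e = embedding(tokens)                        # e_t from the look-up table
    h, state = f(e, state)                       # hidden states h_t
    logits = output_layer(h)                     # W h_t + b
    return torch.softmax(logits, dim=-1), state  # predicted distribution y*_t

# Training minimises the cross-entropy L(y_t, y*_t) between the predicted
# distribution and the observed next word, e.g. with nn.CrossEntropyLoss.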
To consider the case of weight tying, we first note that the size of the predicted probability distribution must span the length of the lexicon $V$. Then, disregarding the bias term, since $W \in \mathbb{R}^{|V| \times d_f}$ it is easy to see how we can set $E = W^T$ provided that $d_f = d_e$. Weight tying has several advantages, including fewer training parameters and improved perplexity scores on language modelling objectives (Inan et al., 2016; Press and Wolf, 2017). However, the information that the input and output embeddings must individually learn in order to predict the correct target concept may be entirely different.
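Concretely, weight tying amounts to using a single shared matrix for both the look-up table and the softmax weights. A minimal sketch (with placeholder sizes, and requiring $d_e = d_f$) is:

import torch.nn as nn

vocab_size, d = 10000, 256                       # tying requires d_e = d_f = d

embedding = nn.Embedding(vocab_size, d)          # input look-up table E
output_layer = nn.Linear(d, vocab_size, bias=False)

# nn.Linear stores its weight with shape (|V|, d), matching the embedding
# matrix, so after this assignment the two layers share one parameter matrix.
output_layer.weight = embedding.weight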

Hidden State Word Representations
While these output embeddings can function as a set of semantic representations, their real goal is to instead compute the conditional probability distribution over the lexicon using context information from the hidden layers of the network. As such, the output embeddings may contain certain features that are specific to the language modelling objective, allowing them to identify information from the hidden layers that is relevant to predicting the target word. In addition to considering the input and output embeddings, we also consider the activation vectors from the latent layers of the language model in order to extend the scope of our analysis. From the perspective of how these layers represent lexical information, we are interested in the activation vectors in the hidden layers that lead to high prediction probabilities for the target words.
Intuitively, in order to find some activation vector from the latent layers that best represents a particular word, we would like to generate a sentence fragment that is optimal with respect to predicting that word (i.e. the hidden state $h_t$ for the sentence fragment yields the highest possible probability for the target word being the next word in the sequence, given the softmax calculation described above). We could then use these hidden state activations for each word as an additional embedding space, similar to Bommasani et al. (2020). However, we lack an efficient generative process for finding such optimal sentence fragments. We could sample a large number of sentence fragments from a corpus and record which sentence fragments give the highest output probability for each word in our lexicon, but this would be highly inefficient and, moreover, would not guarantee that we have found the best hidden state activation vector for each word.
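For illustration, such a corpus-based search could be sketched as follows. The interface lm_hidden_and_probs(fragment), which runs the pretrained language model over a fragment and returns its final hidden state together with the next-word probability distribution, is a hypothetical helper introduced only for this sketch.

import torch

def search_best_hidden_states(lm_hidden_and_probs, fragments, vocab_size, d_f):
    # For every word, keep the hidden state (over all sampled fragments) that
    # assigns that word the highest next-word probability.
    best_prob = torch.zeros(vocab_size)
    best_state = torch.zeros(vocab_size, d_f)
    for fragment in fragments:
        h, probs = lm_hidden_and_probs(fragment)   # h: (d_f,), probs: (vocab_size,)
        improved = probs > best_prob
        best_prob[improved] = probs[improved]
        best_state[improved] = h                   # broadcast this state to all improved rows
    return best_state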
In the next section, we present a procedure to identify such optimal hidden states, which we refer to as locally-optimal vectors.

Locally-optimal Vectors
To find a latent representation that maximally predicts the target word from the final classification layer of the NNLM, we build a gradient-based approximation for each word. To achieve this, we employ a technique similar to Activation Maximization in computer vision (Simonyan et al., 2013). For a pretrained NNLM, let $W \in \mathbb{R}^{d_f \times |V|}$ be the weight matrix (i.e. the output embeddings) and let $b \in \mathbb{R}^{|V|}$ be the bias vector of the final prediction layer of the network. For each word $w \in V$, we want to find the corresponding input $I \in \mathbb{R}^{d_f}$ that maximizes the probability of the word $w$. Let $S_w$ be the score function for the word $w \in V$, which takes an input and gives the probability output for the target class $w$. We can then formulate the problem as

$$\arg\max_{I} \; S_w(I) - \lambda \|I\|_2^2,$$

where $\lambda$ is a regularisation parameter. As described by Simonyan et al. (2013), maximizing the class probability can also be achieved by minimizing the scores of incorrect classes. This is undesirable for visualization purposes (see Simonyan et al., 2013), which is the reason why softmax normalization is usually omitted in that setting; in our case, however, finding the most probable class is exactly what is desired. The regularisation term stops the magnitude of the vectors from growing too large, focusing instead on the angular information between representations. We refer to these representations as AM Embeddings. Although these embeddings have the same dimensionality as the hidden states $h_t$ in the NNLM and play the same role in the softmax calculation, we note that they are not derived from any particular text sequence input to the NNLM; indeed, there may not exist any sentence fragment that produces these hidden state activations.

Since the pretrained language model we analyse does not tie its input and output weights, we may analyse the input and output embeddings as separate entities. We use the freely-available language model of Jozefowicz et al. (2016), which we refer to as JLM. The JLM network consists of a character-level embedding input and two LSTM layers of size 8192, which both incorporate a projection layer to reduce the hidden state dimensionality down to 1024. The softmax output of the model has a word-level vocabulary of 800K word classes, and the model is trained on the one-billion-word news dataset (Chelba et al., 2013).

Pretrained NNLM Embeddings
We first acquire the input and output embeddings by extracting the appropriate matrices from their respective locations in the JLM network, with the input embeddings generated using the character-level layers. We then construct the AM embeddings, first randomly initialising a set of $|V|$ vectors and then optimising them using the Adam optimiser with a learning rate of 0.001 and regularisation term $\lambda = 10^{-5}$. We train for 100 epochs with a batch size of 1024 using Keras. Due to the enormous size of the lexicon of the JLM language model, we downsample the 800K-word vocabulary to the 20K most frequently occurring words, which gives good coverage over the evaluation datasets.
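The optimisation itself is straightforward. Below is a minimal PyTorch sketch of the procedure, assuming the output weights W (here stored with shape (|V|, d_f)) and bias b have already been extracted as tensors; it maximises the log-probability of each target word (equivalently, minimises a cross-entropy loss) plus the L2 penalty, which is one natural instantiation of the objective above, and its full-batch loop differs from the batched Keras setup we actually used.

import torch
import torch.nn.functional as F

def train_am_embeddings(W, b, lr=0.001, lam=1e-5, steps=1000):
    # W: (|V|, d_f) output embeddings; b: (|V|,) bias of the softmax layer.
    V, d_f = W.shape
    I = torch.randn(V, d_f, requires_grad=True)      # one candidate vector per word
    targets = torch.arange(V)                        # row w should maximise P(w)
    optimiser = torch.optim.Adam([I], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        logits = I @ W.T + b                          # class scores for every candidate
        loss = F.cross_entropy(logits, targets) + lam * (I ** 2).sum(dim=1).mean()
        loss.backward()
        optimiser.step()
    return I.detach()                                 # the AM Embeddings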

Distributional Semantic Models
We also want to compare these embeddings with state-of-the-art distributional semantic models in order to make meaningful comparisons. For this, we use the skip-gram implementation of Word2Vec and FastText (Bojanowski et al., 2017), using the gensim package and the Python implementation of Facebook's FastText respectively. Word2Vec was trained with embeddings of size 300 and a context window of 5, while FastText uses the default settings with embedding size 100, window size 5, and n-grams of sizes 3 to 6. We also train a Python implementation of GloVe (Pennington et al., 2014) for 100 epochs with a learning rate of 0.05 to construct word embeddings of size 300. For a fair comparison, all models are trained on the same billion-word dataset (Chelba et al., 2013) as JLM.
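For reference, the Word2Vec baseline can be trained with gensim along the following lines. This is a hedged sketch: the corpus iterator and output path are placeholders, and the parameter name vector_size applies to gensim 4.x (older releases use size).

from gensim.models import Word2Vec

# `sentences` is assumed to be an iterable of tokenised sentences from the
# one-billion-word benchmark corpus (Chelba et al., 2013).
w2v = Word2Vec(
    sentences,
    vector_size=300,   # embedding size 300
    window=5,          # context window of 5
    sg=1,              # skip-gram rather than CBOW
)
w2v.wv.save("word2vec_1bw.kv")   # placeholder output path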

Experiments
To assess these representations for both task-specific effectiveness and fine-grained linguistic knowledge, we perform a broad range of experiments. These assessments include comparison with human judgements of word relations (Intrinsic Evaluations), analysing performance on supervised machine learning tasks (Extrinsic Evaluations), and using probing tasks to isolate linguistic phenomena. We hypothesise that the input and output embeddings should perform quite well on the intrinsic benchmarks, while the AM embeddings should give the best results on downstream prediction tasks, as we would similarly expect of the hidden representations from the intermediate layers of the network (Peters et al., 2018a).

Intrinsic Evaluations
We first compare the word embeddings with human semantic judgements of word pair similarity.
The rationale is that a good semantic model should correlate with semantic ground-truth information elicited from humans, either from conscious judgements, or from patterns of brain activation as people process the words (Bakarov, 2018). Similarity Benchmarks A traditional method for evaluating word embeddings uses the intuitions of human raters about word semantic similarity. Word similarity benchmarks can, in general, be partitioned into two types: semantic similarity and semantic relatedness. Here, semantic relatedness refers to the strength of association between words (e.g. COFFEE and CUP), while semantic similarity reflects shared semantic properties (e.g. COFFEE and TEA). For benchmarks focusing on semantic relatedness/association, we use MEN (Bruni et al., 2012), MTurk (Radinsky et al., 2011) and WordSim353-Rel (Agirre et al., 2009), and for semantic similarity we use SimLex-999 (Hill et al., 2015) and WordSim353-Sim (Agirre et al., 2009). We also include two datasets whose judgement scores do not fall into either category, WordSim353 (Finkelstein et al., 2002) and RareWords (Luong et al., 2013). For the embedding vectors, similarity is computed using the cosine between pairs of word vectors, with Spearman's ρ used to measure the correlation between human scores and the cosine similarities. We perform our analysis using the Vecto python package (Rogers et al., 2018).
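The computation behind each benchmark score is simple; the following sketch (independent of the Vecto package, with a plain Python dictionary of word vectors standing in for an embedding model) shows how a single correlation is obtained.

import numpy as np
from scipy.stats import spearmanr

def similarity_benchmark_score(embeddings, word_pairs, human_scores):
    # embeddings: dict mapping word -> vector; word_pairs: list of (w1, w2);
    # human_scores: gold similarity ratings aligned with word_pairs.
    cosines = []
    for w1, w2 in word_pairs:
        v1, v2 = embeddings[w1], embeddings[w2]
        cosines.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(cosines, human_scores)
    return rho                                   # Spearman's rho against human judgements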

Predicting Brain Data
We also evaluate these embeddings on another intrinsic evaluation task, one that does not directly employ human semantic judgement. Instead, this evaluation asks whether the embedding models can reliably predict activation patterns in human brain imaging data recorded as participants processed the meanings of words. For this, we use BrainBench (Xu et al., 2016), a semantic evaluation platform that includes two separate neuroimaging datasets (fMRI and MEG) from humans for 60 concept words. This benchmark evaluates how well the embeddings can make predictions about the neuroimaging data using a 2 vs. 2 test, with 50% indicating chance accuracy.
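As a rough illustration of the scoring scheme, a generic 2 vs. 2 trial can be sketched as below. This is a simplified stand-in that abstracts away from how BrainBench itself derives the model-based and brain-based representations being compared.

import numpy as np

def two_vs_two_trial(model_1, model_2, brain_1, brain_2):
    # The trial counts as correct when the matched model/brain pairings are
    # more similar (here, more correlated) than the mismatched pairings.
    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]
    matched = corr(model_1, brain_1) + corr(model_2, brain_2)
    mismatched = corr(model_1, brain_2) + corr(model_2, brain_1)
    return matched > mismatched

# Accuracy is the fraction of correct trials over all word pairs; 50% is chance.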

Intrinsic evaluation results
In general, the output embeddings perform better than the input embeddings (Table 1), similar to the findings of Press and Wolf (2017). The only case where the input embeddings yield higher correlations than the output embeddings is on Rare Words. We can attribute this to the fact that the input embeddings are constructed from character-level representations. In comparison to the SOTA distributional models, the output embeddings tend to only beat FastText on SimLex-999 and BrainBench, while also struggling in comparison to Word2Vec on semantic relatedness and hybrid tasks. On the other hand, our AM embeddings perform very well in all evaluations, being the top-performing model in most evaluations and performing quite similarly to FastText on MEN and Rare Words. While we hypothesised that the AM embeddings should perform quite well on downstream tasks, the ability of these novel word embeddings to explain human semantic judgement and reliably decode brain imaging data is surprising and interesting.

Extrinsic Evaluations
Next, we evaluate these representations by analysing their performance on a number of downstream tasks. Each task may demand a certain set of features relevant to the task, requiring these representations to encode a wide range of linguistic knowledge. We expect the output embeddings to perform better than the input embeddings and other SOTA semantic models, based on previous research demonstrating that representations from the upper layers of the NNLM tend to perform better at prediction tasks (Peters et al., 2018a,b; Devlin et al., 2019). Since the AM embeddings represent a locally-optimal instance of the penultimate layer of the network, we also expect them to perform well.
Transfer Learning Tasks We make use of SentEval (Conneau et al., 2017), an evaluation suite for analysing the performance of sentence representations. Though we are working with word embeddings, applications rarely require words in isolation.
To build sentence embeddings, we take the average embedding vector of all words in the sentence.
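For example, a minimal sketch of this averaging (assuming whitespace tokenisation and a dictionary of word vectors) is:

import numpy as np

def sentence_embedding(sentence, embeddings, dim):
    # Average the vectors of all in-vocabulary tokens; fall back to zeros if none.
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)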
SentEval includes a number of binary classification datasets, including two movie review sentiment datasets (MR) (Pang and Lee, 2005) and (SST2), a product review dataset (CR) (Hu and Liu, 2004), a subjectivity dataset (Subj.) (Pang and Lee, 2004) and an opinion polarity dataset (MPQA) (Wiebe et al., 2005). It also includes two multiclass classification tasks, a question type classification dataset (TREC) (Voorhees and Tice, 2000) and a movie review dataset with five sentiment classes (SST5), as well as an entailment dataset (SICK-E) (Marelli et al., 2014) and a paraphrase detection dataset (MRPC) (Dolan et al., 2004). For classification, we use a one-layer classifier implemented in PyTorch (run on GPU) with default parameters and Adam optimisation.
The results (Table 2) show that, on binary classification tasks, the input and output embeddings perform quite similarly, while both provide better results than the distributional models in almost all cases. Taking a closer look, we can see that the output embeddings perform best at predicting movie review sentiment (MR, SST2) and opinion polarity (MPQA), while the input embeddings provide the highest scores when predicting product review sentiment (CR) and subjectivity (Subj.). When predicting multiple classes (TREC, SST5), the input embeddings perform marginally better than the output embeddings, though the AM embeddings perform best overall on both binary and multiclass datasets. Interestingly, the input embeddings are much better at both predicting entailment (SICK-E) and paraphrase detection (MRPC) than all other models.
Semantic Text Similarity To further evaluate how well these embeddings perform at judging sentence relations, we also employ transfer learning on the semantic relatedness tasks from SemEval, in particular SICK-R (Marelli et al., 2014) and STS-B (Cer et al., 2017). These tasks consist of sentence pairs with scores ranging from 0 to 5, indicating the level of similarity between the sentences. We see from the results (Table 3) that the input embeddings again give the highest correlation with semantic relatedness scores, similar to the previous results. Furthermore, the AM embeddings perform worse at judging relatedness than the output embeddings, though the differences are quite small. Our AM embeddings still outperform all SOTA distributional models.
We also perform transfer learning on a set of Semantic Textual Similarity (STS) benchmarks (Agirre et al., 2016). Each dataset contains sentence pairs similar to the relatedness tasks, though each is drawn from different sources such as news articles or forums. Here, we record performance using the average Pearson and Spearman correlation for each STS dataset, with results displayed in Figure 1. The input embeddings again give the best performance on all datasets, similar to the previous results on sentence relatedness. Furthermore, the AM embeddings perform better than the output embeddings on all datasets, in contrast to the previous findings. The results demonstrate that the input embeddings are much more suited to sentence comparison tasks than the other pretrained NNLM embeddings.

Probing Tasks
We next examine whether the embedding vectors capture certain linguistic properties when utilised as sentence representations. These probing tasks are formulated as supervised classification problems, with strong performance indicating the presence of an isolated characteristic such as sentence length. As with the transfer learning tasks, we take the average embedding vector of all words to generate the sentence embedding. These tasks are taken from Conneau et al. (2018), and are partitioned into three separate categories.
• Surface Information: These tasks include sentence length prediction (SentLen) and deciding whether a word is present in the representations (WordContent).
• Syntactic Information: Focusing on grammatical structure, these include tasks for predicting the maximum path length from a node to the root (TreeDepth) and predicting the top constituent below the <S> node (TopConsts).
• Semantic Information: Focusing on dependency knowledge, these include tasks for predicting the tense of the main verb (Tense), the number of subjects of the main clause (SubjNum) and the number of objects of the main clause (ObjNum).
We exclude other probing tasks that rely on word position in the sentence, since these averaged word embeddings are invariant with respect to word order. The results are displayed in Figure 2. The SOTA distributional models tend to perform worse than the pretrained NNLM representations when predicting SentLen and WordContent, though the output embeddings perform poorly compared to the input and AM embeddings. The AM embeddings perform well, perhaps because of their training objective, which incentivises linear separability. When predicting syntactic information, the input and output embeddings perform similarly at classifying TreeDepth and TopConsts, with the AM embeddings performing best. Finally, when predicting Tense, SubjNum and ObjNum, the output embeddings are superior, which may be due to the output embeddings heavily encoding dependency information that is relevant to predicting the upcoming word during language modelling. Indeed, LSTMs are particularly good at learning dependency information such as subject-verb agreement (Linzen et al., 2016).

Neural Language Modelling
We have demonstrated that the linguistic knowledge captured by the input and output embeddings is moderately distinct. These results may imply that the input and output embeddings of the NNLM require particular sets of non-overlapping characteristics that are important to their respective roles in the NNLM. To further understand whether and how these representations are specialised to their particular functions in the input and output layers, we perform domain transfer on the language modelling objective. For our evaluation, we test each set of embedding vectors when fixed as certain weights in the network:
1. NNLM-In: Fixing our embedding vectors as the look-up table input to the language model.
2. NNLM-Out: Fixing the softmax output layer by using the transpose of the stacked embedding vectors as the matrix of dense weights, without a bias vector.
3. NNLM-Tied: Fixing both the embedding input and the softmax output by using our embeddings as the tied weights.
Here we expect the input embeddings and output embeddings to perform well in the case of NNLM-In and NNLM-Out respectively, since in these cases their role is congruent with their original role in JLM. We also expect the other distributional models to perform well as input embeddings, based on previous research. It will also be interesting to see how the AM representations perform, since they are trained using the output embeddings and thus should share much of their linguistic knowledge. If the input and output embeddings perform similarly, we can infer that these representations contain considerable overlap in lexical information. However, if they perform poorly when their roles are switched, we can conclude that these representations must learn role-specific features not encoded in the other semantic spaces. See the appendix for training details, which closely follow the medium-sized LSTM model presented by Zaremba et al. (2014), trained on the Penn Treebank dataset (Marcus et al., 1993).
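To make the three configurations concrete, the sketch below shows one way of fixing pretrained vectors in a small PyTorch language model; the layer sizes and module layout are placeholders and do not reproduce the exact Zaremba et al. (2014) setup described in the appendix.

import torch
import torch.nn as nn

def build_lm(pretrained, mode):
    # pretrained: tensor of shape (|V|, d) holding the embeddings under test;
    # mode: 'in', 'out' or 'tied', matching NNLM-In, NNLM-Out and NNLM-Tied.
    vocab_size, d = pretrained.shape
    encoder = nn.Embedding(vocab_size, d)                     # input look-up table
    lstm = nn.LSTM(d, d, num_layers=2, batch_first=True)      # recurrent layers
    decoder = nn.Linear(d, vocab_size, bias=(mode == 'in'))   # softmax layer (no bias when fixed)

    frozen = nn.Parameter(pretrained.clone(), requires_grad=False)
    if mode == 'in':
        encoder.weight = frozen          # fixed look-up table, trainable softmax layer
    elif mode == 'out':
        decoder.weight = frozen          # fixed softmax weights, trainable look-up table
    elif mode == 'tied':
        encoder.weight = frozen          # a single fixed matrix shared by both layers
        decoder.weight = frozen
    return encoder, lstm, decoder

# Only parameters with requires_grad=True should be passed to the optimiser,
# so the fixed matrices remain untouched during training.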

Perplexity Results
Results are displayed in Table 4. In the NNLM-In models, we see that the AM embeddings provide the best performance, even outperforming the input embeddings, with the output embeddings and SOTA distributional models performing quite well. We also note that the input embeddings still provide slightly better performance than the output embeddings in this analysis. In the case of the NNLM-Out networks, most of the distributional models perform poorly. The NNLM struggles when the distributional models are utilised as fully-connected classification weights, while the output embeddings, which were trained for this role, perform best, though the AM embeddings also perform well. The input embeddings perform poorly in the NNLM-Out model, indicating that the output embeddings do encode role-specific knowledge not captured by the other distributional models. Finally, when we tie and fix the weights, the SOTA distributional models and input embeddings do not improve performance much in the NNLM-Tied model. Both the output embeddings and AM embeddings perform well, and our AM embeddings surprisingly give the best results.

Discussion
We can draw several conclusions from these results. As expected, the type of semantic knowledge these representations capture is dependent on their position in the network.

Semantic Knowledge
The input embeddings struggle with representing word-level semantic relationships, though they perform well at estimating relatedness between sentences and at paraphrase detection. The input embeddings also seem to encode several aspects of surface-level information such as sentence length, behaviour more expected of contextualised representations of meaning. Indeed, the input embeddings seem to contain at least some qualities that make them suitable for building sentence-level representations. On the other hand, the output embeddings struggle as sentence-level representations. This is not so surprising, since it is the input embeddings, not the output embeddings, that serve as the components from which contextual representations are constructed in the intermediate layers.
The output embeddings correlate more closely with human judgements of word-level association, and with neuroimaging data for isolated concept words, than the input embeddings do. Furthermore, the output embeddings are highly task-specific to language modelling. Though other distributional semantic models estimate representations of meaning through somewhat similar language modelling objectives, they fail to learn any meaningful knowledge that is transferable to the output classification layer of the language modelling task.

Weight Tying
There are a number of characteristics that each set of representations seems to capture quite well given its position in the architecture of the NNLM. In a tied representation, we would expect the network to learn a set of embedding vectors that encode all such knowledge, though the contribution from each layer may not be entirely equal. Press and Wolf (2017) noted that, due to the update rules that apply when weight tying is used between these layers, every row of the output embedding is updated at each iteration, unlike the input embeddings. This implies a greater degree of similarity of the tied embedding to the untied model's output embedding than to its input embedding. From the perspective of this work, we would also add that a tied representation would be more similar to the output embeddings because the information they capture is more important to the overall learning objective. Based on our results, while the output embedding knowledge is quite transferable to the input role, the converse does not hold.

Transfer Learning
In recent years, representations from pretrained neural language models have become a popular choice for transfer learning to other tasks. Generally, the intermediate representations from the layers of the network are preferred, since they are contextualised over the sentence and generally perform better in downstream tasks. In our work, we use the AM embeddings as a stand-in for the intermediate layers' hidden states that are locally-optimal for each particular target word. Similar to these intermediate representations, our AM embeddings perform quite well on downstream NLP tasks. While this is to be expected, the results on the intrinsic evaluations and language modelling tasks are surprising. We would expect these embeddings to learn quite a bit of knowledge from the output embeddings, though the increase in performance on some tasks is striking. This may be due to the activation maximisation training objective that we employ, which forces linear separability between words in the lexicon whilst preserving the semantic information about each word (see Appendix).

Conclusion
We perform an in-depth analysis of the input and output embeddings of neural network language models to investigate what linguistic features are encoded in each semantic space. We also extend our analysis by constructing locally-optimal vectors from the output embeddings, which seem to provide overall better performance on both intrinsic and extrinsic evaluation tasks, beating well-established distributional semantic models in almost all evaluations.