Interpreting Word-Level Hidden State Behaviour of Character-Level LSTM Language Models

While Long Short-Term Memory networks (LSTMs) and other forms of recurrent neural network have been successfully applied to language modeling on a character level, the hidden state dynamics of these models can be difficult to interpret. We investigate the hidden states of such a model by using the HDBSCAN clustering algorithm to identify points in the text at which the hidden state is similar. Focusing on whitespace characters prior to the beginning of a word reveals interpretable clusters that offer insight into how the LSTM may combine contextual and character-level information to identify parts of speech. We also introduce a method for deriving word vectors from the hidden state representation in order to investigate the word-level knowledge of the model. These word vectors encode meaningful semantic information even for words that appear only once in the training text.


Introduction
Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997; Gers et al., 2000), have been widely applied to natural language processing tasks including character-level language modeling (Mikolov et al., 2012; Graves, 2013). However, like other types of neural networks, the hidden states and behaviour of a given LSTM can be difficult to understand and interpret, due both to the distributed nature of the hidden state representations and to the relatively opaque relationship between the hidden state and the final output of the network. It is also not clear how a character-level LSTM language model takes advantage of orthographic patterns to infer higher-level information.

In this paper, we investigate the hidden state dynamics of a character-level LSTM language model both directly and, through the use of output gate activations, indirectly. As an overview, our main contributions are:

1. We use clustering to investigate similar hidden states (and output gate activations) at different points in a text, paying special attention to whitespace characters. We provide insight into the model's awareness of both orthographic patterns and word-level grammatical information.
2. Inspired by our findings from clustering, we introduce a method for extracting meaningful word embeddings from a character-level model, allowing us to investigate the word-level knowledge of the model.
First, we use the HDBSCAN clustering algorithm (Campello et al., 2013) to reveal locations within a text at which the hidden state of the LSTM is similar, or at which a similar combination of cell state dimensions is relevant (as determined by the output gates). Interestingly, focusing on moments when the network must predict the first letter of a word reveals clusters that are interpretable on the level of words and which display both character-level patterns and grammatical structure (i.e. separating parts of speech). We give examples of clusters of similar hidden states that appear to be heavily influenced by local orthographic patterns but also distinguish between different grammatical functions of the pattern: for example, a cluster containing whitespace characters following possessive uses, but not contractive uses, of the affix "'s". This sheds light on how the model uses orthographic patterns to infer higher-level information.
We also introduce a method for extracting word embeddings from a character-level model and perform qualitative and quantitative analyses of these embeddings. Surprisingly, this method can assign meaningful representations even to words that appear only once in the text, including associating the rare word "scrutinizingly" with "questioningly" and "attentively", and correctly identifying "deck" as a verb based on a single use despite its lack of meaningful subword components. These results suggest that the model is capable of deducing meaningful information about a word based on the context of a single use. While these embeddings do not achieve state-of-the-art performance on word similarity benchmarks, they do outperform the older methods of Turian et al. (2010) despite the small corpus size and the fact that our language model was not designed with the intent of producing word embeddings.

The rest of the paper is structured as follows: The following section describes related work. Section 3 describes the architecture and training of the LSTM language model used in our experiments. In Section 4, we describe our clustering methods and show examples of the clusters found, as well as a part-of-speech analysis. In Section 5, we describe and analyze our method for extracting word embeddings from the character-level model. Finally, we conclude and suggest directions for future work.

Analyzing Hidden State Dynamics
Many researchers have investigated techniques for understanding the meaning and dynamics of the hidden states of recurrent neural networks. In his seminal paper introducing the simple recurrent network (SRN, or "Elman network"), Elman (1990) uses hierarchical clustering to investigate the hidden states of a word-level RNN modeling a toy language of 29 words. Our approach in Section 4 is in some ways similar, although we use real English data and a character-level LSTM model. This also bears some similarities to a visualization technique used by Krakovna and Doshi-Velez (2016) to investigate a hybrid HMM-LSTM model, although their work uses only 10 k-means clusters and does not deeply investigate clustering. Elman (1991) also uses principal component analysis to visualize hidden state over time, and many researchers have used dimensionality reduction methods such as t-SNE (Van der Maaten and Hinton, 2008) to visualize similarity between word embeddings, as well as other forms of distributed representation. More recent work directly visualizes representations over time using heatmaps, and Strobelt et al. (2018) develop interactive tools for visualizing LSTM hidden states and testing hypotheses about distributed representations.
Other researchers have investigated methods for clarifying the function of specific hidden dimensions. Karpathy et al. (2015) use static visualizations to demonstrate the existence of cells in an LSTM language model with interpretable behaviour representing long-term dependencies (such as cells tracking line length or quotations in a text). Another approach is that of Kádár et al. (2017), who introduce a "Top K Contexts" method for interpreting the function of certain hidden dimensions, identifying the K points in a sequence which experience the highest activations for the dimension in question.

Character-Level Word Embeddings
Multiple researchers have developed methods for creating word embeddings that incorporate subword-level (Luong et al., 2013) or character-level (Santos and Zadrozny, 2014; Ling et al., 2015) information in order to better handle rare or out-of-vocabulary words. These approaches differ from our work in Section 5 in that they use architectures specifically designed to create word embeddings, while we create embeddings from the hidden state of a character-level model not designed for this purpose. In addition, we are interested not in the embeddings themselves, but rather in what they tell us about the word-level knowledge of the language model. Kim et al. (2016) investigate word embeddings created by a character-aware language model; however, their model uses word-level inputs that are further subdivided into character-level information and makes predictions on the word level, while we use an entirely character-level model.

Model
In this paper we focus on the task of language modeling on the character level. Given an input sequence of characters, the model is tasked with predicting a log probability distribution over the following character.
We trained two models on different data sets using the same architecture. Most of the paper focuses on the War and Peace model, but Section 5 uses embeddings derived from the Lancaster-Oslo/Bergen Corpus model when measuring performance against word embedding benchmarks.

Training Data
Our first model uses a relatively small data set, consisting of the text of War and Peace by Tolstoy. This data set was chosen due to its convenience as a sufficiently long but stylistically consistent example of English text. The text contains 3,201,616 characters. We use the first 95% of the data for training and the last 5% for validation. Our second model uses a slightly larger data set, consisting of the Lancaster-Oslo/Bergen (LOB) corpus (Johansson et al., 1978), from which we removed all markup. This data set draws from a wide variety of fiction and non-fiction texts written in British English in 1961, and contains 5,818,332 characters in total. It was chosen for use in Section 5 because it covers a wide range of topics (allowing us to extract word embeddings for a wider vocabulary) while still remaining at a manageable size. We use the last 95% of the data for training and the first 5% for validation.

Model Architecture and Implementation
We use a simple LSTM architecture consisting of a 256-dimensional character embedding layer, followed by three 512-dimensional LSTM layers, and a final layer producing a log softmax distribution over the set of possible characters. The model was implemented in PyTorch (Paszke et al., 2017) using the default LSTM implementation.
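A minimal PyTorch sketch of this architecture (the class and variable names are our own, and the vocabulary size depends on the character set of the corpus; note that nn.LSTM's dropout option applies between layers only, a simplification relative to the dropout scheme described below):

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level language model: embedding -> 3 x LSTM -> log softmax."""

    def __init__(self, vocab_size: int, emb_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=3,
                            dropout=0.5, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, chars, state=None):
        # chars: (batch, seq_len) integer character indices
        emb = self.embedding(chars)
        out, state = self.lstm(emb, state)  # out: (batch, seq_len, hidden_dim)
        logits = self.decoder(out)
        # Log probability distribution over the next character at each step.
        return torch.log_softmax(logits, dim=-1), state
```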
This architecture was chosen mostly arbitrarily, and distantly inspired by Karpathy et al. (2015).

Training
The War and Peace model was trained for 170 epochs using stochastic gradient descent and the negative log likelihood loss function, with minibatches of size 100 and truncated backpropagation through time (BPTT) of 100 time steps. During training, dropout was applied after each LSTM layer with a dropout rate of 0.5. The learning rate was initially set to 1 and halved every time the loss on the validation data set plateaued. The final model achieved 1.660 bits-per-character (BPC) on the validation data.
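The training loop with truncated BPTT can be sketched as follows. This is a simplified, hypothetical implementation: the tiny model dimensions and function names are illustrative, and details such as dropout and learning rate halving on validation plateaus are omitted. The key point is that the LSTM state is carried across windows but detached, so gradients stop at each window boundary.

```python
import torch
import torch.nn as nn

# Tiny stand-in model (the real one uses a 256-d embedding and 3 x 512-d LSTM layers).
class TinyCharLM(nn.Module):
    def __init__(self, vocab=30, emb=16, hid=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, x, state=None):
        h, state = self.rnn(self.emb(x), state)
        return torch.log_softmax(self.out(h), dim=-1), state

def train_epoch(model, data, bptt=100, lr=1.0):
    """One epoch of SGD with truncated backpropagation through time.
    data: (batch, length) tensor of character indices."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()  # model outputs log probabilities
    state = None
    total = 0.0
    for i in range(0, data.size(1) - 1, bptt):
        seq_len = min(bptt, data.size(1) - 1 - i)
        x = data[:, i:i + seq_len]            # inputs
        y = data[:, i + 1:i + 1 + seq_len]    # next-character targets
        if state is not None:
            # Carry the state forward but cut the gradient graph here.
            state = tuple(s.detach() for s in state)
        logp, state = model(x, state)
        loss = loss_fn(logp.reshape(-1, logp.size(-1)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        total += loss.item()
    return total
```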
The Lancaster-Oslo/Bergen model was trained for 100 epochs using the PyTorch implementation of AdaGrad, with mini-batches of size 100, truncated BPTT of 100 time steps, a dropout rate of 0.5, and an initial learning rate of 0.01. The final model achieved 1.787 BPC on the validation data.

Cluster Analysis of Character-Level and Word-Level Patterns
In this section we analyze points in the training text by clustering according to hidden state values and output gate activations, revealing a combination of grammatical and word-level patterns reflected in the hidden state of our language model.

Data For Clustering
We created two sets of data for use in clustering: a "full" data set and a "whitespace" data set. To create the "full" data set, we ran our War and Peace language model on the first 50,000 characters of the training data and recorded the hidden state (i.e. the values often denoted h_t in the LSTM literature, rather than the cell state c_t) and the sigmoid activations of the output gate of the third LSTM layer at each time step. We focus on the third layer based on the expectation that it will encode more high-level information than earlier layers, an expectation which was supported by brief experimentation on the first layer.
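The two recorded signals relate as follows. This is a self-contained NumPy sketch of a single LSTM step; the stacked (input, forget, cell, output) parameter layout mirrors PyTorch's convention and is an assumption here, since the default nn.LSTM does not expose gate activations directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W (4H x D), U (4H x H), and b (4H,) hold the stacked
    (input, forget, cell, output) parameters. Returns the two signals we
    record for clustering: the hidden state h_t and output gate o_t."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell values
    o = sigmoid(z[3 * H:])      # output gate activations (recorded)
    c = f * c_prev + i * g      # cell state c_t
    h = o * np.tanh(c)          # hidden state h_t (recorded)
    return h, c, o
```

The output gate thus selects, per dimension, how much of the (squashed) cell state is exposed in h_t, which is why we treat its activations as an indirect view of which cell state dimensions the network considers relevant.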
To create the "whitespace" data set, we ran the War and Peace model on the first 250,000 characters of the training data and recorded data only for timesteps when the input character was a space or a new line character.

Basic Clustering Experiment
We chose to use the HDBSCAN clustering algorithm (Campello et al., 2013), since it is designed to work with non-globular clusters of varying density, does not require that an expected number of clusters be specified in advance, and avoids assigning points to a cluster if they do not fit well into any cluster (such points are labeled as noise).

Using the "full" data set, we clustered the time steps according to either hidden state values or output gate activations. We used the Euclidean metric and the HDBSCAN parameters min_cluster_size=100 and min_samples=10. These parameters were chosen somewhat arbitrarily rather than on the basis of a parameter search; we did briefly try other settings during preliminary research and found that the results were similar. Clustering by hidden state values and clustering by output gate activations both produced a number of interpretable clusters. Table 1 shows a representative sample of the clusters found when using the hidden state for clustering.

We found that most clusters seemed to have interpretable meanings on the character level, often including characters near the start of words that begin with a particular character or characters, as in clusters 4, 7, and 14. In some cases, these clusters seem to locate orthographic patterns that are useful in predicting the following character; for example, the characters in cluster 4 are often followed by an "h", and cluster 39 contains mostly letters at the end of a word (i.e. usually followed by whitespace). However, we did not find clusters that were characterised only by the following characters and not by patterns in the preceding characters.
More interestingly, clusters consisting of points immediately preceding the start of a word tended to reflect word-level information about the preceding word. For example, cluster 54 consists of spaces immediately following the pronouns "he" and "she", as well as the interrogative pronoun "who", while cluster 56 consists of spaces following certain prepositions. This was observed both in the clusters based on hidden state and in the clusters based on output gate activations. This could be due to the fact that the output gate activations, which also shape the hidden state, can be interpreted as choosing which dimensions of the cell state are relevant for the network's "decision" at a given time, and we would expect word-level information to be relevant when choosing a distribution over the first letter of the next word.

Whitespace Clustering and Part-of-Speech Analysis

Since the clusters including whitespace tended to reflect word-level grammatical information (as seen in clusters 54, 56, and 62 from Table 1), we performed another round of clustering, restricting our focus to spaces and new lines only. Clustering was performed on the "whitespace" data according to either hidden states or output gate activations, again producing many interpretable clusters.
For the purposes of word-level analysis, each data point (corresponding to a whitespace character in the text) was equated with the word immediately preceding it. The Stanford Part-of-Speech Tagger (Toutanova et al., 2003) was used to tag the text with part-of-speech (POS) information, and for each cluster the precision (percentage of words in the cluster having a given tag) and recall (percentage of words with a given tag falling into the cluster) were calculated with respect to each tag. Since the clusters are based only on data corresponding to whitespace, words not followed by whitespace (approximately 16% of all words) were not counted when calculating recall. A selection of clusters, example members, and POS statistics can be seen in Table 2. Clusters are designated "OG" or "HS" for "output gate" and "hidden state" respectively, so "HS-35" means the 35th cluster produced when clustering by hidden state values. These clusters were selected to illustrate the interesting patterns present, rather than to represent "typical" clusters.
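The per-tag precision and recall computation can be sketched as follows (a minimal illustration with our own function name; each element corresponds to one whitespace point, equated with the preceding word and its POS tag in context):

```python
def pos_precision_recall(cluster, corpus, tag):
    """cluster and corpus are lists of (word, tag) pairs; each cluster
    element corresponds to the word preceding one whitespace character.
    Precision: fraction of cluster members carrying `tag`.
    Recall: fraction of all `tag`-bearing corpus tokens in the cluster."""
    in_cluster = sum(1 for _, t in cluster if t == tag)
    in_corpus = sum(1 for _, t in corpus if t == tag)
    precision = in_cluster / len(cluster) if cluster else 0.0
    recall = in_cluster / in_corpus if in_corpus else 0.0
    return precision, recall
```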
The resulting clusters based on hidden states were similar to those based on output gate activations. Both approaches produced some clusters based on a mix of orthographic and semantic similarity: for example, both produced a cluster consisting primarily of three-letter verbs beginning with "s" (particularly "sat", "saw", and "say"), as well as clusters consisting of possessive uses of the suffix "'s" but not uses of "'s" as a contraction of "is" (as in "it's", "that's", etc.), despite the existence of several such uses in the text.
In fact, some early experimentation resulted in a distinct cluster for the contractive use of "'s", although this does not occur with the parameters we chose for our canonical data. Additionally, in both cases the majority of clusters contained instances of only a single word or a small set of words: for example, a cluster consisting entirely of the word "the", a cluster consisting almost entirely of the words "he" and "she", and a cluster containing only the words "me" and "my". In total, 71% of clusters either contained only one or two words, or were determined by preceding punctuation.
However, there were qualitative differences between the two approaches. Some of the hidden state clusters appear to be based on semantic similarities that go beyond mere grammatical similarity; in particular, cluster HS-35 (as seen in Table 2) contains words related to dialogue (and additional context reveals that members of this cluster always follow the end of a quotation), while cluster HS-57 contains multiple words related to looking (including "gazed", although it does not appear in the table). Additionally, cluster HS-40 finds modal verbs with high precision and 89.9% recall, along with the words "just" and "still", which might be included due to orthographic similarity to "must" and "will".
In contrast, clusters based on output gate activations appear to be somewhat more closely tied to orthographic similarities. Several of these clusters display orthographic patterns that correlate strongly with parts of speech; for example, clusters OG-69 and OG-74 contain "-ion" nouns and "-er" nouns (but not "-er" adjectives) respectively, and rather than including all modal verbs in a single cluster, the output gate clusters group the words "would", "could", and "should" separately from "don't", "won't", and "can't" (which are in turn separate from the cluster containing "will" and "still"). This suggests that character-level patterns correlated with grammatical information could strongly influence output gate activations in a way that contributes to the grammatical understanding of the model. (While the ability of RNNs to learn and represent syntax has been studied in models with explicit access to grammatical structure (Kuncoro et al., 2017), to our knowledge syntax representations have not previously been explored in character-level RNNs.)

Extracting Word Embeddings
As seen in Section 4.3, hidden states after whitespace characters encode word-level information. This suggests a method for deriving word embeddings from a character-level model, in order to better investigate the model's word-level knowledge.
To obtain word embeddings, we ran the War and Peace model on the entire text of War and Peace, storing hidden state values at each point in the text. We then associated each word appearing at least once in the text (excluding words that are never followed immediately by a whitespace character, about 16% of all words) with the average hidden state vector for whitespace characters following the word in question. This produced a set of 512-dimensional embeddings for a vocabulary of 15,750 distinct words (17,510 words in total, but 1,760 of these are a combination of two words joined by an em-dash, which we ignore in our nearest neighbours analysis).

Table 3 shows the nearest neighbours of the embeddings of several words, as well as a count of how frequently each word appears in the text. (A tool from scikit-learn (Pedregosa et al., 2011) was used to find nearest neighbours by cosine similarity; using the Euclidean metric instead gives very similar results.) While not all nearest neighbours seem to be relevant (particularly for e.g. "write" and "food"), it nonetheless appears that for words well represented in the text, these embeddings do reflect meaning (e.g. "loved" is similar to "liked", "soldier" to "officer", and so on). In the case of words that are less well represented (e.g. "write", "food"), the nearest neighbours often seem to be retrieved based more on orthographic similarities; however, "food" is still associated with nouns, and "write" with verbs, and more generally the embedding usually appears to at least reflect basic part-of-speech information.

More surprising, however, is the treatment of words that appear only once in the text. In some cases, the embeddings of these words reflect not only grammatical information but also their actual meaning; the word "moscovite", for example, is correctly associated with the words "moravian" and "chinese", which also describe geographic origin, and the word "scrutinizingly" is associated with "questioningly" and "challengingly". In these cases, since the word "moscovites" and various forms of "scrutinize" do appear more frequently in the text, it is possible that orthographic similarity and an understanding of morphemes such as "-s", "-ing" and "-ly" contribute to these embeddings. This would be consistent with the findings of e.g. Santos and Zadrozny (2014) and others who have used the orthographic information associated with words to develop word embeddings that perform well for rare words and even out-of-vocabulary words.

Table 4: Word similarity results on the tasks of wordvectors.org (Faruqui and Dyer, 2014) for our embeddings and for the Metaoptimize (Turian et al., 2010) and Skip-Gram (Mikolov et al., 2013) embeddings. For each set of embeddings and each task we list the number of word pairs found and the measured correlation (Spearman's rank correlation coefficient).
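The embedding-extraction and neighbour-search procedures described above can be sketched in NumPy (function names are our own; the actual analysis used a scikit-learn tool for the neighbour search):

```python
import numpy as np
from collections import defaultdict

def word_embeddings(words, states):
    """words[i] is the word preceding the i-th whitespace character and
    states[i] the hidden state recorded at that whitespace; a word's
    embedding is the mean of all hidden states that follow it."""
    buckets = defaultdict(list)
    for word, h in zip(words, states):
        buckets[word].append(h)
    return {w: np.mean(hs, axis=0) for w, hs in buckets.items()}

def nearest_neighbours(word, emb, k=3):
    """The k most similar other words by cosine similarity of embeddings."""
    v = emb[word]
    def cos(u):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    ranked = sorted((w for w in emb if w != word), key=lambda w: -cos(emb[w]))
    return ranked[:k]
```

Note that a word seen only once simply gets the single hidden state recorded after it, which is why the quality of singleton embeddings reflects how much the model inferred from that one context.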
However, this does not explain the case of "deck". When this word appears in the text, it is used in its sense as a verb. The only other appearance of the string "deck" in the text is the word "decks", referring to the noun form of the word, and yet the embedding for "deck" is correctly similar to other verbs. For this reason, and because the word "deck" is short and does not consist of meaningful sub-word entities, it is unlikely that the verb-ness of "deck" was deduced from the word itself. This suggests that the model was able to determine the part of speech of the word from its use in a single context (e.g. the fact that it was preceded by "do not"). A similar mechanism may also be responsible for the understanding of the French word "tu", which is correctly identified as a personal pronoun similar to both "you" (its translation, appearing 3,509 times) and "je" (the French first-person singular pronoun, appearing 16 times) despite containing little orthographic information. It should also be noted that while it is not the norm for embeddings of singleton words to reflect meaning (as in the case of "scrutinizingly"), the majority of such embeddings do appear to at least identify part of speech (as in the case of "deck"), suggesting a fairly robust mechanism for determining this information from context.

The goal of this experiment was not to produce high-quality embeddings, but rather to understand the word-level knowledge of a character-level language model. Nonetheless, we decided to evaluate word embeddings obtained in this manner against some word similarity benchmarks. In order to obtain a broader vocabulary, we used word embeddings derived from the model we trained on the Lancaster-Oslo/Bergen corpus.
While this training data is still quite small (less than 6 million characters), it covers a wider range of authors, styles, and topics, including fiction, non-fiction, scientific papers and news articles, and thus is better suited to producing general-purpose word embeddings. The embeddings we extracted from this corpus cover a vocabulary of 38,981 words.
We assessed these embeddings using the 13 word similarity tasks of http://wordvectors.org (Faruqui and Dyer, 2014), achieving the results shown in Table 4. While these results are far from state-of-the-art, they do outperform the representations of Turian et al. (2010) on all tasks except for MTurk-771. Furthermore, our embeddings perform about as well on the "Rare Words" task as on several of the other tasks, despite the small corpus size, presumably due to the language model's use of orthographic and contextual information.

Discussion and Conclusion
In this paper, we used clustering to investigate the type of information reflected in the hidden states and output gate activations of an LSTM language model. Focusing on whitespace characters revealed clusters containing words with meaningful semantic similarities, as well as clusters reflecting orthographic patterns that correlate with grammatical information.
We also described a method for extracting word embeddings from a character-level language model. Analysis suggests that the model is able to learn meaningful semantic information even about words that appear only once in the training text, using some combination of orthographic and contextual information.
Directions for future work related to our clustering analysis could include applying similar techniques to other RNN architectures (e.g. the GRU of Cho et al. (2014)), comparing the effectiveness of different clustering algorithms for this type of analysis, and scaling up the clustering experiments using more computational resources, a more efficient algorithm, and a larger corpus.
Another promising direction is to expand on the findings of Section 5 by analyzing the quality of word embeddings produced from character-level models trained on a larger corpus, and investigating the capability of character level models to produce word embeddings for out-of-vocabulary words when given a small amount of context.
Collectively, our findings regarding clustering analysis and extraction of word embeddings offer interesting insight into the behaviour of character-level recurrent language models, and we hope that they will prove a useful contribution in the ongoing effort to increase the interpretability of recurrent neural networks.