How LSTM Encodes Syntax: Exploring Context Vectors and Semi-Quantization on Natural Text

Long Short-Term Memory recurrent neural network (LSTM) is widely used and known to capture informative long-term syntactic dependencies. However, how such information is reflected in its internal vectors for natural text has not yet been sufficiently investigated. We analyze them by learning a language model where syntactic structures are implicitly given. We empirically show that the context-update vectors, i.e. outputs of internal gates, are approximately quantized to binary or ternary values to help the language model to count the depth of nesting accurately, as Suzgun et al. (2019) recently show for synthetic Dyck languages. For some dimensions in the context vector, we show that their activations are highly correlated with the depth of phrase structures, such as VP and NP. Moreover, with an L1 regularization, we also found that whether a word is inside a phrase structure or not can be accurately predicted from a small number of components of the context vector. Even for the case of learning from raw text, context vectors are shown to still correlate well with the phrase structures. Finally, we show that natural clusters of the functional words and the parts of speech that trigger phrases are represented in a small but principal subspace of the context-update vector of LSTM.


Introduction
LSTM (Hochreiter and Schmidhuber, 1997) is one of the most fundamental architectures that support recent developments of natural language processing. It is widely used for building accurate language models by controlling the flow of gradients and tracking informative long-distance dependencies in various tasks such as machine translation, summarization and text generation (Wu et al., 2016; See et al., 2017; Fukui et al., 2016). While attention-based models such as Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) and their extensions are known to encode syntactic information (Clark et al., 2019), some studies show that LSTMs are still theoretically superior in their ability to capture syntactic dependencies (Hahn, 2019; Dai et al., 2019). Tang et al. (2018) and Mahalunkar and Kelleher (2019) also empirically demonstrate that Transformers do not outperform LSTMs on tasks that require capturing syntactic information.
Recent empirical studies attempt to explain deep neural network models and to answer questions such as how RNNs capture long-distance dependencies, and how abstract or syntactic information is embedded inside deep neural network models (Kuncoro et al., 2018; Karpathy et al., 2016; Blevins et al., 2018). They mainly discuss the extent to which the RNN acquires syntax by comparing experimental accuracy on some syntactic structures, such as number agreement (see Section 7 for details). Some studies also investigate in which vector spaces and layers specific syntactic information is captured (Liu et al., 2018; Liu et al., 2019). Lately, Suzgun et al. (2019) trained LSTM on Dyck-{1,2} formal languages, and showed that it can emulate counter machines. However, no studies have shed light on the inherent mechanisms of LSTM and their relevance to its internal representation in actual text. Weiss et al. (2018b) theoretically showed that under a realistic condition, the computational power of RNNs is much more limited than previously envisaged, despite the fact that RNNs are Turing complete (Chen et al., 2018). On the other hand, they also showed that LSTM is stronger than the simple RNN (SRNN) and GRU owing to the counting mechanism LSTM is argued to possess. Following these results, Merrill (2019) introduces an inverse temperature θ into the sigmoid and tanh functions and takes limits as θ → ∞, thus assuming that all gates of LSTM are asymptotically quantized: e.g. $\lim_{\theta\to\infty} \sigma(\theta x) \in \{0, 1\}$ and $\lim_{\theta\to\infty} \sigma(\theta x)\tanh(\theta y) \in \{-1, 0, 1\}$. Under this assumption, Merrill shows that LSTMs work like counter machines, or more precisely, that the expressiveness of LSTMs is asymptotically equivalent to that of some subclass of counter machines.
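The asymptotic quantization assumed by Merrill (2019) can be illustrated numerically. The following minimal sketch is not taken from the cited work; it simply evaluates σ(θx) and σ(θx)·tanh(θy) for growing θ. The function names and the sample inputs are hypothetical choices made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sharpened_gate(x, theta):
    """sigma(theta * x): approaches 0 or 1 as theta grows (for x != 0)."""
    return sigmoid(theta * x)

def sharpened_update(x, y, theta):
    """sigma(theta * x) * tanh(theta * y): approaches -1, 0 or 1."""
    return sigmoid(theta * x) * math.tanh(theta * y)

if __name__ == "__main__":
    # Illustrative inputs only; larger theta pushes the values to the limits.
    for theta in (1, 10, 100):
        print(theta,
              sharpened_gate(0.5, theta),
              sharpened_update(0.5, -0.3, theta),
              sharpened_update(-0.5, 0.3, theta))
```

For a fixed nonzero input, the gate value moves toward 0 or 1 and the product moves toward −1, 0 or 1 as θ increases, which is the sense in which the gates are said to be asymptotically quantized.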
While those results are significant and give us theoretical clues to understand how LSTMs acquire syntactic representations in their hidden vectors, it is not yet known whether similar phenomena occur in models learned from real-world data. Regarding this point, we show that such quantization often occurs in real situations, and we bridge the gap between theories and practical models through statistical analysis of the internal vectors of LSTMs trained on both raw texts and texts augmented by implicit syntactic symbols. We first explore the behaviors of LSTM language models (LSTM-LMs) and the representation of the syntactic structures by giving linearized syntax trees implicitly. Then, we show that LSTM also acquires a representation of syntactic information in its internal vectors even from raw text, by statistically analyzing the internal vectors corresponding to syntactic functions. We empirically show that the representations of parts of speech such as NP and VP and the syntactic functions of specific words, both of which often act as syntactic triggers, are acquired in the space of context-update vectors, while syntactic dependencies are accumulated in the space of context vectors.

LSTM Language Model
In this study, we consider language models based on a one-layer LSTM because our aim is to clarify how LSTM captures syntactic structures. For a sentence $w_1 w_2 \cdots w_n$, as shown in Figure 1, let $h_t$ denote the output vector of an LSTM after feeding the $t$-th word $w_t$, $c_t$ denote the context vector, and $\vec{w}_t$ denote the embedding of the word $w_t$. Let $\mathrm{LSTM}(c, h, \vec{w}, \Theta)$ be a function of $c$, $h$ and $\vec{w}$ to determine the next output and context vectors:

$$(h_t, c_t) = \mathrm{LSTM}(c_{t-1}, h_{t-1}, \vec{w}_t, \Theta) \tag{1}$$

where $\Theta$ represents the set of parameters to be optimized. The language model maximizes the probability of the next word $w_{t+1}$ given the word sequence up to $t$, $w_{1:t}$:

$$p(w_{t+1} \mid w_{1:t}) = p(w_{t+1} \mid w_t, c_{t-1}, h_{t-1}) = s(W h_t + b) \tag{2}$$

Here $s(\cdot)$ is the softmax function, and $W$ and $b$ are a weight matrix and a bias vector, respectively. As shown in equation (2), the history of words up to $t-1$ does not appear explicitly in the conditional part of the probability. The contextual information is represented in some form in the context vector $c_t$ and the output vector $h_t$. The following standard version is used as the target LSTM architecture among multiple variations (Greff et al., 2017):

$$f_t = \sigma(A x_t) \tag{3}$$
$$u_t = \sigma(B x_t) \odot \tanh(C x_t) \tag{4}$$
$$c_t = f_t \odot c_{t-1} + u_t \tag{5}$$
$$h_t = \sigma(D x_t) \odot \tanh(c_t) \tag{6}$$

Here, $\odot$ is the Hadamard (element-wise) product, and $x_t$ is the concatenated vector of $\vec{w}_t$, $c_{t-1}$, $h_{t-1}$, and 1. $A$, $B$, $C$, $D$ are weight matrices representing affine transformations. In this paper, $u$ and $f$, which are derived from $x$ by equations (3) and (4) and directly affect $c$, are also analyzed in addition to $c$ and $h$. Hereinafter, $u$ and $f$ are called the context-update vector and the forget vector, respectively.

The fundamental focus of this study is a natural semi-quantization of $f$, $c$, $u$, and $h$ as the result of learning. First, each element of $u$ is approximately quantized, or ternarized, to {−1, 0, 1} as shown in Figure 1(c). This discretization is a consequence of equation (4): the distribution of the first term is almost concentrated on 0 and 1, and that of the second term is concentrated on ±1.
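To make the roles of f, u, c and h concrete, the following minimal sketch implements one step of an LSTM cell in plain Python. It assumes the common peephole form f = σ(Ax), u = σ(Bx) ⊙ tanh(Cx), c_t = f ⊙ c_{t−1} + u and h_t = σ(Dx) ⊙ tanh(c_t), where x concatenates the word embedding, c_{t−1}, h_{t−1} and 1. The function names and the hand-picked saturating weights are hypothetical; they are not trained parameters and serve only to show how saturated gates yield near-integer steps in c.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lstm_step(w_emb, c_prev, h_prev, A, B, C, D):
    """One step of the assumed cell; returns (f, u, c, h)."""
    x = list(w_emb) + list(c_prev) + list(h_prev) + [1.0]
    f = [sigmoid(z) for z in matvec(A, x)]                 # forget vector
    u = [sigmoid(a) * math.tanh(b)                         # context-update vector
         for a, b in zip(matvec(B, x), matvec(C, x))]
    c = [fi * ci + ui for fi, ci, ui in zip(f, c_prev, u)] # context vector
    h = [sigmoid(o) * math.tanh(ci)                        # output vector
         for o, ci in zip(matvec(D, x), c)]
    return f, u, c, h

def run_demo(inputs):
    """Feed scalar 'embeddings' through a 1-dimensional cell with saturated gates."""
    big = 20.0                      # hypothetical weight scale that saturates the gates
    A = [[0.0, 0.0, 0.0, big]]      # forget gate close to 1 (bias only)
    B = [[0.0, 0.0, 0.0, big]]      # input gate close to 1
    C = [[big, 0.0, 0.0, 0.0]]      # candidate close to the sign of the input
    D = [[0.0, 0.0, 0.0, big]]      # output gate close to 1
    c, h, trace = [0.0], [0.0], []
    for w in inputs:
        f, u, c, h = lstm_step([w], c, h, A, B, C, D)
        trace.append(c[0])
    return trace
```

With these saturated weights, calling `run_demo` on inputs of +1 and −1 produces a context value that moves up and down in steps of almost exactly 1, which is the accumulation behavior the analysis relies on.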
We experimentally confirmed that even if each element of u is strictly ternarized by thresholds, it does not lose important information. For example, Table 1 lists the words most similar to the word "her" as measured by the internal vectors (see Section 6.2 for details). θ(u), which is obtained by thresholding u at ±0.9, collects syntactically similar words as appropriately as u does.
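The thresholding operation θ(u) can be sketched as follows. This is an illustrative reading of "thresholding u by ±0.9": values at or above 0.9 map to 1, values at or below −0.9 map to −1, and everything else maps to 0. The treatment of values exactly on the boundary and the function name are assumptions.

```python
def ternarize(u, threshold=0.9):
    """Map each element of u to -1, 0 or 1 using a symmetric threshold."""
    out = []
    for v in u:
        if v >= threshold:
            out.append(1)
        elif v <= -threshold:
            out.append(-1)
        else:
            out.append(0)
    return out

if __name__ == "__main__":
    # Hypothetical context-update values for one token.
    print(ternarize([0.97, -0.99, 0.30, -0.50, 0.02]))
```

The ternarized vectors can then be averaged per word and compared in the same way as the raw vectors.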

Each element of f is also approximately binarized to {0, 1}, as seen in Figure 1(a). The context-update vector u is added to c and accumulated as long as the value of f is close to 1. Owing to the effects of such quantization and accumulation, Figure 1(b) shows that the distribution of each element of c has peaks at integers.
As we discuss in Section 5.2, this quantization enables the accurate counting of the number of words with syntactic features, such as the nesting of parentheses. Note that Figure 1 shows the results of learning from the raw text of the Penn Treebank WSJ corpus (Taylor et al., 2003), and the characteristics described above do not change even if settings such as the dataset and the dimensionality of the vectors are changed.
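The counting effect of quantization and accumulation can be sketched for a single dimension. The example below is a hand-built illustration, not a trained model: it assumes that an opening parenthesis contributes an update close to +1, a closing parenthesis an update close to −1, and that the forget value stays close to 1. The function name and the numeric values are hypothetical.

```python
def track_depth(tokens, f=1.0, u_open=1.0, u_close=-1.0):
    """Accumulate c_t = f * c_{t-1} + u_t for one dimension and return the trace."""
    c, trace = 0.0, []
    for tok in tokens:
        if tok == "(":
            u = u_open
        elif tok == ")":
            u = u_close
        else:
            u = 0.0          # other tokens leave this dimension unchanged
        c = f * c + u
        trace.append(c)
    return trace

if __name__ == "__main__":
    tokens = list("(()())")
    print(track_depth(tokens))                                       # exact quantization
    print(track_depth(tokens, f=0.999, u_open=0.98, u_close=-0.98))  # approximate quantization
```

With exactly quantized values the trace equals the nesting depth. With approximately quantized values the trace stays close to integers for short sequences, so rounding still recovers the depth.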

Hypotheses and Outline of Analyses
To understand the behavior of LSTM further, we try to answer two kinds of questions: (a) what information is relevant to syntax, and (b) how this information is correlated with syntactic behavior. In particular, we will examine: (1) which of the internal vectors (i.e. h, c, and u) of LSTM correlates highly with the prediction of the phrase structure and its nesting (Sections 5.1 and 5.2), and (2) how well these internal vectors or some subsets of their dimensions can predict the syntactic structures (Section 5.3). Since recognition of syntax inevitably requires recognition of the part-of-speech for each word, we also investigate: (3) how the contextual part-of-speech is represented in the internal vectors of the LSTM, and how the differences between them can be captured using PCA (Section 6).

Configuration of Datasets
We use sentences with syntax trees in the Penn Treebank Wall Street Journal (PTB-WSJ) corpus (Marcus et al., 1994; Taylor et al., 2003) as data for training and testing. We randomly chose 10% of the data for testing. Phrase structures are linearized and inserted into, or substituted for, sentences as auxiliary tokens in the following ways: Paren consists of only '(' and ')' without words; Paren+W consists of '(' and ')' and words; Tag consists of '(T' and 'T)' without words, where T represents a tag in the Penn Treebank; Tag+W consists of '(T' and 'T)' and words; and Words consists of just the raw words.
For example, a sentence in the original data "(NP (DT a) (JJ nonexecutive) (NN director))" is converted to "(() () ())" in Paren and "(NP (DT DT) (JJ JJ) (NN NN) NP)" in Tag. The latter needs some attention; here, each space-separated token such as "(", "(NP", or "JJ)" is considered as a single word. The size of the vocabulary in Paren and Tag is 2 and 140, respectively. For Paren+W and Tag+W, less frequent words were replaced by their parts of speech so that the total number of distinct words was less than 10,000. Additionally, we included a small experiment using Lisp programs: in particular, we used the slib standard library of Scheme and conducted experiments under the scenarios Paren and Tag to show that LSTM works similarly for "languages" other than WSJ. Note that in all the scenarios above, LSTM does not know the correspondence between "(T" and "T)" for each tag T in advance, because these auxiliary "words" are simply converted to integers like any other words and fed to LSTM. Therefore, syntactic supervision in our experiments is not complete but only hinted.
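The conversion of a bracketed tree into the five scenarios can be sketched as below. The token sequences for Paren and Tag follow the example in the text; the exact layouts for Paren+W and Tag+W (a word placed between its opening and closing tokens) are assumptions, since no example of them is given. The function names are hypothetical, and the replacement of infrequent words by their parts of speech is not modeled.

```python
def tokenize(tree):
    """Split a bracketed tree such as '(NP (DT a))' into '(', ')', tags and words."""
    return tree.replace("(", " ( ").replace(")", " ) ").split()

def linearize(tree, scenario):
    """Return the token list for scenario in {Paren, Paren+W, Tag, Tag+W, Words}."""
    toks = tokenize(tree)
    out, stack = [], []
    i = 0
    while i < len(toks):
        t = toks[i]
        if t == "(":
            tag = toks[i + 1]            # the tag directly follows '('
            stack.append(tag)
            if scenario in ("Paren", "Paren+W"):
                out.append("(")
            elif scenario in ("Tag", "Tag+W"):
                out.append("(" + tag)    # one auxiliary token, e.g. '(NP'
            i += 2
        elif t == ")":
            tag = stack.pop()
            if scenario in ("Paren", "Paren+W"):
                out.append(")")
            elif scenario in ("Tag", "Tag+W"):
                out.append(tag + ")")    # one auxiliary token, e.g. 'NP)'
            i += 1
        else:                            # a terminal word
            if scenario in ("Paren+W", "Tag+W", "Words"):
                out.append(t)
            i += 1
    return out

if __name__ == "__main__":
    tree = "(NP (DT a) (JJ nonexecutive) (NN director))"
    for name in ("Paren", "Paren+W", "Tag", "Tag+W", "Words"):
        print(name, " ".join(linearize(tree, name)))
```

Because each auxiliary token is emitted as an opaque string, a model trained on the output sees "(NP" and "NP)" as unrelated vocabulary items, which matches the hinted (not complete) supervision described above.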

Learning Models
The simplest architecture for an LSTM language model is employed, which is composed of a single LSTM layer with a word embedding and a softmax layer.
Table 2: Micro-averaged precision of prediction for the beginnings of phrases (BOPs), ends of phrases (EOPs), end of sentence (EOS), and raw words.
We compare the accuracy of predicting the next word among different datasets to phenomenologically confirm the acquisition of phrase structures. As shown in Table 2, the end of sentence (EOS) is predicted by LSTM almost perfectly in terms of both precision and recall for all datasets except for Words. Because EOS occurs in a sentence if and only if the numbers of '(T' and 'T)' are equal for all T, we can conjecture that the LSTM model accurately counts their balance and nesting.
In Figure 2, the groups of Beginning of Phrase (BOP, i.e. "(T" for a tag T) and End of Phrase (EOP, i.e. "T)") are separated by the dashed lines. We can see that BOP and EOP are correctly classified across groups (Figure 2(b), 2(c)). Furthermore, each EOP is rarely misclassified as another EOP. This implies that not only is the balance of the numbers of '(T' and 'T)' completely learned, but their order of appearance is also learned quite accurately. Comparing Figure 2(c) to 2(b), we can see that the precisions for BOP and EOP are improved by including intervening words. Similarly, the precisions for the words are also improved by including BOP and EOP (Table 2). This is because the presence of words serves as a clue for predicting phrase structures, and vice versa.

Representation of Syntactic Structures
Following these investigations, we next examine how each tag of the phrase structure and the depth of the nesting are embedded in the internal vectors of LSTM.

Depth of Nested Phrases
We first examine the correlation coefficients between the depth of nesting and the value of each dimension of the context vector c. Results are shown in the upper half of Figure 3(a) and 3(b) for Paren and Paren+W, respectively. There are some dimensions whose correlations are very high: 0.9969, 0.9978, and 0.9995 for Paren, Paren+W, and Lisp, respectively. Let $\hat{i}$ denote the dimension for which this correlation is maximized. As Figure 3(a) and 3(b) show, the depth of the nesting correlates linearly with $c_{\hat{i}}$ and is almost equal to $|c_{\hat{i}}| - \alpha$ for some constant $\alpha$. In contrast, the values of $h_{\hat{i}}$ in h are scattered; especially for Paren+W, $|h_{\hat{i}}|$ does not converge to 1 and has a large variance between 0 and 1. The first term on the right-hand side of equation (6) leads to this variance, because the second term is nearly 1 or −1 when the nesting is deep.
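The selection of the most correlated dimension can be sketched as follows. The example computes the Pearson correlation between the nesting depth and each dimension of the context vector and returns the dimension with the largest absolute correlation. Using the absolute value is an assumption, motivated by the fact that the depth is related to the magnitude of the selected component; the function names and the toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def best_dimension(context_vectors, depths):
    """Return (index, correlation) of the dimension most correlated with depth."""
    dims = len(context_vectors[0])
    corrs = [pearson([c[i] for c in context_vectors], depths) for i in range(dims)]
    i_hat = max(range(dims), key=lambda i: abs(corrs[i]))
    return i_hat, corrs[i_hat]

if __name__ == "__main__":
    # Toy context vectors (one per token) and the nesting depth at each token.
    C = [[0.1, -1.0], [0.3, -2.0], [0.2, -3.0], [0.0, -2.0]]
    depths = [1, 2, 3, 2]
    print(best_dimension(C, depths))
```

In the toy data the second dimension decreases by one for each additional level of nesting, so it is selected with a correlation of magnitude 1.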
In Figure 3(c), we randomly choose dozens of sentences from the test data whose lengths are less than 100, and plot the values of $c_{\hat{i}}$ as time proceeds. We can see that a mesh structure is obtained with a step height of nearly 1 in spite of the continuous space of c. This is because, as described in Section 2.2, the context-update vectors u are approximately quantized so that $u_{\hat{i}}$ is almost binarized to ±1. In addition, the end points of the graphs have values of approximately −2 for any sentence. This implies that the EOS can be judged easily by whether a particular dimension of $c_t$ is approximately −2 or not.
During this study, Suzgun et al. (2019) independently discovered diagrams similar to Figure 3(c) and 3(d). However, their experiments are conducted only on the very simple formal languages Dyck-{1,2} and the number of dimensions is less than 10, as opposed to our experiments on empirical data with state vectors of high dimensionality (over 100).
Table 3: L1 logistic regression from c to determine VP for Tag and Tag+W. We show the number of nonzero elements (#nnz) and its ratio for each regularization. The chance level of prediction is around 0.7.

Prediction with a Single Component
For Tag and Tag+W, unlike Paren and Paren+W, there are no dimensions that completely correlate with the depth of the nesting. We extract the dimension $\hat{i}$ that has the largest correlation, and plot the relations between $c_{\hat{i}}$ and the depth of the nesting of NP and VP in Figure 4. While the absolute value of $c_{\hat{i}}$ increases almost linearly with the depth, its variance is not small except for NP on Tag. Thus, we cannot say that a single element of c purely encodes the depth of the nesting for a particular tag. The right half of each panel in Figure 4 shows the two histograms that correspond to $c_{\hat{i}}$. We can observe that each activation histogram has peaks at integer values. This shows the effect of the natural quantization of c. We call the ratio of the overlap of the normalized histograms the histogram overlap ratio. The closer the histogram overlap ratio is to 0, the higher the discriminative accuracy of the dimension. The minimum histogram overlap ratios for Tag+W are 0.28 for VP and 0.06 for NN. From the perspective of the histogram overlap ratio, classifying whether a word is in a phrase by a single dimension is easy for NN and slightly difficult for VP.
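One possible way to compute a histogram overlap ratio is sketched below. It assumes that the two histograms share the same bins, that each is normalized to sum to 1, and that the overlap is the sum of the bin-wise minima. This concrete definition, the binning parameters and the function names are assumptions; the text only states that the ratio measures the overlap of the normalized histograms.

```python
def histogram(values, lo, hi, bins):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), bins - 1)  # put v == hi in the last bin
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins

def overlap_ratio(inside, outside, lo, hi, bins=20):
    """Sum of bin-wise minima of two normalized histograms (0 = disjoint, 1 = identical)."""
    p = histogram(inside, lo, hi, bins)
    q = histogram(outside, lo, hi, bins)
    return sum(min(a, b) for a, b in zip(p, q))

if __name__ == "__main__":
    # Hypothetical activations of one dimension for words inside / outside a phrase.
    inside = [0.9, 1.1, 2.0, 1.0]
    outside = [0.0, 0.1, -0.1, 1.0]
    print(overlap_ratio(inside, outside, lo=-1.0, hi=3.0, bins=8))
```

A dimension whose activations for words inside and outside a phrase fall into different bins gives a ratio near 0, which corresponds to an easy single-dimension classification.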
For the NN (common noun) tag, Figure 4(e) shows that there is no single dimension in c that highly correlates with the depth of the nesting (ρ = 0.31). On the other hand, the minimum histogram overlap ratio is 0.07, which is sufficiently low. The right histogram of Figure 4(e) shows that the occurrence of the token '(NN' has the effect of resetting some dimension of c.
[Figure caption fragment: Linear regression and $c_{\hat{i}}$ with the highest correlation coefficient are also shown. The regression results are scaled, and the depth plot is shifted slightly to the right for clarity.]

Representation by a Subspace
To find a clear representation of the depth of the nesting within c, we try to extract a subspace that has high correlations with it. First, we adopt a linear regression to predict the depth of nesting from c. Second, we examine the number of effective dimensions; the results of regression for VP are shown in Figure 5(a)-(c). Compared with choosing the best single dimension, the correlation coefficients are clearly improved and nearly equal to 1: 0.983 for Tag and 0.995 for Tag+W. This also holds for the nesting of lambda in Lisp programs, where the coefficient is 0.940. We also empirically show that a few dimensions are sufficient to classify whether a word is in VP or not. Table 3 shows the classification accuracies: for Tag+W, the accuracy stays above 0.99 even when the ratio of non-zero dimensions decreases to 5%. For the case of Words, i.e. learning from raw text, the correlation coefficients become smaller, but the depth is still positively correlated with c, as shown in Figure 5(c); c also has a smaller prediction error than u. In summary, the depth of the nesting of phrase structures can be represented by a sum of a relatively small number of elements of the context vector c, and this relationship is approximately linear. The prediction for Words is less accurate than for the other datasets with implicitly given syntax.
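The idea of selecting a few dimensions with L1-regularized logistic regression can be sketched as follows. The solver used in the experiments is not specified in the text; this example substitutes a plain proximal-gradient method (gradient step followed by soft-thresholding), and the regularization strength, learning rate, function names and toy data are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward 0 by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, steps=200):
    """Minimize mean logistic loss + lam * ||w||_1 (bias is not penalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [soft_threshold(wj - lr * gj / n, lr * lam) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def accuracy(X, y, w, b):
    hits = sum((sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(X)

if __name__ == "__main__":
    # Toy data: only the first of three features carries the label.
    X = [[1.0 if k % 2 == 0 else -1.0,
          ((k // 2) % 2) * 2.0 - 1.0,
          ((k // 4) % 2) * 2.0 - 1.0] for k in range(40)]
    y = [1 if k % 2 == 0 else 0 for k in range(40)]
    w, b = fit_l1_logistic(X, y)
    print("nonzero weights:", sum(1 for v in w if v != 0.0), "accuracy:", accuracy(X, y, w, b))
```

The soft-thresholding step sets uninformative weights exactly to zero, so the number of nonzero elements (#nnz) can be read directly from the fitted weight vector; a larger `lam` yields fewer nonzero elements.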

Internal Representation of Syntactic Functions
Finally, we investigate how syntactic functions, such as part-of-speech (POS) and functional words, are represented in internal vectors when LSTM is trained on raw text. We also show that these syntactic functions are naturally represented in the context-update vector u, rather than c.

Representation of a Part-of-Speech
We investigate whether the LSTM-LM automatically recognizes POS when learning from raw text, because it is difficult to acquire higher phrase structures without ever recognizing POS. For this purpose, we employ principal component analysis (PCA) to reduce the dimensionality of the internal vectors of LSTM and observe unsupervised clusters. In Figure 6(a) and 6(b), the vertical axis denotes the standard deviation of each principal component over the observed data. The statistics over all the occurrences of words, represented by the blue line, show that the variances are largely influenced by frequent words. Therefore, we next computed the principal components over unique words, as represented by the red lines. For u, the standard deviations of the main components decrease after this processing. This implies that the variance within each frequent word significantly affects the result of the PCA. Figure 6(b) shows the effect of cancelling the frequencies of the words. In this analysis, after the number of dimensions is reduced by applying PCA to the internal vectors of all the occurrences, PCA is applied again to the averaged vectors, each of which corresponds to a unique word (we call this analysis PCA-uq). We can see that POSs are clustered in u in an unsupervised fashion. In particular, the result of PCA-uq shows that there are some dimensions that clearly distinguish similar types such as VB and VBZ, or NN and NNS, and also distinguish between these pairs. Furthermore, the distinction between verbs and nouns is evident in the first principal component of PCA-uq, in the left panel of Figure 7.
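The two-stage procedure called PCA-uq can be sketched as below: PCA over all occurrences, averaging of the reduced vectors per unique word, and a second PCA over those averages. How the PCA was computed in the experiments is not stated; this example substitutes power iteration with deflation so that it needs only the standard library. The function names, the numbers of components and the toy vectors are hypothetical.

```python
import math
import random

def center(X):
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in X]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def principal_directions(Xc, k, iters=200, seed=0):
    """Top-k principal directions of centered rows Xc (power iteration with deflation)."""
    rng = random.Random(seed)
    d = len(Xc[0])
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            for c in comps:                       # stay orthogonal to earlier directions
                proj = dot(v, c)
                v = [a - proj * b for a, b in zip(v, c)]
            scores = [dot(row, v) for row in Xc]  # apply X^T X to v
            v = [sum(s * row[j] for s, row in zip(scores, Xc)) for j in range(d)]
            norm = math.sqrt(dot(v, v))
            if norm == 0.0:
                break
            v = [a / norm for a in v]
        comps.append(v)
    return comps

def pca_project(X, k):
    """Center X and project it onto its first k principal directions."""
    Xc = center(X)
    comps = principal_directions(Xc, k)
    return [[dot(row, c) for c in comps] for row in Xc]

def pca_uq(words, vectors, k_all, k_uq):
    """PCA over all occurrences, then PCA over the per-word averages."""
    reduced = pca_project(vectors, k_all)
    sums, counts = {}, {}
    for w, z in zip(words, reduced):
        acc = sums.get(w, [0.0] * k_all)
        sums[w] = [a + b for a, b in zip(acc, z)]
        counts[w] = counts.get(w, 0) + 1
    uniq = list(sums)
    means = [[a / counts[w] for a in sums[w]] for w in uniq]
    coords = pca_project(means, k_uq)
    return dict(zip(uniq, coords))
```

Averaging before the second PCA gives each unique word the same weight, so a frequent word no longer dominates the principal directions. Note that the sign of each principal direction is arbitrary, so only relative positions along a component are meaningful.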

Representation of Functional Words
Because functional words play an important role in syntactic parsing, revealing their representation in the internal vectors is important for understanding the mechanism of syntax acquisition by LSTM. To verify whether u and other internal vectors represent the syntactic roles of functional words, we first take the average of the vectors for each word, and compute the cosine similarities between them. Table 4 lists the words that have the highest similarities to some example words. From the table, it can be seen that the context-update vector u captures their syntactic roles more appropriately than the context vector c itself. Since c possesses contextual information in a sentence, the co-occurrence of words will affect the similarity through c. We also examined h and confirmed that its clustering ability is basically similar to that of c.
[Figure caption: Representation of "that" in u for each usage in the corpus (t-SNE). Parts of speech (not used in learning) are marked with different colors.]
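The similarity computation described above can be sketched as follows: the vectors of all occurrences of a word are averaged, and words are ranked by cosine similarity to a query word. The function names and the two-dimensional toy vectors are hypothetical and stand in for the context-update vectors collected from a trained model.

```python
import math

def average_by_word(words, vectors):
    """Return {word: mean vector over all occurrences of that word}."""
    sums, counts = {}, {}
    for w, v in zip(words, vectors):
        acc = sums.get(w, [0.0] * len(v))
        sums[w] = [a + b for a, b in zip(acc, v)]
        counts[w] = counts.get(w, 0) + 1
    return {w: [a / counts[w] for a in s] for w, s in sums.items()}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(word, means, k=3):
    """The k words whose mean vectors are most similar to that of `word`."""
    sims = [(cosine(means[word], v), w) for w, v in means.items() if w != word]
    return [w for _, w in sorted(sims, reverse=True)[:k]]

if __name__ == "__main__":
    # Toy occurrences; real inputs would be the vectors u observed at each token.
    words = ["her", "his", "the", "her", "runs"]
    vectors = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.8, 0.0], [0.0, 1.0]]
    means = average_by_word(words, vectors)
    print(nearest("her", means, k=2))
```

The same functions can be applied to c, h or a ternarized u to compare which vector space groups words by syntactic role.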

Representation of Ambiguity with Functional Words
The word "that" is a representative ambiguous functional word that has multiple grammatical meanings: it has three main meanings, each of which is syntactically similar to the word "if", "this", or "which". Figure 7 shows how these meanings are encoded in u, by mapping to two dimensions using t-SNE. Although they are not completely separated, we can see that they are clustered according to their syntactic behaviors in context.

Related Work
In research on how LSTM tracks long-term dependencies, the behaviors of LSTMs with several dimensions have been studied using artificial languages (Tomita, 1982; Pérez-Ortiz et al., 2003; Schmidhuber, 2015). With recent applications of LSTM to various tasks, studies are being conducted on how LSTM recognizes syntax and long-term dependencies (Adi et al., 2017). For instance, Linzen et al. (2016) use number agreement to determine whether a language model using LSTM truly captures syntax. Khandelwal et al. (2018) evaluate how the distance between words affects the prediction in LSTM-LM. Weiss et al. (2018a) utilize the learned LSTM to construct deterministic automata. Furthermore, Avcu et al. (2017) control the complexity of long-range dependency using SP-k languages, and verify whether LSTM can track them. Several studies have attempted to theoretically understand the learning ability of language models using RNNs, including LSTM and GRU (Cho et al., 2014; Chen et al., 2018; Weiss et al., 2018b).

Conclusion
In this paper, we empirically investigated various behaviors of LSTM on natural text by looking into its hidden state vectors. In contrast to previous work that deals only with artificial data, we clarified that the updates u of the context vectors c are approximately discretized and accumulated in a low-dimensional subspace, leading to the approximate counter machines discussed in Section 4 and a clear representation of syntactic functions as shown in Section 5, in spite of the high dimensionality of the state vectors explored in this study. In particular, we showed that the representations of POS are acquired in the space of u rather than c and h in an unsupervised manner. The fact that the first principal component of PCA-uq for u encodes the difference between NP and VP is not only significant for understanding how LSTM-LM acquires syntax, but can also be seen as a result of extracting the most important syntactic factor using LSTM with respect to the target language.