Finding Hierarchical Structure in Neural Stacks Using Unsupervised Parsing

Neural network architectures have been augmented with differentiable stacks in order to introduce a bias toward learning hierarchy-sensitive regularities. It has, however, proven difficult to assess the degree to which such a bias is effective, as the operation of the differentiable stack is not always interpretable. In this paper, we attempt to detect the presence of latent representations of hierarchical structure through an exploration of the unsupervised learning of constituency structure. Using a technique due to Shen et al. (2018a,b), we extract syntactic trees from the pushing behavior of stack RNNs trained on language modeling and classification objectives. We find that our models produce parses that reflect natural language syntactic constituencies, demonstrating that stack RNNs do indeed infer linguistically relevant hierarchical structure.


Introduction
Sequential models such as long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997) have been shown to exhibit qualitative behavior that reflects a sensitivity to structurally conditioned regularities, such as subject-verb agreement (Linzen et al., 2016; Gulordava et al., 2018). However, detailed analysis of such models has shown that this apparent sensitivity to structure does not always generalize to inputs with a high degree of syntactic complexity (Marvin and Linzen, 2018). These observations suggest that sequential models may not in fact be representing sentences in the kind of hierarchically organized representations that we might expect.
Stack-structured recurrent memory units (Joulin and Mikolov, 2015; Grefenstette et al., 2015; Yogatama et al., 2018; and others) offer a possible method for explicitly biasing neural networks to construct hierarchical representations and make use of them in their computation. Since syntactic structures can often be modeled in a context-free manner (Chomsky, 1956, 1957), the correspondence between pushdown automata and context-free grammars (Chomsky, 1962) makes stacks a natural data structure for the computation of hierarchical relations. Recently, Hao et al. (2018) have shown that stack-augmented RNNs (henceforth stack RNNs) have the ability to learn classical stack-based algorithms for computing context-free transductions such as string reversal. However, they also find that such algorithms can be difficult for stack RNNs to learn. For many context-free tasks such as parenthesis matching, the stack RNN models they consider instead learn heuristic "push-only" strategies that essentially reduce the stack to unstructured recurrent memory. Thus, even if stacks allow hierarchical regularities to be expressed, the bias that stack RNNs introduce does not guarantee that the networks will detect them.
The current paper aims to move beyond the work of Hao et al. (2018) in two ways. First, while that work was based on artificially generated formal languages, this paper considers the ability of stack RNNs to succeed on tasks over natural language data. Specifically, we train such networks on two objectives: language modeling and the number prediction task, a classification task proposed by Linzen et al. (2016) to determine whether or not a model can capture structure-sensitive grammatical dependencies. Second, in addition to using visualizations of the pushing and popping actions of the stack RNN to assess its hierarchical sensitivity, we use a technique proposed by Shen et al. (2018a,b) to assess the presence of implicitly represented, hierarchically organized structure through the task of unsupervised parsing. We extract syntactic constituency trees from our models and find that they produce parses that broadly reflect phrasal groupings of words in the input sentences, suggesting that our models utilize the stack in a way that reflects the syntactic structures of input sentences.

This paper is organized as follows. Section 2 introduces the architecture of our stack models, which extends the architecture of Grefenstette et al. (2015) by allowing multiple items to be pushed to, popped from, or read from the stack at each computational step. Section 3 then describes our training procedure and reports results on language modeling and agreement classification. Section 4 investigates the behavior of the stack RNNs trained on these tasks by visualizing their pushing behavior. Building on this, Section 5 describes how we adapt Shen et al.'s (2018a,b) unsupervised parsing algorithm to stack RNNs and evaluates the degree to which the resulting parses reveal structural representations in stack RNNs. Section 6 discusses our observations, and Section 7 concludes.

Network Architecture
In a stack RNN (Grefenstette et al., 2015; Hao et al., 2018), a neural network adhering to a standard recurrent architecture, known as a controller, is enhanced with a non-parameterized stack. At each time step, the controller network receives an input vector $x_t$ and a recurrent state vector $h_{t-1}$ provided by the controller architecture, along with a read vector $\mathbf{r}_{t-1}$ summarizing the top elements on the stack. The controller interfaces with the stack by computing continuous values that serve as instructions for how the stack should be modified. These instructions consist of $v_t$, a vector that is pushed to the top of the stack; $d_t$, a number representing the strength of the newly pushed vector $v_t$; $u_t$, the number of items to pop from the stack; and $r_t$, the number of items to read from the top of the stack. (The scalar read strength $r_t$ is distinct from the read vector $\mathbf{r}_t$, which is set in bold.) The instructions $v_t$, $u_t$, $d_t$, $r_t$ are produced by the controller as output and presented to the stack. The stack then computes the next read vector $\mathbf{r}_t$, which is given to the controller at the next time step. This general architecture is portrayed in Figure 1. In the next two subsections, we describe how the stack computes $\mathbf{r}_t$ using the instructions $v_t$, $u_t$, $d_t$, $r_t$ and how the controller computes the stack instructions.
[Figure 1: The stack RNN architecture, consisting of a controller and a stack.]

Stack Actions
A stack at time $t$ consists of a sequence of vectors $\langle V_t[1], V_t[2], \dots, V_t[t] \rangle$. By convention, $V_t[t]$ is the "top" element of the stack, while $V_t[1]$ is the "bottom" element. Each element $V_t[i]$ of the stack is associated with a strength $s_t[i] \geq 0$. The strength of a vector $V_t[i]$ represents the degree to which the vector is on the stack: a strength of 1 means that the vector is "fully" on the stack, while a strength of 0 means that the vector has been popped from the stack. The strengths are organized into a vector $s_t = \langle s_t[1], s_t[2], \dots, s_t[t] \rangle$.

At time $t$, the stack receives a set of instructions $v_t$, $u_t$, $d_t$, $r_t$ and performs three operations: popping, pushing, and reading, in that order. The popping operation is implemented by reducing the strength of each item on the stack by a number $u_t[i]$, ensuring that the strength of each item can never fall below 0:

$$s_t[i] = \max(0, s_{t-1}[i] - u_t[i]) \quad \text{for } i < t.$$
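The $u_t[i]$s are computed as follows. The total amount of strength to be reduced is the pop strength $u_t$. Popping begins by attempting to reduce the strength $s_{t-1}[t-1]$ of the top item on the stack by the full pop strength $u_t$, so that $u_t[t-1] = u_t$. For each $i$, if $s_{t-1}[i] - u_t[i] < 0$, then the $i$th item has been fully popped from the stack, "consuming" a portion of the pop strength of magnitude $s_{t-1}[i]$. The strength of the next item is then reduced by an amount $u_t[i-1]$ given by the "remaining" pop strength $u_t[i] - s_{t-1}[i]$:

$$u_t[i] = \begin{cases} u_t, & i = t - 1 \\ \max(0, u_t[i+1] - s_{t-1}[i+1]), & i < t - 1. \end{cases}$$

The pushing operation simply places the vector $v_t$ at the top of the stack with strength $d_t$. Thus, $V_t$ and $s_t[t]$ are updated as follows.

$$V_t[i] = \begin{cases} v_t, & i = t \\ V_{t-1}[i], & i < t, \end{cases} \qquad s_t[t] = d_t.$$

Note that the strengths $s_t[1], \dots, s_t[t-1]$ have already been updated during the popping step.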
Finally, the reading operation produces a "summary" of the top of the stack by computing a weighted sum of all the vectors on the stack:

$$\mathbf{r}_t = \sum_{i=1}^{t} \min(s_t[i], \rho_t[i]) \cdot V_t[i].$$

The weights $\rho_t[i]$ are computed in a manner similar to the $u_t[i]$s. The sum should include the top elements of the stack whose strengths add up to the read strength $r_t$. The weight $\rho_t[t]$ assigned to the top item is initialized to the full read strength $r_t$, while the weights $\rho_t[i]$ assigned to lower items are based on the "remaining" read strength $\rho_t[i+1] - s_t[i+1]$ after strength has been assigned to higher items:

$$\rho_t[i] = \begin{cases} r_t, & i = t \\ \max(0, \rho_t[i+1] - s_t[i+1]), & i < t. \end{cases}$$
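The pop, push, and read operations described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stack_step` and its argument names are hypothetical, the stack is stored bottom-first as rows of an array, and the loops favor readability over speed.

```python
import numpy as np

def stack_step(V, s, v, u, d, r):
    """One pop-push-read step of a differentiable stack (illustrative sketch).

    V: (n, m) array of stack vectors, bottom item first.
    s: (n,) array of strengths, one per stack item.
    v: (m,) vector to push; u, d, r: pop, push, and read strengths (scalars).
    Returns the updated V, the updated s, and the read vector.
    """
    s = s.copy()

    # Pop: consume up to u units of strength, starting from the top item.
    remaining = u
    for i in range(len(s) - 1, -1, -1):
        removed = min(s[i], remaining)
        s[i] -= removed            # strength never falls below 0
        remaining -= removed       # what is left for lower items

    # Push: place v on top of the stack with strength d.
    V = np.vstack([V, v[None, :]])
    s = np.append(s, d)

    # Read: weighted sum of the top items whose strengths add up to r.
    read = np.zeros(V.shape[1])
    remaining = r
    for i in range(len(s) - 1, -1, -1):
        weight = min(s[i], remaining)
        read += weight * V[i]
        remaining -= weight

    return V, s, read

# Usage: start from an empty stack holding 2-dimensional vectors.
V, s = np.zeros((0, 2)), np.zeros(0)
V, s, read = stack_step(V, s, np.array([1.0, 0.0]), u=0.0, d=1.0, r=1.0)
V, s, read = stack_step(V, s, np.array([0.0, 1.0]), u=0.5, d=0.4, r=1.0)
```

In the second step, half of the first item's strength is popped before the new item is pushed with strength 0.4, so a read of strength 1 draws on both items.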

Stack Interface
The architecture of Grefenstette et al. (2015) assumes that the controller is a neural network that maps the previous state $h_{t-1}$, the input $x_t$, and the previous read vector $\mathbf{r}_{t-1}$ to a new state $h_t$ and an output vector $o_t$. Here $h_t$ is the state of the controller at time $t$, $x_t$ is its input, $\mathbf{r}_{t-1}$ is the vector read from the stack at the previous step, and $o_t$ is an output vector used to produce the network output $y_t$ and the stack instructions $v_t$, $u_t$, $d_t$, $r_t$. The stack instructions are computed as follows. The read strength $r_t$ is fixed to 1. The other values are determined by passing $o_t$ to specialized layers. The vectors $y_t$ and $v_t$ are computed using a tanh layer, while the scalar values $u_t$ and $d_t$ are obtained from a sigmoid layer. Thus, the push and pop strengths are constrained to values between 0 and 1.

$$\begin{aligned} y_t &= \tanh(W_y o_t + b_y), & v_t &= \tanh(W_v o_t + b_v), \\ u_t &= \sigma(W_u o_t + b_u), & d_t &= \sigma(W_d o_t + b_d), \end{aligned} \tag{1}$$

where each $W$ and $b$ denotes the weight matrix and bias of the corresponding layer.

This paper departs from Grefenstette et al.'s architecture by allowing push, pop, and read operations to be executed with variable strengths that may exceed 1. We achieve this by using an enhanced control interface inspired by Yogatama et al.'s (2018) Multipop Adaptive Computation Stack architecture. In that model, the controller determines how much weight to pop from the stack at each time step by computing a distribution $P[u]$ describing the probability of popping $u$ units from the stack. The next stack state $V$ is computed as a superposition of the possible stack states $V_u$ resulting from popping $u$ units from the stack, weighted by $P[u]$. Our model follows Yogatama et al. in computing probability distributions over possible values of $u_t$, $d_t$, and $r_t$. However, instead of superimposing stack states, which may hinder interpretability, we simply set the value of each instruction to be the expected value of its associated distribution. For a distribution vector $p = \langle p_0, p_1, \dots, p_k \rangle$, define the operator $E[p]$ as follows:

$$E[p] = \sum_{i=0}^{k} i \cdot p_i.$$

That is, $E[p]$ denotes the expected value of $p$ if we treat it as a distribution over $\{0, 1, \dots, k\}$. The maximum value $k$ is fixed in advance as a hyperparameter of our model. The output $y_t$ and instructions $v_t$, $u_t$, $d_t$, and $r_t$ are then computed from $o_t$, with each of $u_t$, $d_t$, and $r_t$ set to the expected value $E[p]$ of its associated distribution.

The full architecture that we used for language modeling and agreement classification is a controller network which, at time $t$, reads the word $x_t$ as well as the previous stack summary $\mathbf{r}_{t-1}$. These vectors are passed through an LSTM layer to produce the vector $o_t$. Then, instructions for the stack are computed from $o_t$ as described above. Finally, these instructions are executed to modify the stack state and produce the next stack summary vector $\mathbf{r}_t$. In our experiments, the size of the LSTM layer was 100, and the size of each stack vector was 16.
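The expected-value operator over $\{0, 1, \dots, k\}$ can be illustrated with a short sketch. The text does not say how the controller produces each distribution, so the softmax over $k + 1$ scores used here is an assumption, and the function names are hypothetical.

```python
import numpy as np

def expected_value(p):
    """E[p]: expected value of p viewed as a distribution over {0, 1, ..., k}."""
    p = np.asarray(p, dtype=float)
    return float(np.dot(np.arange(len(p)), p))

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def instruction_strength(scores):
    """Map k + 1 controller scores to a strength in [0, k].

    ASSUMPTION: the distribution is a softmax over the scores; the source
    only states that a distribution over possible values is computed.
    """
    return expected_value(softmax(scores))

# With k = 4, uniform scores give the midpoint of the range.
strength = instruction_strength(np.zeros(5))
```

Because the result is an expectation, the strength varies continuously between 0 and $k$ even though the underlying distribution is over integer values.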
This paper considers models trained on a language modeling objective and a classification objective. On each objective, we train several neural stack models along with an LSTM baseline. This section describes the procedure used to train our models and presents the perplexity and classification values they attain on their training objectives.

Data and Training
Our models are trained using the Wikipedia corpus, a subset of the English Wikipedia used by Linzen et al. (2016) for their experiments. The classification task we consider is the number prediction task, proposed by Linzen et al. (2016) as a diagnostic for assessing whether or not LSTMs can infer grammatical dependencies sensitive to syntactic structure. In this task, the network is shown a sequence of words forming the beginning of a sentence from the Wikipedia corpus. The next word in the sentence is always a verb, and the network must predict whether the verb is singular (SG) or plural (PL). For example, on input "The cats on the boat", the network must predict PL to match "cats". We train and evaluate our models on the number prediction task using Linzen et al.'s (2016) simple dependency dataset.

We used a model with very few parameters and basic hyperparameter settings. The LSTM hidden state was fixed to a size of 100, while the vectors placed on the stack had size 16. Including the embedding layer, the Wikipedia model had 1,584,255 parameters. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. The language models were trained for five epochs, while the agreement classifiers used an early stopping criterion. In addition to the LSTM baseline, for each task, we trained a stack RNN in which $u_t$ is fixed to 1 and $d_t$ ranges from 0 to $k = 4$, as well as a stack RNN in which $d_t$ is fixed to 1 and $u_t$ ranges from 0 to $k = 4$. Additionally, for the classification task we trained a stack RNN in which $u_t$ ranges from 0 to $k = 4$ and $d_t$ is computed as in Equation 1.

Evaluation
Our language models are evaluated according to two metrics. Firstly, we reserve 10% of the Wikipedia corpus for evaluating test perplexity of the trained language models. Secondly, as a simple diagnostic of sensitivity to syntactic structure, we evaluate the performance of our Wikipedia-trained language models on number agreement prediction (Linzen et al., 2016). Under this evaluation regime, we use our language model to simulate the number prediction task and compute the resulting classification accuracy. We do this by presenting the model with an input for the number prediction task and comparing the probabilities it assigns to the singular and plural forms of the verb that follows the input in the Wikipedia corpus. For example, if "The cats on the boat purr" appears in the Wikipedia corpus, then we present "The cats on the boat" to the language model and compare the probabilities assigned to the singular and plural forms "purrs" and "purr", respectively. We consider the language model to make a correct prediction if the form of the next lexical item with the correct grammatical number (SG or PL) is predicted with greater probability than the alternative.
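The comparison described above reduces to a simple rule once next-word probabilities are available. The sketch below assumes a hypothetical interface in which the language model's next-word distribution is given as a dictionary from word forms to probabilities; the function names and the example probabilities are illustrative, not taken from the experiments.

```python
from typing import Dict, Iterable, Tuple

def agreement_correct(next_word_probs: Dict[str, float],
                      correct_form: str,
                      wrong_form: str) -> bool:
    """True if the verb form with the correct number is the more probable one."""
    p_correct = next_word_probs.get(correct_form, 0.0)
    p_wrong = next_word_probs.get(wrong_form, 0.0)
    return p_correct > p_wrong

def agreement_accuracy(
    items: Iterable[Tuple[Dict[str, float], str, str]]
) -> float:
    """Fraction of (distribution, correct form, wrong form) items judged correct."""
    items = list(items)
    hits = sum(agreement_correct(p, c, w) for p, c, w in items)
    return hits / len(items)

# Hypothetical distribution after reading "The cats on the boat".
probs = {"purr": 0.03, "purrs": 0.01}
is_correct = agreement_correct(probs, correct_form="purr", wrong_form="purrs")
```

Only the relative order of the two probabilities matters, so the language model is never penalized for assigning most of its probability mass to unrelated words.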
The number prediction classifiers we trained are evaluated according to classification accuracy. For each input sentence, we define the attractors of the input to be the nouns intervening between the subject and the verb whose number is being classified. For example, in the input "The cat on the boat", "cat" is the subject of the following verb, while "boat" is an attractor. We compute the accuracy of our classifiers on the full testing set of the simple dependency dataset as well as subsets of the testing set consisting of sentences with a fixed number of attractors.
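Breaking accuracy down by attractor count amounts to grouping test items before averaging. The following sketch assumes each test item is available as a hypothetical `(n_attractors, predicted, gold)` record; the record layout and function name are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_attractors(
    records: Iterable[Tuple[int, str, str]]
) -> Dict[int, float]:
    """Classification accuracy for each number of attractors.

    Each record is (n_attractors, predicted_label, gold_label),
    with labels such as "SG" and "PL".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for n_attractors, predicted, gold in records:
        totals[n_attractors] += 1
        hits[n_attractors] += int(predicted == gold)
    return {n: hits[n] / totals[n] for n in sorted(totals)}

# Hypothetical predictions on four test items.
records = [(0, "SG", "SG"), (0, "PL", "PL"), (2, "SG", "PL"), (2, "PL", "PL")]
per_count = accuracy_by_attractors(records)
```

Reporting the per-count accuracies alongside the overall figure matters because sentences without attractors dominate naturally occurring text and can mask errors on harder cases.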

Training Results
Table 1 shows the quantitative results for our language models. The stack RNN is comparable to our LSTM baseline in terms of language modeling perplexity and agreement prediction accuracy when $u_t$ is fixed to 1, though the LSTM baseline performs slightly better according to both metrics.
The stack RNN attains a markedly worse perplexity when $d_t$ is fixed to 1, and its agreement prediction accuracy is worse than that of both the LSTM baseline and the stack RNN with $u_t = 1$.
Table 2 shows test accuracies attained by classifiers trained on the number prediction task. While the stack classifier with $u_t$ fixed to 1 and the LSTM baseline achieve the best overall accuracy, the stack with unrestricted $u_t$ and sigmoid $d_t$ and the stack with $d_t$ fixed to 1 exceed the baseline on sentences with at least 2 attractors. We take this to suggest that the hierarchical bias provided by the stack can improve performance on syntactically complex cases.

Interpreting Stack Usage
The results presented in Subsection 3.3 show that the $u_t = 1$ stack RNNs perform comparably to LSTMs in terms of quantitative evaluation metrics. The goal of this section is to assess whether or not stack RNNs achieve this level of performance in an interpretable manner. We do this by visualizing the push and read strengths computed by the $u_t = 1$ language model when processing two example sentences. These visualizations are shown in Figure 2 and Figure 3. Notice that the push strength tends to spike following words with subcategorization requirements. For example, the preposition "in" and the transitive verbs "eat" and "is" all require NP objects, and accordingly the model assigns a high push strength to these words. This suggests that the model is using the stack to capture hierarchical dependencies by keeping track of words that predictably introduce various kinds of phrases.
Figure 4 shows push strengths computed by the $u_t = 1$ language model, aggregated across the entire Wikipedia corpus. We see that push strengths differ systematically based on part of speech. The distribution of push strengths computed by the network upon seeing a noun is tightly concentrated around 0.5, whereas the push strength upon seeing a verb tends to be greater, usually more than 2.5. This phenomenon reflects the fact that verbs typically take objects while nouns do not.
We also find that push strengths assigned to verbs depend on their transitivity. The right panel of Figure 4 shows push strength distributions for a collection of common transitive and intransitive verbs. The model distinguishes between these two types of verbs by assigning high push strengths to transitive verbs and low push strengths to intransitive verbs. We make similar observations for other parts of speech: prepositions, which take objects, typically receive higher push strengths, while determiners and adjectives, which do not take phrasal complements, receive lower push strengths.

Inference of Syntactic Structure
Section 4 has shown that the push strengths $d_t$ computed by the $u_t = 1$ language model reflect the subcategorization requirements of the words encountered by the network. Based on this phenomenon, we may interpret the stack to be keeping track of phrases that are "in progress." A high push strength induced by a transitive verb, for example, may be thought to indicate that a verb phrase has begun, and that this phrase ends when the object of the verb is seen. We thus hypothesize that for each time step $t$, $d_t$ represents the size of the phrase that begins with the word read by the network at time $t$. If $d_t$ is low, then this phrase consists of a single word; if it is high, then this is a complex phrase consisting of multiple words.

Figure 3: Distributions for push and read strengths at each step of processing example sentences. For example, the push strength chosen after processing "the" (0.46) is the expected value of the blue distribution in the far left plot.

[Figure 4 panels: All Words; Nouns and Verbs; Transitive vs. Intransitive Verbs.]
A similar intuition underlies the unsupervised parsing framework of Shen et al. (2018a,b). Under this framework, constituency structure is induced from a sequence of words by computing a syntactic distance between every two adjacent words. Intuitively, the syntactic distance between two words measures the distance from the lowest common parent node of the two words to the bottom of the tree. If two words have a low syntactic distance, then they are likely siblings in a small constituent; if they have a high syntactic distance, then they probably belong to different phrases. Whereas Figure 2 and Figure 3 allow us to identify specific time steps at which the stack recognizes the beginning of a phrase, the unsupervised parsing framework allows us to explicitly visualize the phrasal organization of input sequences induced by our interpretation of the push strengths.
Given an input sequence $x_1, x_2, \dots, x_n$, we define the syntactic distance between each $x_t$ and $x_{t-1}$ for our $u_t = 1$ model to be the push strength $d_t$ computed by the controller at time $t$. If the current word does not open any new constituents, then it belongs to the same constituent as the previous word, and therefore should be assigned a low syntactic distance. On the other hand, if the current word opens a complex constituent, then it is lower in the parse tree than the previous word, and therefore should be assigned a high syntactic distance. Similarly, for our $d_t = 1$ model, we let $u_t$ be the syntactic distance between $x_t$ and $x_{t+1}$. Under this interpretation, the pop strength estimates the complexity of the constituents that the current word closes. If the current word closes many complex constituents, then the next word appears at a higher level in the parse tree, and is therefore syntactically distant from the current word.

Algorithm 1: Recursive construction of a binary parse tree from a word sequence X and syntactic distances d. The base case applies when X has at most one word.

Algorithm 1 shows our procedure for constructing trees. The algorithm takes as input a sequence of words arranged into a matrix X and a vector d containing the syntactic distance between each word and the previous word. Following Shen et al. (2018a,b), we recursively split X into binary constituents. At each recursion level, we greedily choose the word with the highest syntactic distance as the split point. The final output is a binary tree spanning the full sentence.
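The greedy splitting procedure can be sketched as a short recursive function. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: words are plain strings instead of rows of a matrix, the split is placed immediately before the word with the highest distance, the distance of the first word in a span is ignored because it has no predecessor inside that span, and ties are broken in favor of the leftmost candidate. The name `build_tree` is hypothetical.

```python
from typing import Sequence, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]

def build_tree(words: Sequence[str], dist: Sequence[float]) -> Tree:
    """Greedily build a binary tree from syntactic distances.

    dist[i] is the syntactic distance between words[i] and words[i - 1];
    dist[0] is ignored (ASSUMPTION: the first word has no predecessor).
    """
    if len(words) == 0 or len(words) != len(dist):
        raise ValueError("words and dist must be non-empty and equally long")
    if len(words) == 1:
        return words[0]              # base case: a single word is a leaf
    # Split before the word with the highest distance (leftmost on ties).
    split = max(range(1, len(words)), key=lambda i: dist[i])
    left = build_tree(words[:split], dist[:split])
    right = build_tree(words[split:], dist[split:])
    return (left, right)

# Hypothetical distances: a large value before "purr" separates the
# subject from the verb.
tree = build_tree(["the", "cats", "purr"], [0.0, 0.5, 2.0])
```

Under these tie-breaking assumptions, a sequence of equal distances always splits after the first word, which yields a purely right-branching tree.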

Evaluation
We compute F1 scores for the parses obtained from our Wikipedia language models by comparing against parses from Section 23 of the Penn Treebank's Wall Street Journal corpus (WSJ23, Marcus et al., 1994). Since Algorithm 1 produces unlabeled binary trees, our evaluation uses the gold standard of Htut et al. (2018), which consists of unlabeled, binarized versions of the WSJ23 trees. We also decapitalize the first word of every sentence for compatibility with our training data.
As a baseline, we compare the F1 scores attained by our models to those computed for purely right- and left-branching trees. A right-branching parse is equivalent to the output of Algorithm 1 on a sequence of equal syntactic distances. Thus, the difference between the right-branching F1 score and our models' scores is a measure of the amount of syntactic information encoded by the push and pop strength sequences. We also compare our F1 scores to the results of Htut et al.'s (2018) replication study for the parsing-reading-predict network models (PRPN-LM and PRPN-UP), the two syntactic-distance-based unsupervised parsers originally proposed by Shen et al. (2018a).

Table 3: Unsupervised parsing performance evaluated on the WSJ23 dataset, attained by our stack models (top), the right- and left-branching baselines (middle), and the PRPN models (bottom).

Model                               F1
... (Htut et al., 2018)             26.3
Best PRPN-LM (Htut et al., 2018)    37.4
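To make the comparison against branching baselines concrete, the sketch below computes an unlabeled bracketing F1 between two binary trees and builds a right-branching tree. It is an assumption-laden illustration: trees are nested pairs of strings, every internal node (including the root span) counts as a bracket, and no punctuation handling is applied. Published evaluation scripts differ on these conventions, so scores from this sketch are not comparable to those in Table 3. All function names are hypothetical.

```python
def spans(tree, start=0):
    """Return (end, set of (start, end) spans) for the internal nodes of a tree."""
    if isinstance(tree, str):
        return start + 1, set()          # a leaf covers one word, no bracket
    left, right = tree
    mid, left_spans = spans(left, start)
    end, right_spans = spans(right, mid)
    return end, left_spans | right_spans | {(start, end)}

def unlabeled_f1(predicted, gold):
    """Unlabeled bracketing F1 between two trees over the same sentence."""
    _, pred_spans = spans(predicted)
    _, gold_spans = spans(gold)
    overlap = len(pred_spans & gold_spans)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_spans)
    recall = overlap / len(gold_spans)
    return 2 * precision * recall / (precision + recall)

def right_branching(words):
    """Purely right-branching binary tree over a non-empty list of words."""
    tree = words[-1]
    for word in reversed(words[:-1]):
        tree = (word, tree)
    return tree

# Hypothetical gold tree versus the right-branching baseline.
gold = (("the", "cats"), "purr")
baseline = right_branching(["the", "cats", "purr"])
score = unlabeled_f1(baseline, gold)
```

In this toy case the baseline shares only the root span with the gold tree, which shows why the root bracket convention has a visible effect on short sentences.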

Results
The F1 evaluation (see Table 3) shows that our Wikipedia model with $u_t = 1$ significantly outperforms the baseline on the Penn Treebank, while our model with $d_t = 1$ performs slightly better than the baseline. This is evidence that the types of hierarchical structures produced by Algorithm 1 resemble expert-annotated constituency parses. Our results do not exceed those of Htut et al.'s (2018) replication study. It is worth noting that our right- and left-branching baseline scores are somewhat lower than theirs. This suggests that differences in data processing or implementation might make our evaluation more difficult. Regardless, we consider our results to still be somewhat competitive, given that our language models were trained on out-of-domain data with few parameters and minimal hyperparameter tuning.
We provide example parses extracted from the stack RNN language models with $u_t = 1$ in Figure 5. Overall, our unsupervised parses tend to resemble the gold-standard parses with some differences. Periods systematically attach lower in our extracted parses than in the gold-standard trees. High attachment would require a high syntactic distance (i.e., high push strength) between the period and the remainder of the sentence. However, the period inherently does not have any subcategorization requirements, so it induces a low push strength. In contrast, prepositional phrases attach higher in our structures than in the gold parses. This may be the result of fixed subcategorization-associated push strengths for prepositions that give rise to fairly high estimates of syntactic distance.

Discussion
Overall, our stack language models show no improvement over the LSTM baseline in terms of perplexity and classification accuracy. Although the $u_t = 1$ language model is comparable to the LSTM on these metrics, it ultimately achieves worse scores than the baseline. However, we have now seen that the pushing behavior of the model reflects subcategorization properties of lexical items that play an important role in determining their syntactic behavior, and that these properties allow reasonable parses to be extracted from this model. These observations indicate that the $u_t = 1$ model has learned to encode structural representations using the stack. Quantitatively, the importance of this structural information for the training objectives can be seen in Table 2, where the stack at least partially alleviates the difficulty experienced by the LSTM classifier in handling syntactically complex inputs.
While our stack language models do not exceed the LSTM baseline in terms of perplexity and agreement accuracy, Yogatama et al. (2018) find that their Multipop Adaptive Computation Stack architecture substantially outperforms a bare LSTM on these metrics. Compared to their models, we use fewer parameters and minimal hyperparameter tuning. Thus, it is possible that increasing the number of parameters in our controller may lead to similar increases in performance in addition to the structural interpretability that we have observed.

Conclusion
The results reported here point to the conclusion that stack RNNs trained on corpora of natural language text do in fact learn to encode sentences in a hierarchically organized fashion. We show that the sequence of stack operations used in the processing of a sentence lets us uncover a syntactic structure that matches standardly assigned structure reasonably well, even if the addition of the stack does not improve the stack RNN's performance over the LSTM baseline in terms of the language modeling objective. We also find that using the stack RNN to predict the grammatical number of a verb results in better hierarchical generalizations in syntactically complex cases than is possible with stackless models. Taken together, these results suggest that the stack RNN model yields comparable performance to other architectures, while producing structural representations that are easier to interpret and that show signs of being linguistically natural.

Figure 2: Push and read strengths computed by the $u_t = 1$ language model. Values underneath each word show the total strength remaining on the stack at that step.

Figure 4: Distributions of $d_t$ for the $u_t = 1$ language model over all test sentences. The center panel shows the distributions of $d_t$ for nouns and verbs, and the right panel shows the distributions for selected transitive and intransitive verbs.
The simple dependency dataset of Linzen et al. (2016) contains 141,948 training examples, 15,772 validation examples, and 1,419,491 testing examples.

Table 1: Results for language models trained on the Wikipedia dataset.

Table 2: Number prediction accuracies attained by the three stack RNN classifiers and the LSTM baseline.