Unsupervised Labeled Parsing with Deep Inside-Outside Recursive Autoencoders

Understanding text often requires identifying meaningful constituent spans such as noun phrases and verb phrases. In this work, we show that we can effectively recover these types of labels using the learned phrase vectors from deep inside-outside recursive autoencoders (DIORA). Specifically, we cluster span representations to induce span labels. Additionally, we improve the model's labeling accuracy by integrating latent code learning into the training procedure. We evaluate this approach empirically through unsupervised labeled constituency parsing. Our method outperforms ELMo and BERT on two versions of the Wall Street Journal (WSJ) dataset and is competitive with prior work that requires additional human annotations, improving over a previous state-of-the-art system that depends on ground-truth part-of-speech tags by 5 absolute F1 points (19% relative error reduction).

In this paper, we instead focus on labeled constituency parsing for English. The small number of previous works in this area suffer from substantial weaknesses: 1) the models depend on ground-truth part-of-speech tags, which are not always available and are known to boost constituency parsing scores (Kitaev and Klein, 2018), 2) none can simultaneously identify and label constituents (instead they typically depend on an external latent parser), and 3) they ignore sentences longer than ten tokens because previous latent parsers do not scale to longer sentences (Haghighi and Klein, 2006; Borensztajn and Zuidema, 2007; Reichart and Rappoport, 2008). Unlike previous work, we achieve strong results in unsupervised labeled constituency parsing using a single model for both bracketing and labeling. Our approach relies on clustering span representations, which are fixed-length continuous vectors learned end-to-end using DIORA and do not require external resources such as part-of-speech tags. Furthermore, we enhance the DIORA architecture with latent codes: the model learns a distribution over these codes that loosely aligns with the ground-truth assignment of phrase types and, more importantly, improves the quality of the clusters.

[Figure 1: The DIORA architecture (Drozdov et al., 2019). We are interested in clustering the learned vectors a(i, j) such that each span may be mapped to a phrase type. To enhance this clustering-based approach, we augment the DIORA architecture with latent codes, shown in the right half of the figure.]
Our codebook-enhanced DIORA architecture outperforms DIORA and achieves a new state of the art of 76.7 F1 on WSJ-10 when labeling a gold bracketing (19% relative error reduction over the previous best model, Haghighi and Klein 2006, which unlike our approach uses gold part-of-speech tags). Furthermore, we show DIORA is competitive when a ground-truth bracketing is not provided and instead must be induced. On the full WSJ test set, DIORA outperforms two strong baselines, ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019). We analyze the clustered constituents and observe that they are separated syntactically (e.g. past-tense vs. present-participle verbs) and semantically (e.g. time-related phrases vs. references to people).

DIORA: Deep Inside-Outside Recursive Autoencoders

DIORA is a recursive autoencoder that learns to reconstruct an input sentence. A fundamental step in the reconstruction is to build a chart using the inside-outside algorithm (Baker, 1979), which represents a soft weighting over all possible binary trees of the input sentence. For the full model details, we refer the reader to Drozdov et al. (2019). For this work, it is key to understand two capabilities that DIORA provides: each span in a sentence is represented as a vector, and DIORA induces a maximally likely binary tree for the sentence. We can directly label the constituents of a sentence by clustering the learned span vectors from DIORA and assigning a label to each cluster. DIORA's autoencoder objective incentivizes the model to learn representations that compress the sentence well in order to reconstruct the input, leading to the discovery of syntactic structure.
To encourage phrase representations to be easily clusterable into a small set of phrase types, we add an additional component to DIORA that forces phrase vectors to be representable by a small number of latent codes. Recent models have integrated ideas from vector quantization into variational autoencoders (Kingma and Welling, 2013) and key-value memory layers (Lample and Conneau, 2019), forcing the model to compress inputs into a single discrete latent embedding (van den Oord et al., 2017; Kaiser et al., 2018). Given a trained model, one could then assign labels to each of the latent variables and use them to label inputs directly.
We instead use a less restrictive modeling approach by assigning each input to a soft weighting over the K latent embeddings. This is similar to soft EM training, and can be thought of as analogous to fuzzy/soft K-means clustering (Dunn, 1974; Bezdek, 1981) rather than hard K-means clustering.
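To make the contrast concrete, the following sketch compares a hard K-means-style assignment with a soft weighting over K centroids; the distance-based softmax and the `temperature` parameter are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_assign(x, centroids):
    """Hard K-means style: the index of the single nearest centroid."""
    d = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(d))

def soft_assign(x, centroids, temperature=1.0):
    """Soft assignment: a distribution over all K centroids,
    weighted by negative distance (closer = more mass)."""
    d = np.linalg.norm(centroids - x, axis=1)
    return softmax(-d / temperature)

centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
x = np.array([1.0, 0.5])

cluster = hard_assign(x, centroids)        # a single index
weights = soft_assign(x, centroids)        # a distribution summing to 1
```

The soft variant keeps gradient information flowing to every code during training, which is one reason it is a natural fit inside an end-to-end model.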
Implementation and training details for our model are described in Appendix A.1.

DIORA with Codebook
DIORA is constrained to binary trees, and its composition is represented as:

$$a(i, j) = \mathrm{Compose}(\bar{a}(i), \bar{a}(j)) \qquad (1)$$

where $i$ and $j$ are neighboring spans, $\bar{a}$ is the summary vector over all possible parses of a span, and Compose is a function such as a Tree-LSTM or multi-layer perceptron.
To add the latent codebook into the model, we modify Eq. 1 to combine each constituent vector with a weighted summation over the latent codes:

$$\mathrm{CB}(x) = x + \mathrm{softmax}(CWx)^{\top} C \qquad (2)$$

where $C \in \mathbb{R}^{N \times M}$ is the codebook, $x \in \mathbb{R}^{M}$ is a constituent vector, and $W$ is a bilinear matrix used to compute the affinity between the constituent vector and the latent codes. One way to think of this equation is that each code (row in $C$) is a centroid, and the vector of affinity scores, $\mathrm{softmax}(CWx)$, is a soft assignment of the constituent vector over the latent codes. The modified DIORA composition when incorporating the codebook is:

$$a(i, j) = \mathrm{CB}(\mathrm{Compose}(\bar{a}(i), \bar{a}(j))) \qquad (3)$$

This codebook-enhanced architecture is visually depicted in Fig. 1. We use 70 codes when training this model (representing the 25 phrase types and 45 part-of-speech types, and ignoring the ROOT label), although we explore different configurations in §4.4.
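A minimal numpy sketch of the codebook lookup follows. The softmax normalization of the affinity scores and the residual combination `x + z @ C` are our reading of "combine each constituent vector with a weighted summation over latent codes"; the paper's exact combination may differ, and the sizes here are toy values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def codebook(x, C, W):
    """Sketch of the codebook lookup: affinity scores CWx are
    softmax-normalized into a soft assignment z over the N codes,
    then the constituent is combined with a convex combination of
    the codes (rows of C)."""
    z = softmax(C @ W @ x)   # shape (N,): soft assignment over codes
    return x + z @ C, z      # combined vector (M,), scores (N,)

N, M = 4, 6                  # toy sizes; the paper trains with up to 70 codes
rng = np.random.default_rng(0)
C = rng.normal(size=(N, M))  # codebook: N codes of dimension M
W = rng.normal(size=(M, M))  # bilinear affinity matrix
x = rng.normal(size=M)       # a constituent vector (e.g. a Compose output)

combined, scores = codebook(x, C, W)
```

Because `z` is a distribution rather than a one-hot choice, the whole lookup stays differentiable and can be trained jointly with the rest of DIORA.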

Unsupervised Labeled Parsing
We perform unsupervised labeled constituency parsing with a multi-step approach.
1. Tree assignment. Assign a tree to each input sentence whose leaves are the words of the sentence. The tree is not labeled. It may be derived from the ground-truth parse or induced using DIORA. When induced, we extract a binary tree by running the CKY algorithm over DIORA's learned compatibility scores.

2. Vector assignment. Assign the corresponding span vector to each constituent in these trees over the entire dataset. For DIORA without the codebook, this is the concatenation of the inside and outside vectors. When using the codebook, this is one of two options: the same as for DIORA, except using the output of Eq. 2, or the soft score assignment of the codebook (CWx). The first option is referred to as DIORA_CB and the soft score assignment as DIORA*_CB.

3. Cluster and label assignment. Cluster the collection of constituent vectors using K centroids learned with K-means. Finally, we use the ground-truth phrase labels to assign each cluster to a phrase type: each constituent is mapped to the most common label within its cluster. We set K equal to the number of distinct phrase types in order to match previous work.
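The cluster-and-label step can be sketched as follows. The span vectors, labels, and the tiny Lloyd's-iteration K-means here are toy stand-ins for illustration; the actual span vectors come from DIORA.

```python
import numpy as np
from collections import Counter

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's K-means: returns a cluster index per row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = X[assign == c].mean(0)
    return assign

def label_clusters(assignments, gold_labels):
    """Map each cluster id to the most common gold label inside it."""
    mapping = {}
    for c in set(assignments):
        labels = [g for a, g in zip(assignments, gold_labels) if a == c]
        mapping[c] = Counter(labels).most_common(1)[0][0]
    return mapping

# toy span vectors: two well-separated groups of constituents
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
gold = ["NP", "NP", "VP", "VP"]

assign = kmeans(X, k=2)
mapping = label_clusters(list(assign), gold)
pred = [mapping[a] for a in assign]
```

Note that the gold labels are used only for the final cluster-to-type mapping, not for learning the vectors or the clusters, which is what keeps the parsing itself unsupervised.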

DIORA
We compare multiple configurations of DIORA. The first is the original model DIORA, using the concatenation of the inside and outside vectors to represent a phrase. We also look at the codebook-enhanced architecture DIORA_CB, and when clustering the codebook scores we refer to the model as DIORA*_CB.

Baselines
While ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) were not designed for parsing, their contextualized representations can be used for unsupervised labeled parsing. When a reference parse is provided, it is only necessary to derive ad-hoc phrase vectors from the contextualized token vectors of these models. Peters et al. (2018b) describe an effective way to do so for ELMo, which involves concatenating the token vectors at the beginning and end of the phrase. For BERT, it is critical to look at all layers, as lower layers tend to be more syntactic in nature (Tenney et al., 2019). For both models, we report the max F1 and mean F1, and for the Induced evaluation we use the parses extracted from DIORA.

The approach of Borensztajn and Zuidema (2007) is evaluated using more than K clusters (where K is the size of the tag set) by mapping ground-truth labels to induced labels, and is therefore not strictly comparable to the other results. Neither BMM (Reichart and Rappoport, 2008) nor Proto (Haghighi and Klein, 2006) is effective at inducing unlabeled structure, so both depend on an external latent parser for the Induced evaluation: either CCM (Klein and Manning, 2002) or CCL (Seginer, 2007). ELMo and BERT do not induce structure at all and depend on DIORA for the Induced evaluation. ELMo_CI uses only the context-insensitive character embeddings produced by ELMo.
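The phrase-vector construction of Peters et al. (2018b) reduces to a simple concatenation; the toy token matrix below stands in for real contextualized embeddings.

```python
import numpy as np

def phrase_vector(token_vecs, i, j):
    """Represent the phrase spanning tokens i..j by concatenating the
    contextualized vectors of its first and last tokens, following
    Peters et al. (2018b)."""
    return np.concatenate([token_vecs[i], token_vecs[j]])

# toy "contextualized" token vectors: 4 tokens, 3 dimensions each
tokens = np.arange(12, dtype=float).reshape(4, 3)
span_vec = phrase_vector(tokens, 1, 3)   # phrase covering tokens 1..3
```

The resulting vector has twice the token dimension, which is why these ad-hoc phrase vectors are comparable in spirit to DIORA's concatenated inside/outside representation.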

WSJ
Unsupervised constituency parsing has often been evaluated on different splits of the WSJ. For labeled constituency parsing, models that produce binary trees as output have a performance ceiling on this n-ary data: unary chains limit recall (see footnote 5) and more-than-binary nodes limit precision. In some cases, an unlabeled tree structure over a sentence can be readily accessed. The algorithm described in §3 is robust to this case: simply replace the first step with the ground-truth parse. We evaluate our model using the ground-truth parse (Gold) and when inducing a parse (Induced). These results, comparisons to baseline methods, and an upper bound on binary-tree performance are shown in Tables 1 and 2.
The Upper Bound in the Induced column of these tables represents a perfect labeling of the most accurate induced binary tree from DIORA, and the Majority (NP) row is the same tree labeled with the most common tag.
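The recall ceiling imposed by binarization can be made concrete with a small labeled-F1 sketch. The scoring convention (exact match of (start, end, label) triples, counted as a multiset) is a standard choice, and the example spans are invented for illustration.

```python
from collections import Counter

def labeled_f1(pred_spans, gold_spans):
    """Labeled span F1: a predicted (start, end, label) triple counts
    as correct only if it exactly matches a gold triple."""
    p, g = Counter(pred_spans), Counter(gold_spans)
    tp = sum((p & g).values())            # multiset intersection
    precision = tp / max(sum(p.values()), 1)
    recall = tp / max(sum(g.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

# Gold contains a unary chain NP-QP over span (0, 2); a binary tree can
# predict at most one label per span, so recall is capped below 1 even
# with a perfect labeler.
gold = [(0, 2, "QP"), (0, 2, "NP"), (2, 4, "VP")]
pred = [(0, 2, "NP"), (2, 4, "VP")]
precision, recall, f1 = labeled_f1(pred, gold)
```

Here precision is perfect, yet recall cannot exceed 2/3, mirroring the unary-chain ceiling discussed above.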

Model Ablations
As an alternative to clustering the constituent vectors with K-means, one can treat the codebook affinity scores, (CWx), as a soft assignment over the clusters represented by each code. To examine this alternative, we replace K-means in the algorithm from §3 with the argmax over the affinity scores. A model trained with 25 codes (see footnote 6) achieves greater than 60% recall at labeling the ground-truth trees of WSJ-10, indicating that the codes capture some syntactic patterns, although not as effectively as when using K-means.
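This K-means-free variant amounts to a single argmax per constituent. In the sketch below, the identity codebook and bilinear matrix, and the code-to-label mapping, are all toy assumptions; in practice the mapping comes from the majority gold label per code.

```python
import numpy as np

def code_label(x, C, W, code_to_label):
    """Label a constituent directly by the argmax over the codebook
    affinity scores CWx, instead of clustering with K-means."""
    scores = C @ W @ x
    return code_to_label[int(np.argmax(scores))]

# toy setup: identity codebook/bilinear matrix, so the affinity is just x
C = np.eye(3)
W = np.eye(3)
code_to_label = {0: "NP", 1: "VP", 2: "PP"}   # hypothetical mapping
x = np.array([0.1, 0.9, 0.2])                 # most aligned with code 1

label = code_label(x, C, W, code_to_label)
```

This avoids a separate clustering pass entirely, at the cost of tying the number of effective clusters to the number of trained codes.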
Given these results, we are curious to see how model performance changes as the number of codes varies. We train codebook DIORA with {25, 70, 100, 200, 300, 400} codes and evaluate each configuration using the procedure from §3 on the WSJ validation set. We compare the performance to non-codebook DIORA trained with {2, 3, 4, 5} layers (see footnote 7). Results are shown in Fig. 2.

Qualitative Analysis of Clusters
We investigate phrase clusters from a single experiment (DIORA_CB on WSJ-10), which are assigned to 9 NP, 5 VP, 5 S, 4 PP, 1 ADJP, and 1 QP, according to the majority gold labels in each cluster. These 6 assigned phrase types correspond to the 6 most frequent labels. We find some semantic properties are evident in the clusters. For example, a 100% correct NP cluster (all phrases in this cluster have gold label NP) consists entirely of possessive NPs. One of the NP clusters consists of NPs that are mostly related to time (15 minutes, last year, this fall); even the incorrectly labeled phrases are time-related, such as the ADVPs "no longer" and "so far". Another NP cluster identifies people, which includes "ms. parks 's mother" but excludes "mr. noriega 's proposal", even though both phrases have the same part-of-speech tag sequence [NNP NNP POS NN].

[Footnote 5] Labeled parsing is usually evaluated on whether a span has the correct label. An NP prediction for a span would be correct if there is an NP-QP or QP-NP unary chain over this span. A binary tree could only ever get one of QP or NP correct in this case, hence limiting recall.
[Footnote 6] We use 25 codes here instead of 70 so that the model may be fairly compared with previous systems.
[Footnote 7] Elsewhere in this paper, DIORA uses two layers.
One of the five VP clusters consists entirely of phrases of the form to + verb: 7 out of its 11 mislabeled cases contain "to" in the phrase, for example "not to mention" (CONJP). The four other VP clusters exhibit some degree of tense and singular/plural distinctions. A bar chart showing the finer-grained properties of the VP clusters is shown in Fig. 3. Cluster 0 includes the majority of VBZ and MD (will, won't, can, could), Cluster 1 is mainly composed of past-tense VPs (VBD), Cluster 2 contains many VBG, and Cluster 3 consists of 86% VBP.
One of the S clusters captures instances of S that do not cover the whole sentence. Another starts with coordinating conjunctions such as "and" or "but", yet another captures phrases beginning with personal pronouns or determiners.

Conclusions
In this paper, we show that DIORA can be used for unsupervised labeled constituency parsing. We also introduce a new codebook-enhanced variant of DIORA that improves labeling performance. Our model outperforms the previous state of the art in unsupervised labeled constituency parsing on the WSJ-10 dataset, even though the previous best system uses ground-truth part-of-speech tags and ours does not, and we report the first results on the full WSJ test set. These results indicate that grammar induction with phrase types is viable using recent neural-network-based models, and our analysis warrants further exploration in this area.