Learning Syntax from Naturally-Occurring Bracketings

Naturally-occurring bracketings, such as answer fragments to natural language questions and hyperlinks on webpages, can reflect human syntactic intuition regarding phrasal boundaries. Their availability and approximate correspondence to syntax make them appealing as distant information sources to incorporate into unsupervised constituency parsing. But they are noisy and incomplete; to address this challenge, we develop a partial-brackets-aware structured ramp loss in learning. Experiments demonstrate that our distantly-supervised models trained on naturally-occurring bracketing data are more accurate in inducing syntactic structures than competing unsupervised systems. On the English WSJ corpus, our models achieve an unlabeled F1 score of 68.9 for constituency parsing.


Introduction
Constituency is a foundational building block for phrase-structure grammars. It captures the notion of which tokens can group together and act as a single unit. The motivating insight behind this paper is that constituency may be reflected in the markups of bracketings that people provide while performing natural tasks. We term these segments naturally-occurring bracketings for their lack of intended syntactic annotation. These include, for example, the segments people pick out from sentences to refer to other Wikipedia pages or to answer semantically-oriented questions; see Figure 1 for an illustration.
Gathering such data requires low annotation expertise and effort. On the other hand, these data are not necessarily suitable for training parsers, as they often contain incomplete, incorrect, and sometimes conflicting bracketing information. It is thus an empirical question whether, and how much, we could learn syntax from these naturally-occurring bracketing data.

Figure 1: Two example types of naturally-occurring bracketings. Blue underlined texts in the Wikipedia sentence are hyperlinks. We bracket the QA-SRL sentence in matching colors according to the answers.
To overcome the challenge of learning from this kind of noisy data, we propose to train discriminative constituency parsers with structured ramp loss (Do et al., 2008), a technique previously adopted in machine translation (Gimpel and Smith, 2012). Specifically, we propose two loss functions to directly penalize predictions in conflict with available partial bracketing data, while allowing the parsers to induce the remaining structures.
We experiment with two types of naturally-occurring bracketing data, as illustrated in Figure 1. First, we consider English question-answer pairs collected for semantic role labeling (QA-SRL; He et al., 2015). The questions are designed for non-experts to specify semantic arguments of predicates in the sentences. We observe that although no syntactic structures are explicitly asked for, humans tend to select constituents in their answers. Second, English Wikipedia articles are typically richly annotated with internal links to other articles. These links are marked on phrasal units that refer to standalone concepts, and similar to the QA-SRL data, they frequently coincide with syntactic constituents.
Experimental results show that naturally-occurring bracketings across both data sources indeed help our models induce syntactic constituency structures. Training on the QA-SRL bracketing data achieves an unlabeled F1 score of 68.9 on the English WSJ corpus, an accuracy competitive with state-of-the-art unsupervised constituency parsers that do not utilize such distant supervision. We find that our two proposed loss functions interact slightly differently with the two data sources, and that the QA-SRL and Wikipedia data have varying coverage of phrasal types, leading to different error profiles.
In sum, through this work, (1) we demonstrate that naturally-occurring bracketings are helpful for inducing syntactic structures, (2) we incorporate two new cost functions into structured ramp loss to train parsers with noisy bracketings, and (3) our distantly-supervised models achieve results competitive with the state of the art in unsupervised constituency parsing despite training on less data (QA-SRL) or out-of-domain data (Wikipedia).

Naturally-Occurring Bracketings
Constituents are naturally reflected in various human cognitive processes, including speech production and perception (Garrett et al., 1966; Gee and Grosjean, 1983), reading behaviors (Hale, 2001; Boston et al., 2008), punctuation marks (Spitkovsky et al., 2011), and keystroke dynamics (Plank, 2016). Conversely, these externalized signals help us gain insight into constituency representations. We consider two such data sources:

a) Answer fragments When questions are answered with fragments instead of full sentences, those fragments tend to form constituents. This phenomenon corresponds to a well-established constituency test in the linguistics literature (Carnie, 2012, pg. 98, inter alia).

b) Webpage hyperlinks Since a hyperlink is a pointer to another location or action (e.g., mailto: links), anchor text often represents a conceptual unit related to the link destination. Indeed, Spitkovsky et al. (2010) first gave empirical evidence that around half of the anchor-text instances in their data respect constituent boundaries, and Søgaard (2017) demonstrates that hyperlink data can help boost chunking accuracy in a multi-task learning setup. Both types of data have been considered in previous work on dependency-grammar induction (Spitkovsky et al., 2010; Naseem and Barzilay, 2011), and in this work, we explore their efficacy for learning constituency structures.
For answer fragments, we use He et al.'s (2015) question-answering-driven semantic role labeling (QA-SRL) dataset, where annotators answer wh-questions regarding predicates in sentences drawn from the Wall Street Journal (WSJ) section of the Penn Treebank (PTB; Marcus et al., 1993). For hyperlinks, we use a 1% sample of the 2020-05-01 English Wikipedia snapshot, retaining only within-Wikipedia links. We compare our extracted naturally-occurring bracketings with reference phrase-structure annotations (for Wikipedia, we use automatic parses; see Appendix A.2); Table 1 gives the relevant statistics. Our results re-affirm Spitkovsky et al.'s (2010) finding that a large proportion of hyperlinks coincide with syntactic constituents. We also find that 22.4%/35.8% of the QA-SRL/Wikipedia bracketings are single-word spans, which cannot facilitate parsing decisions, while 11.8%/5.3% of QA-SRL/Wikipedia spans actually conflict with the reference trees and can thus potentially harm training. The QA-SRL data seems more promising for inducing better-quality syntactic structures, as it has more bracketings available across a diverse set of constituent types.

Parsing Model
Preliminaries The inputs to our learning algorithm are tuples $(w, B)$, where $w = w_1, \ldots, w_n$ is a length-$n$ sentence and $B = \{(b_k, e_k)\}$ is a set of naturally-occurring bracketings, denoted by their beginning and ending indices $b_k$ and $e_k$ into the sentence $w$. As a first step, we extract BERT-based contextualized word representations (Devlin et al., 2019) to associate each token $w_i$ with a vector $x_i$; see Appendix B for details.
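For concreteness, a single training example might look like the following (the tokens are hypothetical, and the 0-indexed, inclusive indexing convention is our assumption):

```python
# One training example (w, B): a tokenized sentence plus a set of
# naturally-occurring bracketings given as (begin, end) token indices.
sentence = ["The", "cat", "sat", "on", "the", "mat"]
bracketings = {(3, 5)}  # e.g., an answer fragment "on the mat"
```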
Scoring Spans Based on the $x_i$ vectors, we assign a score $s_{ij}$ to each candidate span $(i, j)$ in the sentence indicating its appropriateness as a constituent in the output structure. We adopt a biaffine scoring function (Dozat and Manning, 2017):

$$s_{ij} = [\mathbf{h}^{\text{left}}_i; 1]^\top \, W \, [\mathbf{h}^{\text{right}}_j; 1],$$

where $[v; 1]$ appends $1$ to the end of vector $v$, and $\mathbf{h}^{\text{left}}_i = \text{MLP}_{\text{left}}(x_i)$ and $\mathbf{h}^{\text{right}}_j = \text{MLP}_{\text{right}}(x_j)$ are the outputs of multi-layer perceptrons (MLPs) that take the vectors at the span boundaries as inputs.

Decoding We define the score $s(y)$ of a binary-branching constituency tree $y$ to be the sum of the scores of its spans. The best-scoring tree among all valid trees $Y$ can be found using the CKY algorithm (Cocke, 1969; Kasami, 1965; Younger, 1967).
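As a concrete illustration, the following Python sketch implements this decoding step over a matrix of span scores. It is a standard CKY implementation under the inclusive-index convention above, not the authors' exact code:

```python
import numpy as np

def cky_decode(scores):
    """Find the best-scoring binary-branching tree, where a tree's score
    is the sum of its span scores; scores[i][j] scores span (i, j),
    with inclusive 0-indexed boundaries."""
    n = len(scores)
    best = np.zeros((n, n))             # best[i][j]: best subtree score over (i, j)
    split = np.zeros((n, n), dtype=int) # best split point for each span
    for i in range(n):
        best[i][i] = scores[i][i]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # choose the split k that maximizes (i, k) + (k+1, j)
            k = max(range(i, j), key=lambda m: best[i][m] + best[m + 1][j])
            split[i][j] = k
            best[i][j] = scores[i][j] + best[i][k] + best[k + 1][j]

    def backtrack(i, j, spans):
        spans.append((i, j))
        if i < j:
            k = split[i][j]
            backtrack(i, k, spans)
            backtrack(k + 1, j, spans)
        return spans

    return best[0][n - 1], backtrack(0, n - 1, [])
```

The two nested loops plus the split search give the $O(n^3)$ running time noted in Appendix B.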
Learning Large-margin training (Taskar et al., 2005) is a typical choice for supervised training of constituency parsers. It defines the following loss function to encourage a margin of at least $\Delta(y, y^*)$ between the gold tree $y^*$ and any predicted tree $y$:

$$L(y^*) = \max_{y \in Y} \left[ s(y) + \Delta(y, y^*) \right] - s(y^*),$$

where $\Delta(y, y^*)$ is a distance measure between $y$ and $y^*$. We can reuse the CKY decoder for cost-augmented inference when the distance decomposes into individual spans with some function $c$:

$$\Delta(y, y^*) = \sum_{(i, j) \in y} c(i, j; y^*).$$

In our setting, we do not have access to the gold standard $y^*$; instead, we have a set of bracketings $\tilde{y}$. The score $s(\tilde{y})$ is not meaningful since $\tilde{y}$ is not a complete tree, so we adopt structured ramp loss (Do et al., 2008; Gimpel and Smith, 2012) and define

$$L(\tilde{y}) = \max_{y \in Y} \left[ s(y) + \Delta(y, \tilde{y}) \right] - \max_{y \in Y} \left[ s(y) - \Delta(y, \tilde{y}) \right]$$

using a combination of cost-augmented and cost-diminished inference. This loss function can be understood as the sum of a convex and a concave large-margin loss (Collobert et al., 2006), canceling out the term for directly scoring the gold-standard tree. We consider two methods for incorporating the partial bracketings into the cost functions:

$$c_{\text{strict}}(i, j; \tilde{y}) = \mathbb{1}[(i, j) \notin \tilde{y}] - \mathbb{1}[(i, j) \in \tilde{y}],$$
$$c_{\text{loose}}(i, j; \tilde{y}) = \mathbb{1}[(i, j) \text{ conflicts with } \tilde{y}] - \mathbb{1}[(i, j) \in \tilde{y}],$$

where $\mathbb{1}$ is an indicator function. $c_{\text{loose}}$ is more lenient than $c_{\text{strict}}$, as it does not penalize spans that do not conflict with $\tilde{y}$; both cost definitions promote structures containing the bracketings in $\tilde{y}$. In the supervised setting, where $\tilde{y}$ refers to a fully-annotated tree $y^*$ without conflicting span boundaries, $c_{\text{strict}}$ equals $c_{\text{loose}}$, and the resulting $\Delta(y, y^*)$ cost functions both correspond to the Hamming distance between $y$ and $y^*$.
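To make the training signal concrete, here is a minimal NumPy sketch of the two cost functions and the resulting ramp loss, computed with two CKY passes (reusing `cky_decode` from above). In an actual implementation, the span scores would be differentiable tensors and gradients would flow through the scores of the two decoded trees; the $\pm 1$ reward/penalty magnitudes follow the indicator definitions above:

```python
import numpy as np

def crosses(span, brackets):
    """True if span (i, j) partially overlaps (i.e., conflicts with)
    any bracketing in `brackets`; all indices are inclusive."""
    i, j = span
    return any(i < b <= j < e or b < i <= e < j for (b, e) in brackets)

def c_strict(span, brackets):
    # Reward spans in the partial bracketing; penalize every other span.
    return -1.0 if span in brackets else 1.0

def c_loose(span, brackets):
    # Reward spans in the partial bracketing; penalize only conflicting spans.
    if span in brackets:
        return -1.0
    return 1.0 if crosses(span, brackets) else 0.0

def ramp_loss(scores, brackets, cost_fn):
    """Structured ramp loss via cost-augmented and cost-diminished CKY."""
    n = len(scores)
    cost = np.array([[cost_fn((i, j), brackets) for j in range(n)]
                     for i in range(n)])
    aug_score, _ = cky_decode(np.asarray(scores) + cost)  # cost-augmented
    dim_score, _ = cky_decode(np.asarray(scores) - cost)  # cost-diminished
    return aug_score - dim_score
```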

Experiments and Results
Data and Implementation We evaluate on the PTB (Marcus et al., 1993), with section 23 as the test set. QA-SRL contains 1,241 sentences drawn from the training split (sections 02-21) of the PTB. For Wikipedia, we use a sample of 332,079 sentences that are at most 100 tokens long and contain multi-token internal hyperlinks. We fine-tune the pretrained BERT-base features with a fixed number of mini-batch updates and report results based on five random runs for each setting. See Appendix B for detailed hyperparameter settings and optimization procedures.
Evaluation We follow the evaluation setting of Kim et al. (2019a). More specifically, we discard punctuation and trivial spans (single-word and full-sentence spans) during evaluation and report sentence-level F1 scores as our main metric.
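A minimal sketch of this protocol is given below, assuming inclusive span indices and that punctuation has already been removed from both trees; the handling of sentences left with no non-trivial spans is our assumption, as conventions vary:

```python
def unlabeled_f1(pred_spans, gold_spans, n):
    """Sentence-level unlabeled F1 over non-trivial spans of a
    length-n sentence (punctuation assumed already removed)."""
    trivial = {(i, i) for i in range(n)} | {(0, n - 1)}
    pred = set(pred_spans) - trivial
    gold = set(gold_spans) - trivial
    if not pred and not gold:
        return 1.0  # convention varies; an assumption here
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```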
Results Table 2 shows the evaluation results of our models trained on naturally-occurring bracketings (NOB); Table 3 breaks down the recall ratios for each constituent type. Our distantly-supervised models trained on QA-SRL are competitive with state-of-the-art unsupervised results. Compared with Cao et al. (2020), we obtain higher recall on most constituent types except VPs. Interestingly, the QA-SRL data prefers $c_{\text{strict}}$, while $c_{\text{loose}}$ gives a better F1 score on Wikipedia; this correlates with the fact that QA-SRL has more bracketings per sentence (Table 1). Finally, our Wikipedia data has a larger relative percentage of ADJP bracketings, which explains the higher ADJP recall of the models trained on Wikipedia, despite their lower overall recall.

Related Work
Unsupervised Parsing Our distantly-supervised setting is similar to the unsupervised setting in that it does not require syntactic annotations. Typically, the lack of annotations implies that unsupervised parsers induce grammar from a raw stream of lexical or part-of-speech tokens (Clark, 2001; Klein, 2005). The models are usually generative and learn from (re)constructing sentences based on induced structures (Shen et al., 2018, 2019; Drozdov et al., 2019; Kim et al., 2019a,b). Alternatively, one may use reinforcement learning to induce syntactic structures using rewards defined by end tasks (Yogatama et al., 2017; Choi et al., 2018; Havrylov et al., 2019). Our method is related to learning from constituency tests (Cao et al., 2020), but our use of bracketing data permits discriminative parsing models, which focus directly on the syntactic objective.
Learning from Partial Annotations Full syntactic annotations are costly to obtain, so the alternative of training parsers on partially-annotated data has attracted considerable research attention, especially within the context of active learning for dependency parsing (Sassano, 2005; Sassano and Kurohashi, 2010; Mirroshandel and Nasr, 2011; Flannery et al., 2011; Flannery and Mori, 2015; Li et al., 2016; Zhang et al., 2017) and grammar induction for constituency parsing (Pereira and Schabes, 1992; Hwa, 1999; Riezler et al., 2002). These works typically require expert annotators to produce gold-standard, though partial, annotations. In contrast, our work considers the setting and the challenge of learning from noisy bracketing data, which is more comparable to Spreyer and Kuhn (2009) and Spreyer et al. (2010) on transfer learning for dependency parsing.

Conclusion and Future Work
We argue that naturally-occurring bracketings are a rich resource for inducing syntactic structures. They reflect human judgments of what constitutes a phrase and what does not. More importantly, they require low annotation expertise and effort; for example, webpage hyperlinks can be extracted essentially for free. Empirically, our models trained on QA-SRL and Wikipedia bracketings achieve results competitive with the state of the art in unsupervised constituency parsing. Structural probes have been successful in extracting syntactic knowledge from frozen-weight pre-trained language models (e.g., Hewitt and Manning, 2019), but they still require direct syntactic supervision. Our work shows that it is also feasible to retrieve constituency trees from BERT-based models using distant supervision data.
Our models are limited to the unlabeled setting, and we leave it to future work to automatically cluster the naturally-occurring bracketings and induce phrase labels. Our work also points to potential applications in (semi-)supervised settings, including active learning and domain adaptation (Joshi et al., 2018). Future work can also consider other naturally-occurring bracketings induced from sources such as speech production and reading behavior.

A Data Pre-Processing

A.1 QA-SRL
For all question-answer pairs, we first map the answers to consecutive spans in the corresponding sentences. We keep all exact matches when the answer text appears multiple times in the sentence, and we discard any answers that cannot be mapped to a consecutive span in the sentence.
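A minimal sketch of this answer-to-span mapping (the function name and the exact token-level matching convention are our assumptions):

```python
def answer_to_spans(sentence_tokens, answer_tokens):
    """Map an answer to consecutive token spans in the sentence,
    keeping all exact matches; an empty result means the answer
    cannot be mapped and is discarded."""
    n, m = len(sentence_tokens), len(answer_tokens)
    return [(i, i + m - 1) for i in range(n - m + 1)
            if sentence_tokens[i:i + m] == answer_tokens]
```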

A.2 Wikipedia
We randomly sample 1% of the articles from the 2020-05-01 snapshot of English Wikipedia. We then split the documents into sentences and tokenize with spaCy. This step yields 926,077 sentences, as reported in Table 1. For ground-truth parse trees, we parse the sentences with Kitaev et al.'s (2019) state-of-the-art constituency parser trained on the PTB. For internal hyperlinks where there is a hyperlink-tokenization mismatch, we retrieve the smallest span of tokens that covers the hyperlink. To construct the training set in our main experiments, we filter out sentences longer than 100 tokens and sentences without any multi-token internal hyperlinks. These pre-processing procedures produce 332,079 training sentences.
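A small sketch of retrieving the smallest covering token span for a mismatched hyperlink anchor, assuming per-token character offsets (e.g., derived from spaCy's `token.idx`); the helper itself is hypothetical:

```python
def anchor_to_token_span(token_offsets, link_start, link_end):
    """Return the smallest (first, last) token span covering the anchor's
    character range [link_start, link_end); token_offsets holds one
    (char_start, char_end) pair per token."""
    covered = [k for k, (s, e) in enumerate(token_offsets)
               if e > link_start and s < link_end]  # tokens overlapping the anchor
    return (covered[0], covered[-1]) if covered else None
```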

B Implementation Details
Feature Extractor We use the pretrained BERT-base model as our feature extractor. We tokenize each word in the sentence with BERT's WordPiece tokenizer and take the final-hidden-layer BERT vector of its last WordPiece token as the representation of that word. The feature extractor is fine-tuned along with model training.
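A sketch of this feature-extraction step using the HuggingFace transformers library; the specific `bert-base-uncased` checkpoint is an assumption, as the text only specifies BERT-base:

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

def word_features(words):
    """Represent each word by the final-layer BERT vector of its
    last WordPiece token."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = enc and bert(**enc).last_hidden_state[0]  # (num_wordpieces, 768)
    word_ids = enc.word_ids(0)                         # wordpiece -> word index
    last_piece = {}                                    # last wordpiece of each word
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            last_piece[wid] = pos
    return torch.stack([hidden[last_piece[i]] for i in range(len(words))])
```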
Span Scoring $\text{MLP}_{\text{left}}$ and $\text{MLP}_{\text{right}}$ are single-layer MLPs: they both consist of a linear layer projecting BERT representations to 256-dimensional vectors, followed by a leaky ReLU activation function (Maas et al., 2013). The constituent-scoring component has parameter $W \in \mathbb{R}^{257 \times 257}$. All parameters are randomly initialized (Glorot and Bengio, 2010).
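The span-scoring component could look like the following PyTorch sketch; dimensions follow the description above, but this is an illustration rather than the released implementation:

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    """Single-layer MLPs with leaky ReLU project boundary vectors to 256
    dimensions; a 257x257 biaffine matrix (with appended 1s) scores spans."""
    def __init__(self, bert_dim=768, proj_dim=256):
        super().__init__()
        self.mlp_left = nn.Sequential(nn.Linear(bert_dim, proj_dim), nn.LeakyReLU())
        self.mlp_right = nn.Sequential(nn.Linear(bert_dim, proj_dim), nn.LeakyReLU())
        self.W = nn.Parameter(torch.empty(proj_dim + 1, proj_dim + 1))
        nn.init.xavier_uniform_(self.W)  # Glorot initialization

    def forward(self, x):  # x: (n, bert_dim) word representations
        ones = x.new_ones(x.size(0), 1)
        left = torch.cat([self.mlp_left(x), ones], dim=-1)    # (n, 257)
        right = torch.cat([self.mlp_right(x), ones], dim=-1)  # (n, 257)
        # scores[i, j] is the biaffine score for span (i, j);
        # only entries with i <= j are used by the decoder.
        return left @ self.W @ right.t()
```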
Training and Optimization We optimize the neural networks using the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-12}$. For each batch, we sample 8 sentences from the training set and average the per-sentence losses. Gradients are clipped at 1.0 before each parameter update. The learning rate increases linearly from zero to $1 \times 10^{-5}$ over the first 2,000 training steps. After warmup, we keep training the model until we reach 20,000 training steps. We do not perform early stopping: in the unsupervised parsing setting, we do not look at validation accuracies until training finishes. We leave exploring other model-selection strategies to future work.
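A compact sketch of this optimization recipe; `model`, `sample_batch`, and `ramp_loss_for` are hypothetical placeholders for the parser, the data sampler, and the per-sentence loss described in the Parsing Model section:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), eps=1e-12)
WARMUP, TOTAL, PEAK_LR = 2_000, 20_000, 1e-5

for step in range(1, TOTAL + 1):
    # Linear warmup to the peak learning rate, then held constant.
    lr = PEAK_LR * min(1.0, step / WARMUP)
    for group in optimizer.param_groups:
        group["lr"] = lr
    batch = sample_batch(train_sentences, size=8)               # hypothetical helper
    loss = sum(ramp_loss_for(ex) for ex in batch) / len(batch)  # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)     # clip at 1.0
    optimizer.step()
```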
Hyperparameter Selection We use the default recommended $\beta_1$, $\beta_2$, and $\epsilon$ values for the Adam optimizer, and we use a typical fine-tuning learning rate for the pre-trained BERT model (Devlin et al., 2019). The number of training steps is based on our preliminary observation of the convergence of the training loss, and the batch size is limited by our computing hardware. We fix the initial values for the size of the biaffine matrix ($257 \times 257$) and the number of warmup steps (2,000) throughout our experiments. A better hyperparameter selection strategy may lead to improved results.
Speed For a length-$n$ sentence, the time complexity of the CKY decoder is $O(n^3)$. On an RTX 2080 GPU, our model parses 409 sentences per second on average, and training each model finishes within 2 hours.