Syntactic Scaffolds for Semantic Structures

We introduce the syntactic scaffold, an approach to incorporating syntactic information into semantic tasks. Syntactic scaffolds avoid expensive syntactic processing at runtime, only making use of a treebank during training, through a multitask objective. We improve over strong baselines on PropBank semantics, frame semantics, and coreference resolution, achieving competitive performance on all three tasks.


Introduction
As algorithms for the semantic analysis of natural language sentences have developed, the role of syntax has been repeatedly revisited. Linguistic theories have argued for a very tight integration of syntactic and semantic processing (Steedman, 2000; Copestake and Flickinger, 2000), and many systems have used syntactic dependency or phrase-based parsers as preprocessing for semantic analysis (Gildea and Palmer, 2002; Punyakanok et al., 2008; Das et al., 2014). Meanwhile, some recent methods forgo explicit syntactic processing altogether (Zhou and Xu, 2015; Peng et al., 2017).
Because annotated training datasets for semantics will always be limited, we expect that syntax, which offers an incomplete but potentially useful view of semantic structure, will continue to offer useful inductive bias, encouraging semantic models toward better generalization. We address the central question: is there a way for semantic analyzers to benefit from syntax without the computational cost of syntactic parsing?
We propose a multitask learning approach to incorporating syntactic information into learned representations of neural semantics models (§2). Our approach, the syntactic scaffold, minimizes an auxiliary supervised loss function, derived from a syntactic treebank. The goal is to steer the distributed, contextualized representations of words and spans toward accurate semantic and syntactic labeling. We avoid the cost of training or executing a full syntactic parser, and at test time (i.e., runtime in applications) the semantic analyzer has no additional cost over a syntax-free baseline. Further, the method does not assume that the syntactic treebank overlaps the dataset for the primary task.
Many semantic tasks involve labeling spans, including semantic role labeling (SRL; Gildea and Jurafsky, 2002) and coreference resolution (Ng, 2010) (tasks we consider in this paper), as well as named entity recognition and some reading comprehension and question answering tasks (Rajpurkar et al., 2016). These spans are usually syntactic constituents (cf. PropBank; Palmer et al., 2005), making phrase-based syntax a natural choice for a scaffold. See Figure 1 for an example sentence with syntactic and semantic annotations. Since the scaffold task is not an end in itself, we relax the syntactic parsing problem to a collection of independent span-level predictions, with no constraint that they form a valid parse tree. This means we never need to run a syntactic parsing algorithm.
Our experiments demonstrate that the syntactic scaffold offers a substantial boost to state-of-the-art baselines for two SRL tasks (§5) and coreference resolution (§6). Our models use the strongest available neural network architectures for these tasks, integrating deep representation learning and structured prediction at the level of spans (Kong et al., 2016). For SRL, the baseline itself is a novel globally normalized structured conditional random field, which outperforms the previous state of the art. Syntactic scaffolds result in further improvements over prior work: 3.6 absolute F1 in FrameNet SRL, 1.1 absolute F1 in PropBank SRL, and 0.6 F1 in coreference resolution (averaged across three standard scores).
Our code is open source and available at https://github.com/swabhs/scaffolding.
Figure 1: An example sentence with syntactic, PropBank and coreference annotations from OntoNotes, and author-annotated frame-semantic structures. PropBank SRL arguments and coreference mentions are annotated on top of syntactic constituents. All but one frame-semantic argument (Event) is a syntactic constituent. Targets evoke frames shown in the color-coded layers.

Syntactic Scaffolds
Multitask learning (Caruana, 1997) is a collection of techniques in which two or more tasks are learned from data with at least some parameters shared. We assume there is only one task about whose performance we are concerned, denoted $T_1$ (in this paper, $T_1$ is either SRL or coreference resolution). We use the term "scaffold" to refer to a second task, $T_2$, that can be combined with $T_1$ during multitask learning. A scaffold task is only used during training; it holds no intrinsic interest beyond biasing the learning of $T_1$, and after learning is completed, the scaffold is discarded.
A syntactic scaffold is a task designed to steer the (shared) model toward awareness of syntactic structure. It could be defined through a syntactic parser that shares some parameters with $T_1$'s model. Since syntactic parsing is costly, we use simpler syntactic prediction problems (discussed below) that do not produce whole trees.
As with multitask learning in general, we do not assume that the same data are annotated with outputs for $T_1$ and $T_2$. In this work, $T_2$ is defined using phrase-structure syntactic annotations from OntoNotes 5.0 (Weischedel et al., 2013). We experiment with three settings: one where the corpus for $T_2$ does not overlap with the training datasets for $T_1$ (frame SRL) and two where there is a complete overlap (PropBank SRL and coreference). Compared to approaches that require multiple output labels over the same data, the scaffold has the major advantage of not requiring any assumptions about, or specification of, the relationship between the outputs of $T_1$ and $T_2$.

Related Work
We briefly contrast the syntactic scaffold with existing alternatives.
Pipelines. In a typical pipeline, $T_1$ and $T_2$ are separately trained, with the output of $T_2$ used to define the inputs to $T_1$ (Wolpert, 1992). Using syntax as $T_2$ in a pipeline is perhaps the most common approach for semantic structure prediction (Toutanova et al., 2008; Yang and Mitchell, 2017; Wiseman et al., 2016). However, pipelines introduce the problem of cascading errors ($T_2$'s mistakes affect the performance, and perhaps the training, of $T_1$; He et al., 2013). To date, remedies to cascading errors are so computationally expensive as to be impractical (e.g., Finkel et al., 2006). A syntactic scaffold is quite different from a pipeline since the output of $T_2$ is never explicitly used.
Latent variables. Another solution is to treat the output of $T_2$ as a (perhaps structured) latent variable. This approach obviates the need for supervision for $T_2$ and requires marginalization (or some approximation to it) in order to reason about the outputs of $T_1$. Syntax as a latent variable for semantics was explored by Zettlemoyer and Collins (2005) and Naradowsky et al. (2012). Apart from avoiding marginalization, the syntactic scaffold offers a way to use auxiliary syntactically annotated data as direct supervision for $T_2$, and it need not overlap the $T_1$ training data.
Joint learning of syntax and semantics. The motivation behind joint learning of syntactic and semantic representations is that any one task is helpful in predicting the other (Lluís and Màrquez, 2008; Lluís et al., 2013; Henderson et al., 2013; Swayamdipta et al., 2016). This typically requires joint prediction of the outputs of $T_1$ and $T_2$, which tends to be computationally expensive at both training and test time.
Part of speech scaffolds. Similar to our work, there have been multitask models that use part-of-speech tagging as $T_2$, with transition-based dependency parsing (Zhang and Weiss, 2016) and CCG supertagging (Søgaard and Goldberg, 2016) as $T_1$. Both of the above approaches assumed parallel input data and used both tasks as supervision. Notably, we simplify our $T_2$, throwing away the structured aspects of syntactic parsing, whereas part-of-speech tagging has very little structure to begin with. While their approach results in improved token-level representations learned via supervision from POS tags, these must still be composed to obtain span representations. Instead, our approach learns span-level representations from phrase-type supervision directly, for semantic tasks. Additionally, these methods explore architectural variations in RNN layers for including supervision, whereas we focus on incorporating supervision with minimal changes to the baseline architecture. To the best of our knowledge, such simplified syntactic scaffolds have not been tried before.
Word embeddings. Our definition of a scaffold task almost includes stand-alone methods for estimating word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018). After training word embeddings, the tasks implied by models like the skip-gram or ELMo's language model become irrelevant to the downstream use of the embeddings. A noteworthy difference is that, rather than pre-training, a scaffold is integrated directly into the training of $T_1$ through a multitask objective.
Multitask learning. Neural architectures have often yielded performance gains when trained for multiple tasks together (Collobert et al., 2011; Luong et al., 2015; Chen et al., 2017; Hashimoto et al., 2017). In particular, performance of semantic role labeling tasks improves when done jointly with other semantic tasks (FitzGerald et al., 2015; Peng et al., 2017, 2018). Contemporaneously with this work, Hershcovich et al. (2018) proposed a multitask learning setting for universal syntactic dependencies and UCCA semantics (Abend and Rappoport, 2013). Syntactic scaffolds focus on a primary semantic task, treating syntax as an auxiliary, eventually forgettable prediction task.

Syntactic Scaffold Model
We assume two sources of supervision: a corpus $D_1$ with instances $x$ annotated for the primary task's outputs $y$ (semantic role labeling or coreference resolution), and a treebank $D_2$ with sentences $x$, each with a phrase-structure tree $z$.

Loss
Each task has an associated loss, and we seek to minimize the combination of task losses,
$$\sum_{(x, y) \in D_1} L_1(x, y) + \delta \sum_{(x, z) \in D_2} L_2(x, z),$$
with respect to parameters, which are partially shared, where $\delta$ is a tunable hyperparameter. In the rest of this section, we describe the scaffold task. We define the primary tasks in Sections 5-6. Each input is a sequence of tokens, $x = \langle x_1, x_2, \ldots, x_n \rangle$, for some $n$. We refer to a span of contiguous tokens in the sentence as $x_{i:j} = \langle x_i, x_{i+1}, \ldots, x_j \rangle$, for any $1 \le i \le j \le n$. In our experiments we consider only spans up to a maximum length $D$, resulting in $O(nD)$ spans.
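As a minimal sketch of the two ideas in this paragraph (not the released implementation), the following Python enumerates the candidate spans under a maximum width and combines per-instance losses with the weight δ. The function names `enumerate_spans` and `multitask_loss` are hypothetical, and the loss values are assumed to be computed elsewhere.

```python
def enumerate_spans(n, max_width):
    """All 1-indexed, inclusive spans (i, j) with 1 <= i <= j <= n and j - i < max_width."""
    return [
        (i, j)
        for i in range(1, n + 1)
        for j in range(i, min(n, i + max_width - 1) + 1)
    ]


def multitask_loss(primary_losses, scaffold_losses, delta):
    """Sum of primary-task losses plus delta times the sum of scaffold losses.

    The two lists may come from different corpora; nothing here assumes
    that the same sentences are annotated for both tasks.
    """
    return sum(primary_losses) + delta * sum(scaffold_losses)


spans = enumerate_spans(n=5, max_width=2)
# At most n * max_width spans are produced, which is the O(nD) bound.
assert len(spans) <= 5 * 2
```

Because the span list is bounded by n times the maximum width, every later per-span computation is linear in sentence length for a fixed width limit.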
Supervision comes from a phrase-syntactic tree $z$ for the sentence, comprising a syntactic category $z_{i:j} \in C$ for every span $x_{i:j}$ in $x$ (many spans are given a null label). We experiment with different sets of labels $C$ (§4.2).
In our model, every span $x_{i:j}$ is represented by an embedding vector $v_{i:j}$ (see details in §5.3). A distribution over the category assigned to $z_{i:j}$ is derived from $v_{i:j}$:
$$p(z_{i:j} = c \mid x_{i:j}) = \mathrm{softmax}_{c}\; w_c \cdot v_{i:j},$$
where $w_c$ is a parameter vector associated with category $c$. We sum the log loss terms for all the spans in a sentence to give its loss:
$$L_2(x, z) = -\sum_{1 \le i \le j \le n;\; j - i < D} \log p(z_{i:j} \mid x_{i:j}).$$
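The span-level scaffold classifier described here can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: span embeddings are given as plain lists of floats, each category has one weight vector, and the names `log_softmax` and `scaffold_loss` are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def scaffold_loss(span_embeddings, gold_labels, weights):
    """Sum of log losses over all spans of one sentence.

    span_embeddings: dict mapping a span (i, j) to its embedding (list of floats).
    gold_labels: dict mapping the same spans to a category index.
    weights: list of weight vectors, one per syntactic category.
    """
    loss = 0.0
    for span, v in span_embeddings.items():
        # One dot product per category, then a softmax over categories.
        scores = [sum(w_k * v_k for w_k, v_k in zip(w_c, v)) for w_c in weights]
        loss -= log_softmax(scores)[gold_labels[span]]
    return loss
```

Each span is classified independently, so no parsing algorithm is needed and the predicted labels are not required to form a tree.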

Labels for the Syntactic Scaffold Task
Different kinds of syntactic labels can be used for learning syntactically-aware span representations:
• Constituent identity: $C = \{0, 1\}$; is a span a constituent, or not?
• Non-terminal: $c$ is the category of a span, including a null for non-constituents.
• Non-terminal and parent: $c$ is the category of a span, concatenated with the category of its immediate ancestor. null is used for non-constituents, and for empty ancestors.
• Common non-terminals: Since a majority of semantic arguments and entity mentions are labeled with a small number of syntactic categories, we experiment with a three-way classification among (i) noun phrase (or prepositional phrase, for frame SRL); (ii) any other category; and (iii) null.
In Figure 1, for the span "encouraging them", the constituent identity scaffold label is 1, the non-terminal label is S|VP, the non-terminal and parent label is S|VP+par=PP, and the common non-terminals label is set to OTHER.
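To make the four label schemes concrete, here is a hedged sketch that derives a scaffold label for one span. It assumes the treebank has already been flattened into a dictionary from spans to (category, parent category) pairs. That representation, the function name `scaffold_label`, the scheme names, and the class name COMMON are illustrative assumptions, not part of the original system.

```python
NULL = "null"


def scaffold_label(span, constituents, scheme, common=("NP",)):
    """Return the scaffold label of `span` under one of four schemes.

    constituents: dict mapping a span (i, j) to (category, parent_category);
    parent_category is None for a constituent without an ancestor.
    common: categories grouped into the first class of the three-way scheme.
    """
    entry = constituents.get(span)
    if scheme == "constituent_identity":
        return 1 if entry is not None else 0
    if entry is None:
        return NULL  # non-constituents get the null label
    category, parent = entry
    if scheme == "nonterminal":
        return category
    if scheme == "nonterminal_and_parent":
        return f"{category}+par={parent if parent is not None else NULL}"
    if scheme == "common_nonterminals":
        return "COMMON" if category in common else "OTHER"
    raise ValueError(f"unknown scheme: {scheme}")


# The span "encouraging them" from the running example, at assumed positions (3, 4).
tree = {(3, 4): ("S|VP", "PP")}
label = scaffold_label((3, 4), tree, "nonterminal_and_parent")
```

For frame SRL the `common` argument would be `("NP", "PP")`, which keeps the scheme a three-way classification.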

Semantic Role Labeling
We contribute a new SRL model, which provides a strong baseline for experiments with syntactic scaffolds. The performance of this baseline itself is competitive with state-of-the-art methods (§7).
FrameNet. In the FrameNet lexicon (Baker et al., 1998), a frame represents a type of event, situation, or relationship, and is associated with a set of semantic roles, called frame elements. A frame can be evoked by a word or phrase in a sentence, called a target. Each frame element of an evoked frame can then be realized in the sentence as a sentential span, called an argument (or it can be unrealized). Arguments for a given frame do not overlap.
PropBank. PropBank similarly disambiguates predicates and identifies argument spans. Targets are disambiguated to lexically specific senses rather than shared frames, and a set of generic roles is used for all targets, reducing the argument label space by a factor of 17. Most importantly, the arguments were annotated on top of syntactic constituents, directly coupling syntax and semantics. A detailed example for both formalisms is provided in Figure 1.
Semantic structure prediction is the task of identifying targets, labeling their frames or senses, and labeling all their argument spans in a sentence. Here we assume gold targets and frames, and consider only the SRL task.
Formally, a single input instance for argument identification consists of: an $n$-word sentence $x = \langle x_1, x_2, \ldots, x_n \rangle$, a single target span $t = \langle t_{\text{start}}, t_{\text{end}} \rangle$, and its evoked frame, or sense, $f$. The argument labeling task is to produce a segmentation of the sentence, $\mathbf{s} = \langle s_1, s_2, \ldots, s_m \rangle$, for each input $x$. A segment $s = \langle i, j, y_{i:j} \rangle$ corresponds to a labeled span of the sentence, where the label $y_{i:j} \in Y_f \cup \{\text{null}\}$ is either a role that the span fills, or null if the span does not fill any role. In the case of PropBank, $Y_f$ consists of all possible roles. The segmentation is constrained so that argument spans cover the sentence and do not overlap ($i_{k+1} = 1 + j_k$ for $s_k$; $i_1 = 1$; $j_m = n$). Segments of length 1 such that $i = j$ are allowed. A separate segmentation is predicted for each target annotation in a sentence.
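The segmentation constraint can be illustrated by converting a set of annotated arguments into a full labeled segmentation. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the optional splitting of long null gaps into chunks of at most `max_width` tokens is one plausible way to respect a span-length limit, not a detail confirmed by the text.

```python
def null_segments(start, end, max_width=None, null_label="null"):
    """Cover tokens start..end (inclusive) with null segments, optionally chunked."""
    if start > end:
        return []
    width = max_width or (end - start + 1)
    return [
        (k, min(k + width - 1, end), null_label)
        for k in range(start, end + 1, width)
    ]


def to_segmentation(n, arguments, max_width=None, null_label="null"):
    """Turn non-overlapping arguments (i, j, role) into segments covering 1..n."""
    segments, next_start = [], 1
    for i, j, role in sorted(arguments):
        if not (next_start <= i <= j <= n):
            raise ValueError("arguments must be in bounds and must not overlap")
        segments += null_segments(next_start, i - 1, max_width, null_label)
        segments.append((i, j, role))
        next_start = j + 1
    segments += null_segments(next_start, n, max_width, null_label)
    return segments


segmentation = to_segmentation(6, [(2, 3, "ARG0"), (5, 5, "ARG1")])
```

Every consecutive pair of segments in the result satisfies the adjacency condition stated above, with the first segment starting at token 1 and the last ending at token n.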

Semi-Markov CRF
In order to model the non-overlapping arguments of a given target, we use a semi-Markov conditional random field (semi-CRF; Sarawagi et al., 2004). Semi-CRFs define a conditional distribution over labeled segmentations of an input sequence, and are globally normalized. A single target's arguments can be neatly encoded as a labeled segmentation by giving the spans in between arguments a reserved null label. Semi-Markov models are more powerful than BIO tagging schemes, which have been used successfully for PropBank SRL (Collobert et al., 2011; Zhou and Xu, 2015, inter alia), because the semi-Markov assumption allows scoring variable-length segments, rather than fixed-length label n-grams as under an $(n-1)$-order Markov assumption. Computing the marginal likelihood with a semi-CRF can be done using dynamic programming in $O(n^2)$ time (§5.2). By filtering out segments longer than $D$ tokens, this is reduced to $O(nD)$.
Given an input $x$, a semi-CRF defines a conditional distribution $p(\mathbf{s} \mid x)$. Every segment $s = \langle i, j, y_{i:j} \rangle$ is given a real-valued score, $\psi(\langle i, j, y_{i:j} = r \rangle, x_{i:j}) = w_r \cdot v_{i:j}$, where $v_{i:j}$ is an embedding of the span (§5.3) and $w_r$ is a parameter vector corresponding to its label. The score of the entire segmentation $\mathbf{s}$ is the sum of the scores of its segments: $\Psi(x, \mathbf{s}) = \sum_{k=1}^{m} \psi(s_k, x_{i_k:j_k})$. These scores are exponentiated and normalized to define the probability distribution. The sum-product variant of the semi-Markov dynamic programming algorithm is used to calculate the normalization term (required during learning). At test time, the max-product variant returns the most probable segmentation, $\hat{\mathbf{s}} = \arg\max_{\mathbf{s}} \Psi(x, \mathbf{s})$.
The parameters of the semi-CRF are learned to maximize a criterion related to the conditional log-likelihood of the gold-standard segments in the training corpus (§5.2). The learner evaluates and adjusts segment scores $\psi(s_k, x)$ for every span in the sentence, which in turn involves learning embedded representations for all spans (§5.3).

Softmax-Margin Objective
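Typically CRF and semi-CRF models are trained to maximize a conditional log-likelihood objective. In early experiments, we found that incorporating a structured cost was beneficial; we do so by using a softmax-margin training objective (Gimpel and Smith, 2010), a "cost-aware" variant of log-likelihood:
$$\max \sum_{(x, \mathbf{s}^*) \in D_1} \log \frac{\exp \Psi(x, \mathbf{s}^*)}{Z(x, \mathbf{s}^*)}, \qquad Z(x, \mathbf{s}^*) = \sum_{\mathbf{s}} \exp\{\Psi(x, \mathbf{s}) + \mathrm{cost}(\mathbf{s}, \mathbf{s}^*)\}.$$
We design the cost function so that it factors by predicted span, in the same way $\Psi$ does:
$$\mathrm{cost}(\mathbf{s}, \mathbf{s}^*) = \sum_{s \in \mathbf{s}} \mathrm{cost}(s, \mathbf{s}^*) = \sum_{s \in \mathbf{s}} I(s \notin \mathbf{s}^*).$$
The softmax-margin criterion, like log-likelihood, is globally normalized over all of the exponentially many possible labeled segmentations. The following zeroth-order semi-Markov dynamic program (Sarawagi et al., 2004) efficiently computes the new partition function:
$$\alpha_j = \sum_{s = \langle i, j, y_{i:j} \rangle;\; j - i < D} \alpha_{i-1} \exp\{\psi(s, x_{i:j}) + \mathrm{cost}(s, \mathbf{s}^*)\},$$
where $Z = \alpha_n$, under the base case $\alpha_0 = 1$.
The prediction under the model can be calculated using a similar dynamic program with the following recurrence, where $\gamma_0 = 1$:
$$\gamma_j = \max_{s = \langle i, j, y_{i:j} \rangle;\; j - i < D} \gamma_{i-1} \exp \psi(s, x_{i:j}).$$
Our model formulation enforces that arguments do not overlap. We do not enforce any other SRL constraints, such as non-repetition of core frame elements (Das et al., 2012).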
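The two dynamic programs described in this section can be sketched together in plain Python. This is a minimal illustration, computed in log space for numerical stability, not the authors' implementation. The names `semi_markov`, `score`, and `cost` are hypothetical, segment scores are supplied by a callback, and the cost is added only inside the partition function, as in softmax-margin training.

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def semi_markov(n, labels, score, max_width, cost=None):
    """Zeroth-order semi-Markov dynamic programs over a sentence of n tokens.

    score(i, j, y): segment score for the 1-indexed inclusive span (i, j) with label y.
    cost(i, j, y): optional per-segment cost, used only in the partition function.
    Returns (log partition function, highest-scoring segmentation).
    """
    log_alpha = [0.0] + [None] * n  # log alpha_0 = log 1
    log_gamma = [0.0] + [None] * n  # log gamma_0 = log 1
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        terms, best_score, best_seg = [], -math.inf, None
        # Only segments of width at most max_width may end at position j.
        for i in range(max(1, j - max_width + 1), j + 1):
            for y in labels:
                psi = score(i, j, y)
                augmented = psi + (cost(i, j, y) if cost else 0.0)
                terms.append(log_alpha[i - 1] + augmented)  # sum-product
                if log_gamma[i - 1] + psi > best_score:     # max-product
                    best_score, best_seg = log_gamma[i - 1] + psi, (i, j, y)
        log_alpha[j] = logsumexp(terms)
        log_gamma[j], back[j] = best_score, best_seg
    segments, j = [], n
    while j > 0:  # follow back-pointers from the end of the sentence
        i, _, y = back[j]
        segments.append((i, j, y))
        j = i - 1
    return log_alpha[n], segments[::-1]
```

With the cost callback omitted, the first return value is the ordinary log partition function of the semi-CRF, so the same routine serves both likelihood and softmax-margin training. Each position considers at most `max_width` start points, which gives the linear-in-n running time mentioned earlier for a fixed width limit.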

Input Span Representation
This section describes the neural architecture used to obtain the span embedding, $v_{i:j}$, corresponding to a span $x_{i:j}$ and the target in consideration, $t = \langle t_{\text{start}}, t_{\text{end}} \rangle$. For the scaffold task, since the syntactic treebank does not contain annotations for semantic targets, we use the last verb in the sentence as a placeholder target, wherever target features are used. If there are no verbs, we use the first token in the sentence as a placeholder target.
The parameters used to learn $v$ are shared between the tasks. We construct an embedding for the span using
• $h_i$ and $h_j$: contextualized embeddings for the words at the span boundary (§5.3.1),
• $u_{i:j}$: a span summary that pools over the contents of the span (§5.3.2), and
• $a_{i:j}$: a hand-engineered feature vector for the span (§5.3.3).
This embedding is then passed to a feedforward layer to compute the span representation, $v_{i:j}$.
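The placeholder-target rule for scaffold sentences is simple enough to sketch directly. The snippet assumes Penn Treebank style part-of-speech tags, in which verb tags begin with "VB". That tag convention and the function name `placeholder_target` are assumptions made for illustration.

```python
def placeholder_target(pos_tags):
    """Return the 1-indexed (start, end) of the last verb, else of the first token."""
    for q in range(len(pos_tags), 0, -1):
        if pos_tags[q - 1].startswith("VB"):  # assumed Penn Treebank verb tags
            return (q, q)
    return (1, 1)


target = placeholder_target(["DT", "NN", "VBD", "DT", "NN"])
```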

Contextualized Token Embeddings
To obtain contextualized embeddings of each token in the input sequence, we run a bidirectional LSTM (Graves, 2012) with layers over the full input sequence. To indicate which token is a predicate, a linearly transformed one-hot embedding $v$ is used, following Zhou and Xu (2015). The input vector representing the token at position $q$ in the sentence is the concatenation of a fixed pretrained embedding $x_q$ and $v_q$. When given as input to the bidirectional LSTM, this yields a hidden state vector $h_q$ representing the $q$th token in the context of the sentence.

Span Summary
Tokens within a span might convey different amounts of information necessary to label the span as a semantic argument. Following prior work, we use an attention mechanism (Bahdanau et al., 2014) to summarize each span. Each contextualized token in the span is passed through a feed-forward network to obtain a weight, normalized to give $\sigma_k = \mathrm{softmax}_{i \le k \le j}\; w_{\text{head}} \cdot h_k$, where $w_{\text{head}}$ is a learned parameter. The weights $\sigma$ are then used to obtain a vector that summarizes the span, $u_{i:j} = \sum_{i \le k \le j;\; j - i < D} \sigma_k \cdot h_k$.
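The attention-based span summary can be sketched as follows, using the dot-product form of the weights given in the formula. Token states are plain lists of floats here, and the names `span_summary` and `w_head` are illustrative. A real model would compute this with batched tensor operations.

```python
import math


def span_summary(hidden, i, j, w_head):
    """Attention-weighted average of the token states inside span (i, j).

    hidden: list of token vectors, where hidden[k - 1] is the state of token k.
    i, j: 1-indexed inclusive span boundaries.
    w_head: parameter vector used to score each token.
    """
    tokens = hidden[i - 1:j]
    logits = [sum(w * h for w, h in zip(w_head, h_k)) for h_k in tokens]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    sigma = [e / total for e in exps]  # softmax over tokens in the span
    dim = len(tokens[0])
    return [sum(s * h_k[d] for s, h_k in zip(sigma, tokens)) for d in range(dim)]


states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
summary = span_summary(states, 1, 2, w_head=[0.0, 0.0])
```

With an all-zero `w_head` the weights are uniform, so the summary reduces to the mean of the token states in the span.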

Span Features
We use the following three features for each span:
• width of the span in tokens (Das et al., 2014)
• distance (in tokens) of the span from the target
• position of the span with respect to the target (before, after, overlap)
Each of these features is encoded as a one-hot embedding and then linearly transformed to yield a feature vector, $a_{i:j}$.
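The three span features can be sketched as below. The exact definition of distance is not spelled out in the text, so the sketch assumes it is the gap between the nearest boundaries of span and target, and zero when they overlap. The names `span_features` and `one_hot`, and the clipping of large values into the last one-hot bucket, are likewise assumptions.

```python
def span_features(i, j, t_start, t_end):
    """Width, distance from target, and relative position of span (i, j)."""
    width = j - i + 1
    if j < t_start:
        position, distance = "before", t_start - j
    elif i > t_end:
        position, distance = "after", i - t_end
    else:
        position, distance = "overlap", 0
    return {"width": width, "distance": distance, "position": position}


def one_hot(index, size):
    """One-hot vector of length `size`; indices past the end share the last slot."""
    vec = [0.0] * size
    vec[min(index, size - 1)] = 1.0
    return vec


feats = span_features(1, 2, t_start=4, t_end=4)
width_vec = one_hot(feats["width"], size=10)
```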

Coreference Resolution
Coreference resolution is the task of determining clusters of mentions that refer to the same entity. Formally, the input is a document $x = \langle x_1, x_2, \ldots, x_n \rangle$ consisting of $n$ words. The goal is to predict a set of clusters $\mathbf{c} = \{c_1, c_2, \ldots\}$, where each cluster $c = \{s_1, s_2, \ldots\}$ is a set of spans and each span $s = \langle i, j \rangle$ is a pair of indices such that $1 \le i \le j \le n$.
As a baseline, we use an existing model from prior work, which we describe briefly in this section. This model decomposes the prediction of coreference clusters into a series of span classification decisions. Every span $s$ predicts an antecedent $w_s \in Y(s) = \{\text{null}, s_1, s_2, \ldots, s_m\}$. Labels $s_1$ to $s_m$ indicate a coreference link between $s$ and one of the $m$ spans that precede it, and null indicates that $s$ does not link to anything, either because it is not a mention or it is in a singleton cluster. The predicted clustering of the spans can be recovered by aggregating the predicted links.
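Recovering clusters by aggregating predicted links amounts to finding connected components. A small union-find sketch is shown below. It assumes the predicted antecedents are given as a dictionary from each span to its antecedent span, with `None` standing for null. The function name `clusters_from_links` is hypothetical.

```python
def clusters_from_links(antecedent):
    """Group spans into clusters by following predicted antecedent links.

    antecedent: dict mapping each span (i, j) to its predicted antecedent span,
    or to None when the null antecedent was predicted.
    Spans that take part in no link are left out (non-mentions or singletons).
    """
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for span, ante in antecedent.items():
        if ante is not None:
            parent[find(span)] = find(ante)

    groups = {}
    for s in list(parent):
        groups.setdefault(find(s), set()).add(s)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


links = {(1, 2): None, (5, 5): (1, 2), (8, 9): (5, 5), (3, 3): None}
clusters = clusters_from_links(links)
```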
Analogous to the SRL model (§5), every span $s$ is represented by an embedding $v_s$, which is central to the model. For each span $s$ and a potential antecedent $a \in Y(s)$, pairwise coreference scores $\Psi(v_s, v_a, \phi(s, a))$ are computed via feedforward networks with the span embeddings as input. $\phi(s, a)$ are pairwise discrete features encoding the distance between span $s$ and span $a$ and metadata, such as the genre and speaker information. We refer the reader to the original description of the baseline model for the details of the scoring function.
The scores from $\Psi$ are normalized over the possible antecedents $Y(s)$ of each span to induce a probability distribution for every span:
$$p(w_s = a) = \mathrm{softmax}_{a \in Y(s)}\; \Psi(v_s, v_a, \phi(s, a)).$$
In learning, we minimize the negative log-likelihood marginalized over the possibly correct antecedents:
$$-\sum_{s \in D} \log \sum_{a^* \in G(s) \cap Y(s)} p(w_s = a^*),$$
where $D$ is the set of spans in the training dataset, and $G(s)$ indicates the gold cluster of $s$ if it belongs to one and $\{\text{null}\}$ otherwise.
To operate under reasonable computational requirements, inference under this model requires a two-stage beam search, which reduces the number of span pairs considered. We refer the reader to the original description of the baseline model for details.
Input span representation. The input span embedding $v_s$ for coreference resolution and its syntactic scaffold follows the definition used in §5.3, with the key difference of using no target features. Since there is a complete overlap of input sentences between the scaffold treebank $D_{\text{sc}}$ and the primary-task corpus $D_{\text{pr}}$ (written $D_2$ and $D_1$ earlier), as the coreference annotations are also from OntoNotes (Pradhan et al., 2012), we reuse the $v$ for the scaffold task. Additionally, instead of the entire document, each sentence in it is independently given as input to the bidirectional LSTMs.
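The marginalized objective described in this paragraph can be sketched as follows, with antecedent scores given as plain dictionaries. The names `antecedent_log_probs` and `coref_loss` are hypothetical. Treating null as the correct choice when none of a span's gold antecedents is among its candidates (for example, the first mention of a cluster) is an assumption that follows common practice rather than something stated in the text.

```python
import math


def antecedent_log_probs(scores):
    """Log-softmax over candidate antecedents; the key None stands for null."""
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in scores.values()))
    return {a: v - log_z for a, v in scores.items()}


def coref_loss(all_scores, gold_clusters):
    """Negative log-likelihood marginalized over all correct antecedents.

    all_scores: dict mapping each span to {candidate antecedent or None: score}.
    gold_clusters: dict mapping a span to the set of spans in its gold cluster;
    spans absent from this dict have null as their only correct antecedent.
    """
    loss = 0.0
    for span, scores in all_scores.items():
        log_p = antecedent_log_probs(scores)
        # Gold antecedents are the gold cluster members among the candidates.
        gold = gold_clusters.get(span, set()) & set(scores)
        if not gold:
            gold = {None}  # assumed fallback: null is correct
        loss -= math.log(sum(math.exp(log_p[a]) for a in gold))
    return loss
```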

Results
We evaluate our models on the test set of FrameNet 1.5 for frame SRL and on the test set of OntoNotes for both PropBank SRL and coreference. For the syntactic scaffold in each case, we use syntactic annotations from OntoNotes 5.0 (Weischedel et al., 2013). Further details on experimental settings and datasets are given in the supplemental material.
Frame SRL. Table 1 shows the performance of all the scaffold models on frame SRL with respect to prior work and a semi-CRF baseline (§5.1) without a syntactic scaffold. We follow the official evaluation from the SemEval shared task for frame-semantic parsing (Baker et al., 2007).
Prior work for frame SRL has relied on predicted syntactic trees, in two different ways: by using syntax-based rules to prune out spans of text that are unlikely to contain any frame's argument; and by using syntactic features in their statistical model (Das et al., 2014;FitzGerald et al., 2015;Kshirsagar et al., 2015).
The best published results on FrameNet 1.5 are due to Yang and Mitchell (2017). In their sequential model (Seq), they treat argument identification as a sequence-labeling problem using a deep bidirectional LSTM with a CRF layer. In their relational model (Rel), they treat the same problem as a span classification problem. Finally, they introduce an ensemble to integrate both models, and use an integer linear program for inference satisfying SRL constraints. Though their model does not do any syntactic pruning, it does use syntactic features for argument identification and labeling. Notably, all prior systems for frame SRL listed in Table 1 use a pipeline of syntax and semantics. Our semi-CRF baseline outperforms all prior work, without any syntax. This highlights the benefits of modeling spans and of global normalization.
Turning to scaffolds, even the most coarse-grained constituent identity scaffold improves the performance of our syntax-agnostic baseline. The non-terminal and non-terminal and parent scaffolds, which use more detailed syntactic representations, improve over this. The greatest improvements come from the scaffold model predicting common non-terminal labels (NP and PP, which are the most common syntactic categories of semantic arguments, vs. others): a 3.6 absolute improvement in F1 over prior work.
Contemporaneously with this work, Peng et al. (2018) proposed a system for joint frame-semantic and semantic dependency parsing. They report results for joint frame and argument identification, and hence cannot be directly compared in Table 1. We evaluated their output for argument identification only; our semi-CRF baseline model exceeds their performance by 1 F1, and our common nonterminal scaffold by 3.
[...] (2015), employing deep architectures, and forgoing the use of any syntax. Later work improves on those results and, in analysis experiments, shows that constraints derived using syntax may further improve performance. Tan et al. (2018) employ a similar approach but use feed-forward networks with self-attention. He et al. (2018a) use a span-based classification to jointly identify and label argument spans.
Our syntax-agnostic semi-CRF baseline model improves on prior work (excluding ELMo), showing again the value of global normalization in semantic structure prediction. We obtain further improvement of 0.8 absolute F1 with the best syntactic scaffold from the frame SRL task. This indicates that a syntactic inductive bias is beneficial even when using sophisticated neural architectures. He et al. (2018a) also provide a setup where initialization was done with deep contextualized embeddings, ELMo (Peters et al., 2018), resulting in 85.5 F1 on the OntoNotes test set. The improvements from ELMo are methodologically orthogonal to syntactic scaffolds.
Since the datasets for learning PropBank semantics and syntactic scaffolds completely overlap, the performance improvement cannot be attributed to a larger training corpus (or, by extension, a larger vocabulary), though that might be a factor for frame SRL.
A syntactic scaffold can match the performance of a pipeline containing carefully extracted syntactic features for semantic prediction (Swayamdipta et al., 2017). This, along with other recent approaches (He et al., 2018b), shows that syntax remains useful, even with strong neural models for SRL.
Coreference. We report the results on three standard scores from the CoNLL evaluation (MUC, B3 and CEAFφ4) and their average F1 in Table 3. Prior competitive coreference resolution systems (Wiseman et al., 2016; Clark and Manning, 2016b,a) all incorporate syntactic information in a pipeline, using features and rules for mention proposals from predicted syntax.
Our baseline is the model described in §6. Similar to the baseline model for frame SRL, and in contrast with prior work, this model does not use any syntax.
We experiment with the best syntactic scaffold from the frame SRL task. We used NP, OTHER, and null as the labels for the common non-terminals scaffold here, since coreferring mentions are rarely prepositional phrases. The syntactic scaffold outperforms the baseline by 0.6 absolute F1. Contemporaneous work proposed a model which takes into account higher-order inference and more aggressive pruning, as well as initialization with ELMo embeddings, resulting in 73.0 average F1. All the above are orthogonal to our approach, and could be incorporated to yield higher gains.

Discussion
To investigate the performance of the syntactic scaffold, we focus on the frame SRL results, where we observed the greatest improvement with respect to a non-syntactic baseline.
We consider a breakdown of the performance by the syntactic phrase types of the arguments, as provided in FrameNet, shown in an accompanying figure. We observe large improvements in the common non-terminals used (NP and PP). However, the phrase type annotations in FrameNet do not correspond exactly to the OntoNotes phrase categories. For instance, FrameNet annotates non-maximal (A) and standard adjective phrases (AJP), while OntoNotes annotations for noun phrases are flat and ignore the underlying adjective phrases. This explains why the syntax-agnostic baseline is able to recover the former while the scaffold is not. Similarly, for frequent frame elements, scaffolding improves performance across the board, as shown in Fig. 3. The largest improvements come for Theme and Goal, which are predominantly realized as noun phrases and prepositional phrases.

Conclusion
We introduced syntactic scaffolds, a multitask learning approach to incorporate syntactic bias into semantic processing tasks. Unlike pipelines and approaches which jointly model syntax and semantics, no explicit syntactic processing is required at runtime. Our method improves the performance of competitive baselines for semantic role labeling on both FrameNet and PropBank, and for coreference resolution. While our focus was on span-based tasks, syntactic scaffolds could be applied in other settings (e.g., dependency and graph representations). Moreover, scaffolds need not be syntactic; we can imagine, for example, semantic scaffolds being used to improve NLP applications with limited annotated data. It remains an open empirical question to determine the relative merits of different kinds of scaffolds and multitask learners, and how they can be most productively combined. Our code is publicly available at https://github.com/swabhs/scaffolding.