Learning Joint Semantic Parsers from Disjoint Data

We present a new approach to learning a semantic parser from multiple datasets, even when the target semantic formalisms are drastically different and the underlying corpora do not overlap. We handle such “disjoint” data by treating annotations for unobserved formalisms as latent structured variables. Building on state-of-the-art baselines, we show improvements both in frame-semantic parsing and semantic dependency parsing by modeling them jointly.


Introduction
Semantic parsing aims to automatically predict formal representations of meaning underlying natural language, and has been useful in question answering (Shen and Lapata, 2007), text-to-scene generation (Coyne et al., 2012), dialog systems (Chen et al., 2013), and social-network extraction (Agarwal et al., 2014), among others. Various formal meaning representations have been developed, corresponding to different semantic theories (Fillmore, 1982; Palmer et al., 2005; Flickinger et al., 2012; Banarescu et al., 2013). The distributed nature of these efforts results in a set of annotated resources that are similar in spirit, but not strictly compatible. A major axis of structural divergence in semantic formalisms is whether they are based on spans (Baker et al., 1998; Palmer et al., 2005) or dependencies (Surdeanu et al., 2008; Oepen et al., 2014; Banarescu et al., 2013; Copestake et al., 2005, inter alia). Depending on application requirements, either might be most useful in a given situation.
Learning from a union of these resources seems promising, since more data almost always translates into better performance. This is indeed the case for two prior techniques: parameter sharing (FitzGerald et al., 2015; Kshirsagar et al., 2015), and joint decoding across multiple formalisms using cross-task factors that score combinations of substructures from each (Peng et al., 2017). Parameter sharing can be used in a wide range of multitask scenarios, even when there is no data overlap or any similarity between the tasks (Collobert and Weston, 2008; Søgaard and Goldberg, 2016). But techniques involving joint decoding have so far only been shown to work for parallel annotations of dependency-based formalisms, which are structurally very similar to each other (Lluís et al., 2013; Peng et al., 2017). Of particular interest is the approach of Peng et al. (2017), where three kinds of semantic graphs are jointly learned on the same input, using parallel annotations. However, since new annotation efforts cannot be expected to use the same original texts as earlier efforts, the utility of this approach is limited.
We propose an extension to Peng et al.'s formulation which addresses this limitation by considering disjoint resources, each containing only a single kind of annotation. Moreover, we consider structurally divergent formalisms, one dealing with semantic spans and the other with semantic dependencies. We experiment on frame-semantic parsing (Gildea and Jurafsky, 2002; Das et al., 2010), a span-based semantic role labeling (SRL) task (§2.1), and on a dependency-based minimal recursion semantics parsing (DELPH-IN MRS, or DM; Flickinger et al., 2012) task (§2.2). See Figure 1 for an example sentence with gold FrameNet annotations, and author-annotated DM representations.
Our joint inference formulation handles missing annotations by treating the structures that are not present in a given training example as latent variables (§3).[1] Specifically, semantic dependencies are treated as a collection of latent variables when training on FrameNet examples.
Using this latent variable formulation, we present an approach for relating spans and dependencies, by explicitly scoring affinities between pairs of potential spans and dependencies. Because there is a huge number of such pairs, we limit our consideration to only certain pairs; our design is inspired by the head rules of Surdeanu et al. (2008). Further possible span-dependency pairs are pruned using an ℓ1-penalty technique adapted from sparse structure learning (§5). Neural network architectures are used to score frame-semantic structures, semantic dependencies, as well as cross-task structures (§4).
To summarize, our contributions include:
• using a latent variable formulation to extend cross-task scoring techniques to scenarios where datasets do not overlap;
• learning cross-task parts across structurally divergent formalisms; and
• using an ℓ1-penalty technique to prune the space of cross-task parts.
Our approach results in a new state of the art in frame-semantic parsing, improving prior work by 0.8% absolute F1 (§6), and achieves competitive performance on semantic dependency parsing.

Tasks and Related Work
We describe the two tasks addressed in this work, frame-semantic parsing (§2.1) and semantic dependency parsing (§2.2), and discuss how their structures relate to each other (§2.3).

[1] Following past work on support vector machines with latent variables (Yu and Joachims, 2009), we use the term "latent variable," even though the model is not probabilistic.

Frame-Semantic Parsing
Frame-semantic parsing is a span-based task, under which certain words or phrases in a sentence evoke semantic frames. A frame is a group of events, situations, or relationships that all share the same set of participant and attribute types, called frame elements or roles. Gold supervision for frame-semantic parses comes from the FrameNet lexicon and corpus (Baker et al., 1998).
Concretely, for a given sentence x, a frame-semantic parse y consists of:
• a set of targets, each being a short span (usually a single token) that evokes a frame;
• for each target t, the frame f that it evokes; and
• for each frame f, a set of non-overlapping argument spans in the sentence, each argument a = (i, j, r) having a start token index i, end token index j, and role label r.
The lemma and part-of-speech tag of a target comprise a lexical unit (or LU). The FrameNet ontology provides a mapping from an LU to the set of possible frames it could evoke, F. Every frame f ∈ F is also associated with a set of roles, R_f, under this ontology. For example, in Figure 1, the LU "fall.v" evokes the frame MOTION DIRECTIONAL. The roles THEME and PLACE (which are specific to MOTION DIRECTIONAL) are filled by the spans "Only a few books" and "in the reading room," respectively. LOCATIVE RELATION has other roles (PROFILED REGION, ACCESSIBILITY, DEIXIS, etc.) which are not realized in this sentence.
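As an illustration, the ontology lookups described above can be sketched with plain dictionaries. The frame and role inventories below are abbreviated from the Figure 1 example, and the dictionary representation (including the LU name "in.prep") is our own assumption, not the FrameNet release format:

```python
# Abbreviated, hypothetical slice of the FrameNet ontology (Figure 1).
LU_TO_FRAMES = {
    "fall.v": {"MOTION_DIRECTIONAL"},
    "in.prep": {"LOCATIVE_RELATION"},
}
FRAME_TO_ROLES = {
    "MOTION_DIRECTIONAL": {"THEME", "PLACE"},
    "LOCATIVE_RELATION": {"PROFILED_REGION", "ACCESSIBILITY", "DEIXIS"},
}

def candidate_roles(lu):
    """All (frame, role) pairs the LU could license under the ontology."""
    return {(f, r)
            for f in LU_TO_FRAMES.get(lu, set())
            for r in FRAME_TO_ROLES.get(f, set())}
```

In the model below, this lookup is what bounds the set of candidate frames F and roles R_f for each target.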
In this work, we assume gold targets and LUs are given, and parse each target independently, following the literature (Johansson and Nugues, 2007; FitzGerald et al., 2015; Yang and Mitchell, 2017; Swayamdipta et al., 2017, inter alia). Moreover, following Yang and Mitchell (2017), we perform frame and argument identification jointly. Most prior work has enforced the constraint that a role may be filled by at most one argument span, but following Swayamdipta et al. (2017) we do not impose this constraint, requiring only that arguments for the same target do not overlap.

Semantic Dependency Parsing
Broad-coverage semantic dependency parsing (SDP; Oepen et al., 2014, 2015, 2016) represents sentential semantics with labeled bilexical dependencies. The SDP task mainly focuses on three semantic formalisms, which have been converted to dependency graphs from their original annotations. In this work, we focus only on the DELPH-IN MRS (DM) formalism.
Each semantic dependency corresponds to a labeled, directed edge between two words. A single token is also designated as the top of the parse, usually indicating the main predicate in the sentence. For example, in Figure 1, the left-most arc has head "Only", dependent "few", and label arg1. In semantic dependencies, the head of an arc is analogous to the target in frame semantics, the destination corresponds to the argument, and the label corresponds to the role. The same set of labels is available for all arcs, in contrast to the frame-specific roles in FrameNet.

Spans vs. Dependencies
Early semantic role labeling was span-based (Gildea and Jurafsky, 2002; Toutanova et al., 2008, inter alia), with spans corresponding to syntactic constituents. But, as in syntactic parsing, there are sometimes theoretical or practical reasons to prefer dependency graphs. To this end, Surdeanu et al. (2008) devised heuristics based on syntactic head rules (Collins, 2003) to transform PropBank (Palmer et al., 2005) annotations into dependencies. Hence, for PropBank at least, there is a very direct connection (through syntax) between spans and dependencies.
For many other semantic representations, such a direct relationship might not be present. Some semantic representations are designed as graphs from the start (Hajič et al., 2012; Banarescu et al., 2013), and have no gold alignment to spans. Conversely, some span-based formalisms are not annotated with syntax (Baker et al., 1998; He et al., 2015),[3] and so head rules would require using (noisy and potentially expensive) predicted syntax.

Inspired by the head rules of Surdeanu et al. (2008), we design cross-task parts without relying on gold or predicted syntax (which may be either unavailable or error-prone) or on heuristics.

[3] In FrameNet, phrase types of arguments and their grammatical functions in relation to their targets have been annotated. But in order to apply head rules, the internal structure of arguments (or at least their semantic heads) would also require syntactic annotations.

Model
Given an input sentence x, and a target t with its LU ℓ, denote the set of valid frame-semantic parses (§2.1) as Y(x, t, ℓ), and the set of valid semantic dependency parses as Z(x). We learn a parameterized function S that scores candidate parses. Our goal is to jointly predict a frame-semantic parse and a semantic dependency graph by selecting the highest-scoring candidates:

(ŷ, ẑ) = argmax_{y ∈ Y(x,t,ℓ), z ∈ Z(x)} S(y, z, x, t, ℓ).  (1)

The overall score S can be decomposed into the sum of a frame SRL score S_f, a semantic dependency score S_d, and a cross-task score S_c:

S(y, z, x, t, ℓ) = S_f(y, x, t, ℓ) + S_d(z, x) + S_c(y, z, x, t, ℓ).  (2)

S_f and S_c require access to the target and LU, in addition to x, but S_d does not. For clarity, we omit the dependence on the input sentence, target, and lexical unit whenever the context is clear. Below we describe how each of the scores is computed based on the individual parts that make up the candidate parses.
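A minimal sketch of this decomposition, with part-factored stub scores standing in for the learned functions (the part and score names are illustrative; §4 gives the actual parameterization):

```python
def total_score(y, z, s_f, s_d, s_c, cross_parts):
    """S(y, z) = S_f(y) + S_d(z) + S_c(y, z): each global score is a sum
    of local part scores, and a cross-task part contributes only when
    both its argument part and its arc part appear in the candidates."""
    return (sum(s_f(p) for p in y)                 # frame SRL parts
            + sum(s_d(p) for p in z)               # dependency parts
            + sum(s_c(c) for c in cross_parts
                  if c[0] in y and c[1] in z))     # cross-task parts
```

Joint decoding then searches for the (y, z) pair maximizing this sum.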
Frame SRL score. The score of a frame-semantic parse consists of:
• the score of a predicate part, s_f(p), where each predicate is defined as a combination of a target t, the associated LU ℓ, and the frame evoked by the LU, f ∈ F; and
• the scores of argument parts, s_f(a), each associated with a token span and a semantic role from R_f.
Together, this results in a set of frame-semantic parts of size O(n² |F| |R_f|). The score for a frame-semantic structure y is the sum of the local scores of the parts in y:

S_f(y) = Σ_{p ∈ y} s_f(p).  (3)

The computation of s_f is described in §4.2.

Semantic dependency score. Following Martins and Almeida (2014), we consider three types of parts in a semantic dependency graph: semantic heads, unlabeled semantic arcs, and labeled semantic arcs. Analogous to Equation 3, the score for a dependency graph z is the sum of local scores:

S_d(z) = Σ_{p ∈ z} s_d(p).  (4)

The computation of s_d is described in §4.3.

Cross-task score. In addition to task-specific parts, we introduce a set C of cross-task parts. Each cross-task part relates an argument part from y to an unlabeled dependency arc from z. Based on the head rules described in §2.3, we consider unlabeled arcs from the target to any token inside the span. Intuitively, an argument in FrameNet would be converted into a dependency from its target to the semantic head of its span; since we do not know the semantic head of the span, we consider all tokens in the span as potential modifiers of the target. Figure 2 shows examples of cross-task parts. The cross-task score is given by

S_c(y, z) = Σ_{c = (a, u) ∈ C : a ∈ y, u ∈ z} s_c(c).  (5)

The computation of s_c is described in §4.4.
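The cross-task part construction can be sketched as follows (a simplification: the real model also prunes candidate spans and arcs first, §6):

```python
def cross_task_parts(arg_spans, target_idx):
    """Pair each candidate argument span (i, j, r) with every unlabeled
    arc from the target to a token inside the span, since any token in
    the span could be its semantic head."""
    return [((i, j, r), (target_idx, k))
            for (i, j, r) in arg_spans
            for k in range(i, j + 1)]
```

With O(n) surviving spans and arcs, this enumeration is the source of the O(n²) cross-task parts discussed in §5.2.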
In contrast to previous work (Lluís et al., 2013; Peng et al., 2017), where there are parallel annotations for all formalisms, our input sentences contain only one of the two: either span-based frame SRL annotations, or semantic dependency graphs from DM. To handle missing annotations, we treat the semantic dependencies z as latent when decoding frame-semantic structures. Because the DM dataset we use does not have target annotations, we do not use latent variables for frame-semantic structures when predicting semantic dependency graphs; the parsing problem there reduces to ẑ = argmax_{z ∈ Z(x)} S_d(z), in contrast with Equation 1.

Parameterizations of Scores
This section describes the parameterization of the scoring functions from §3. At a very high level: we learn contextualized token and span vectors using a bidirectional LSTM (biLSTM; Graves, 2012) and multilayer perceptrons (MLPs) (§4.1); we learn lookup embeddings for LUs, frames, roles, and arc labels; and to score a part, we combine the relevant representations into a single scalar score using a (learned) low-rank multilinear mapping. Scoring frames and arguments is detailed in §4.2, that of dependency structures in §4.3, and §4.4 shows how to capture interactions between arguments and dependencies. All parameters are learned jointly, through the optimization of a multitask objective (§5).
Tensor notation. The order of a tensor is the number of its dimensions: an order-2 tensor is a matrix and an order-1 tensor is a vector. Let ⊗ denote the tensor product; the tensor product of two order-2 tensors A and B yields an order-4 tensor where (A ⊗ B)_{i,j,k,l} = A_{i,j} B_{k,l}. We use ⟨·, ·⟩ to denote inner products.
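The notation can be checked numerically with NumPy, where `np.tensordot` with `axes=0` computes the tensor product:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # an order-2 tensor (a matrix)
B = np.arange(4.0).reshape(2, 2)

# (A ⊗ B)[i, j, k, l] = A[i, j] * B[k, l]: an order-4 tensor.
T = np.tensordot(A, B, axes=0)
assert T.shape == (2, 3, 2, 2)
assert T[1, 2, 0, 1] == A[1, 2] * B[0, 1]

# Inner product of same-shape tensors: the sum of elementwise products.
inner = float(np.sum(A * A))
```

These two operations are all that is needed to read the multilinear scoring functions below.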

Token and Span Representations
The representations of tokens and spans are formed using biLSTMs followed by MLPs.
Contextualized token representations. Each token in the input sentence x is mapped to an embedding vector. Two LSTMs (Hochreiter and Schmidhuber, 1997) are run in opposite directions over the input vector sequence. We use the concatenation of the two hidden representations at each position i as a contextualized word embedding for each token:

h_i = [h_i^→ ; h_i^←].

Span representations. Following Lee et al. (2017), span representations are computed based on boundary word representations and discrete length and distance features. Concretely, given a target t and an associated argument a = (i, j, r) with boundary indices i and j, we compute three features φ_t(a) based on the length of a and the distances from i and j to the start of t. We concatenate the token representations at a's boundaries with the discrete features φ_t(a), and then use a two-layer tanh-MLP to compute the span representation:

g_arg(a) = MLP_arg([h_i ; h_j ; φ_t(a)]).

The target representation g_tgt(t) is similarly computed using a separate MLP_tgt, with a length feature but no distance features.
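A NumPy sketch of the span representation computation. The weight shapes are illustrative assumptions, and the discrete features are shown raw for simplicity, whereas the paper applies a base-2 logarithm to them (§6.3):

```python
import numpy as np

def span_representation(h, i, j, t_start, W1, b1, W2, b2):
    """g_arg(a) = MLP_arg([h_i; h_j; phi_t(a)]): boundary token vectors
    concatenated with discrete length/distance features, fed through a
    two-layer tanh MLP."""
    phi = np.array([j - i + 1,          # span length
                    abs(i - t_start),   # start boundary to target start
                    abs(j - t_start)],  # end boundary to target start
                   dtype=float)
    x = np.concatenate([h[i], h[j], phi])
    return np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2)
```

Here `h` is the (n × d) matrix of biLSTM outputs; the target representation uses a separate MLP of the same shape with only the length feature.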

Frame and Argument Scoring
As defined in §3, the representation for a predicate part incorporates representations of a target span, the associated LU, and the frame evoked by the LU. The score for a predicate part is given by a multilinear mapping:

s_f(p) = ⟨W, g_tgt(t) ⊗ g_lu(ℓ) ⊗ g_fr(f)⟩,  (9)

where W is a low-rank order-3 tensor of learned parameters, and g_fr(f) and g_lu(ℓ) are learned lookup embeddings for the frame and LU.
A candidate argument consists of a span and its role label, which in turn depends on the frame, target, and LU. Hence the score for an argument part a = (i, j, r) is given by extending the definitions from Equation 9:

s_f(a) = ⟨W ⊗ U, g_tgt(t) ⊗ g_lu(ℓ) ⊗ g_fr(f) ⊗ g_arg(a) ⊗ g_role(r)⟩,  (10)

where U is a low-rank order-2 tensor of learned parameters and g_role(r) is a learned lookup embedding of the role label.
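The low-rank trick can be sketched concretely: factoring an order-3 tensor as a sum of R outer products reduces the score to three matrix-vector products, without ever materializing the full tensor (dimensions and names here are illustrative, not the paper's):

```python
import numpy as np

def trilinear_score(g_a, g_b, g_c, Ua, Ub, Uc):
    """<W, g_a ⊗ g_b ⊗ g_c> with W = sum_r Ua[r] ⊗ Ub[r] ⊗ Uc[r],
    so rank(W) is bounded by the number of factor rows R."""
    return float(np.sum((Ua @ g_a) * (Ub @ g_b) * (Uc @ g_c)))
```

Equivalence with the explicit tensor contraction follows by expanding the sum over r, which is what the test below verifies numerically.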

Dependency Scoring
Local scores for dependencies are implemented with two-layer tanh-MLPs, followed by a final linear layer reducing the representation to a single scalar score. For example, let u = i→j denote an unlabeled arc (ua). Its score is

s_d(u) = w_ua · MLP_ua([h_i ; h_j]),

where w_ua is a vector of learned weights. The scores for other types of parts are computed similarly, but with separate MLPs and weights.

Cross-Task Part Scoring
As shown in Figure 2, each cross-task part c consists of two first-order parts: a frame argument part a and an unlabeled dependency part u. The score for a cross-task part incorporates both:

s_c(c) = ⟨V, g_arg(a) ⊗ g_ua(u)⟩,

where V is a low-rank order-2 tensor of parameters and g_ua(u) is the MLP representation of the unlabeled arc from §4.3. Following previous work (Lei et al., 2014; Peng et al., 2017), we construct the parameter tensors W, U, and V so as to upper-bound their ranks.

Training and Inference
All parameters from the previous sections are trained using a max-margin training objective (§5.1). For inference, we use a linear programming procedure, with a sparsity-promoting penalty term for speeding it up (§5.2).

Max-Margin Training
Let y* denote the gold frame-semantic parse, and let δ(y, y*) denote the cost of predicting y with respect to y*. We optimize the latent structured hinge loss (Yu and Joachims, 2009), which gives a subdifferentiable upper bound on δ:

L = max_{y ∈ Y, z ∈ Z} [S(y, z) + δ(y, y*)] − max_{z ∈ Z} S(y*, z).  (13)

Following Martins and Almeida (2014), we use a weighted Hamming distance as the cost function, where, to encourage recall, we use a cost of 0.6 for false negative predictions and 0.4 for false positives. Equation 13 can be evaluated by applying the same max-decoding algorithm twice: once with cost-augmented inference (Crammer et al., 2006), and once more keeping y* fixed. Training then aims to minimize the average loss over all training instances.

Another potential approach to training a model on disjoint data would be to marginalize out the latent structures and optimize the conditional log-likelihood (Naradowsky et al., 2012). Although max-decoding and computing marginals are both NP-hard in general graphical models, there are more efficient off-the-shelf implementations for approximate max-decoding; hence, we adopt a max-margin formulation.
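A sketch of this loss on a toy search space. Both maximizations are done here by brute-force enumeration over hypothetical part sets, whereas the paper solves them with AD³:

```python
def weighted_hamming(y, gold, fn=0.6, fp=0.4):
    """delta(y, y*): false negatives cost 0.6, false positives 0.4."""
    return fn * len(gold - y) + fp * len(y - gold)

def latent_hinge(ys, zs, gold_y, score):
    """max_{y,z} [S(y,z) + delta(y, y*)] - max_z S(y*, z):
    the latent z is maximized over on both sides."""
    augmented = max(score(y, z) + weighted_hamming(y, gold_y)
                    for y in ys for z in zs)
    gold_best = max(score(gold_y, z) for z in zs)
    return augmented - gold_best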

Inference
We formulate the maximizations in Equation 13 as 0-1 integer linear programs and use AD³ (Martins et al., 2011) to solve them. We only enforce a non-overlapping constraint when decoding FrameNet structures, so that the argument identification subproblem can be efficiently solved by a dynamic program (Kong et al., 2016; Swayamdipta et al., 2017). When decoding semantic dependency graphs, we enforce the determinism constraint (Flanigan et al., 2014), under which certain labels may appear on at most one arc outgoing from the same token.

Inference speedup by promoting sparsity. As discussed in §3, even after pruning, the number of within-task parts is linear in the length of the input sentence, so the number of cross-task parts is quadratic. This leads to potentially very slow inference. We address this problem by imposing an ℓ1 penalty on the cross-task part scores, adding

λ Σ_{c ∈ C} |s_c(c)|

to the training objective, where λ is a hyperparameter, set to 0.01 as a practical tradeoff between efficiency and development set performance. Whenever the score for a cross-task part is driven to zero, that part no longer needs to be considered during inference.
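A sketch of the penalty and the resulting pruning (the tolerance `eps` is our assumption; in practice the ℓ1 penalty drives scores exactly to zero):

```python
def l1_penalty(part_scores, lam=0.01):
    """The regularizer lambda * sum_c |s_c(c)| added to the objective."""
    return lam * sum(abs(s) for s in part_scores.values())

def active_parts(part_scores, eps=1e-12):
    """Cross-task parts whose scores were driven to zero contribute
    nothing to the objective and can be skipped at inference; the
    feasible parse space itself is unchanged."""
    return {c: s for c, s in part_scores.items() if abs(s) > eps}
```

Only the surviving parts are handed to the ILP solver, shrinking the inference problem.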
It is important to note that by promoting sparsity this way, we do not prune out any candidate solutions. We are instead encouraging fewer terms in the scoring function, which leads to smaller, faster inference problems even though the space of feasible parses is unchanged. This technique is closely related to a line of work on estimating the structure of sparse graphical models (Yuan and Lin, 2007; Friedman et al., 2008), where an ℓ1 penalty is applied to the inverse covariance matrix in order to induce a smaller number of conditional dependencies between variables. To the best of our knowledge, we are the first to apply this technique to the output of neural scoring functions.

Experiments
Datasets. Our model is evaluated on two different releases of FrameNet: FN 1.5 and FN 1.7, using the splits from Swayamdipta et al. (2017). Following Swayamdipta et al. (2017) and Yang and Mitchell (2017), each target annotation is treated as a separate training instance. We also include as training data the exemplar sentences, each annotated for a single target, as they have been reported to improve performance (Kshirsagar et al., 2015; Yang and Mitchell, 2017). For semantic dependencies, we use the English DM dataset from the SemEval 2015 Task 18 closed track (Oepen et al., 2015). DM contains instances from the WSJ corpus for training, and both in-domain (id) and out-of-domain (ood) test sets, the latter from the Brown corpus. Table 1 summarizes the sizes of the datasets.
Baselines. We compare the FN performance of our joint learning model (FULL) to two baselines:
• BASIC: a single-task frame SRL model, trained using a structured hinge objective.
• NOCTP: a joint model without cross-task parts. It demonstrates the effect of sharing parameters in word embeddings and LSTMs (as in FULL). It does not use latent semantic dependency structures, and aims to minimize the sum of training losses from both tasks.
We also compare semantic dependency parsing performance against the single-task model by Peng et al. (2017), denoted as NeurboParser (BASIC). To ensure fair comparison with our FULL model, we made several modifications to their implementation (§6.3). We observed performance improvements from our reimplementation, which can be seen in Table 5.

Pruning strategies. For frame SRL, we discard argument spans longer than 20 tokens (Swayamdipta et al., 2017). We further pretrain an unlabeled model and prune spans with posteriors lower than 1/n², with n being the input sentence length. For semantic dependencies, we generally follow Martins and Almeida (2014), replacing their feature-rich pruner with neural networks. We observe that O(n) spans/arcs remain after pruning, with around 96% FN development recall, and more than 99% for DM.

Empirical Results

FN parsing results. Table 2 compares our full frame-semantic parsing results to previous systems. Among them, Täckström et al. (2015) and Roth (2016) implement a two-stage pipeline and use the method from Hermann et al. (2014) to predict frames. FitzGerald et al. (2015) uses the same pipeline formulation, but improves the frame identification of Hermann et al. (2014) with better syntactic features. open-SESAME (Swayamdipta et al., 2017) uses predicted frames from FitzGerald et al. (2015), and improves argument identification using a softmax-margin segmental RNN; they observe further improvements from product-of-experts ensembles (Hinton, 2002).
The best published FN 1.5 results are due to Yang and Mitchell (2017). Their relational model (REL) formulates argument identification as a sequence of local classifications. They additionally introduce an ensemble method (denoted ALL) to integrate the predictions of a sequential CRF, and use a linear program to jointly predict frames and arguments at test time. As shown in Table 2, our single-model performance outperforms their REL model, and is on par with their ALL model. For a fair comparison, we build an ensemble (FULL, 2×) by separately training two models, differing only in random seeds, and averaging their part scores. Our ensembled model outperforms the previous best results by 0.8% absolute.
Table 3 compares frame identification accuracy on the FN 1.5 test set, with ambiguous lexical units determined from the official frame directories. The Ambiguous setting compares lexical units with more than one possible frame.
Our approach improves over all previous models under both settings, demonstrating a clear benefit from joint learning. We observe similar trends on FN 1.7, both for full structure extraction and for frame identification only (Table 4). FN 1.7 extends FN 1.5 with more consistent annotations. Its test set is different from that of FN 1.5, so the results are not directly comparable to Table 2. We are the first to report frame-semantic parsing results on FN 1.7, and we encourage future efforts to do so as well.
Semantic dependency parsing results. For both frame-semantic structures and semantic dependency graphs, the FULL model outperforms the baselines by more than 0.6% absolute F1 under both settings (Table 5). Previous state-of-the-art results on DM are due to the joint learning model of Peng et al. (2017), denoted as NeurboParser (FREDA3). They adopted a multitask learning approach, jointly predicting three different parallel semantic dependency annotations. Our FULL model's in-domain test performance is on par with FREDA3, and improves over it by 0.6% absolute F1 on out-of-domain test data. Our ensemble of two FULL models achieves a new state of the art in both in-domain and out-of-domain test performance.

Analysis
Error type breakdown. Similarly to He et al. (2017), we categorize the prediction errors made by the BASIC and FULL models in Table 6. Entirely missing an argument accounts for most of the errors for both models, but we observe fewer errors in this category from FULL than from BASIC. FULL tends to predict more arguments in general, including more incorrect arguments.
Since candidate roles are determined by frames, frame and role errors are highly correlated. Therefore, we also show the role errors when frames are correctly predicted (parenthesized numbers in the second row). When a predicted argument span matches a gold span, predicting the semantic role is less challenging. Role errors account for only around 13% of all errors, and half of them are due to mispredictions of frames.

Performance by argument length. Figure 3 plots development precision and recall of both BASIC and FULL against binned argument lengths. We observe two trends: (a) FULL tends to predict longer arguments (averaging 3.2 tokens) compared to BASIC (averaging 2.9), while keeping similar precision; and (b) the recall improvement in FULL mainly comes from arguments longer than 4.

Implementation Details
Our implementation is based on DyNet (Neubig et al., 2017). We use part-of-speech tags and lemmas predicted with NLTK (Bird et al., 2009). Parameters are optimized with stochastic subgradient descent for up to 30 epochs, with the ℓ2 norms of gradients clipped to 1. We use 0.33 as the initial learning rate, and anneal it at a rate of 0.5 every 10 epochs. Early stopping is applied based on FN development F1. We apply a logarithm with base 2 to all discrete features, e.g., log2(d + 1) for a distance feature with value d. To speed up training, we randomly sample a 35% subset of the FN exemplar instances each epoch.

Hyperparameters. Each input token is represented as the concatenation of a word embedding vector, a learned lemma vector, and a learned part-of-speech vector, all updated during training. We use 100-dimensional GloVe vectors (Pennington et al., 2014) to initialize word embeddings. We apply word dropout (Iyyer et al., 2015), randomly replacing a word w with a special UNK symbol with probability α / (1 + #(w)), where #(w) is the count of w in the training set.
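The word dropout rule can be sketched as follows (the UNK token string and the use of Python's `random` module are our assumptions, not details from the paper):

```python
import random
from collections import Counter

def word_dropout(tokens, counts, alpha=1.0, rng=random):
    """Replace w with UNK with probability alpha / (1 + #(w)):
    rarer training words are replaced more often, so the model
    learns to handle unknown words at test time."""
    return [w if rng.random() >= alpha / (1.0 + counts[w]) else "<UNK>"
            for w in tokens]
```

Frequent words are thus almost never dropped, while hapax words are dropped with probability close to α/2.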

Conclusion
We presented a novel multitask approach to learning semantic parsers from disjoint corpora with structurally divergent formalisms. We showed how joint learning and prediction can be done with scoring functions that explicitly relate spans and dependencies, even when they are never observed together in the data. We handled the resulting inference challenges with a novel adaptation of graphical model structure learning to the deep learning setting. We raised the state of the art on DM and FrameNet parsing by learning from both, despite their structural differences and non-overlapping data. While our selection of factors is specific to spans and dependencies, our general techniques could be adapted to work with other combinations of structured prediction tasks. We have released our implementation at https://github.com/Noahs-ARK/NeurboParser.

Figure 1: An example sentence from the FrameNet 1.5 corpus, shown with an author-annotated DM semantic dependency graph (above) and frame-semantic annotation (below). Two more gold frames (and their arguments) have been omitted for space.

Figure 2: An example of cross-task parts from the FrameNet 1.5 development set. We enumerate all unlabeled semantic dependencies from the first word of the target (includes) to any token inside the span. The red bolded arc indicates the prediction of our model.

We follow the default parameter initialization procedure of DyNet, and an ℓ2 penalty of 10^-6 is applied to all weights. See Table 7 for other hyperparameters.

Table 1: Number of instances in the datasets.

Table 2: FN 1.5 full structure extraction test performance. † denotes models jointly predicting frames and arguments; other systems implement two-stage pipelines and use the algorithm by Hermann et al. (2014) to predict frames. K× denotes a product-of-experts ensemble of K models. * Ensembles a sequential tagging CRF and a relational model. Bold font indicates best performance among all systems.

Table 3: Frame identification accuracy on the FN 1.5 test set. Ambiguous evaluates only on lexical units having more than one possible frame. † denotes joint frame and argument identification; bold font indicates best performance.

Table 5: Labeled parsing performance in F1 score for DM semantic dependencies. id denotes in-domain WSJ test data, and ood denotes out-of-domain Brown corpus test data. Bold font indicates best performance.

[13] The ambiguous frame identification results by Yang and Mitchell (2017) and Hartmann et al. (2017) are 75.7 and 73.8, respectively. Their ambiguous lexical unit sets are different from the one extracted from the official frame directory, and thus the results are not comparable to those in Table 3.

Table 6: Percentage of errors made by the BASIC and FULL models on the FN 1.5 development set. Parenthesized numbers show the percentage of role errors when frame predictions are correct.

Table 7: Hyperparameters used in the experiments. Parenthesized numbers indicate those used by the pretrained pruners.

Modifications to Peng et al. (2017). To ensure fair comparisons, we note two implementation modifications to Peng et al.'s basic model: we use a more recent version (2.0) of the DyNet toolkit, and we use 50-dimensional lemma embeddings instead of their 25-dimensional randomly-initialized learned word embeddings.