Deep Multitask Learning for Semantic Dependency Parsing

We present a deep neural architecture that parses sentences into three semantic dependency graph formalisms. By using efficient, nearly arc-factored inference and a bidirectional-LSTM composed with a multi-layer perceptron, our base system is able to significantly improve the state of the art for semantic dependency parsing, without using hand-engineered features or syntax. We then explore two multitask learning approaches—one that shares parameters across formalisms, and one that uses higher-order structures to predict the graphs jointly. We find that both approaches improve performance across formalisms on average, achieving a new state of the art. Our code is open-source and available at https://github.com/Noahs-ARK/NeurboParser.


Introduction
Labeled directed graphs are a natural and flexible representation for semantics (Copestake et al., 2005; Baker et al., 2007; Surdeanu et al., 2008; Banarescu et al., 2013, inter alia). Their generality over trees, for instance, allows them to represent relational semantics while handling phenomena like coreference and coordination. Even syntactic formalisms are moving toward graphs (de Marneffe et al., 2014). However, full semantic graphs can be expensive to annotate, and efforts are fragmented across competing semantic theories, leading to a limited number of annotations in any one formalism. This makes learning to parse more difficult, especially for powerful but data-hungry machine learning techniques like neural networks.
In this work, we hypothesize that the overlap among theories and their corresponding representations can be exploited using multitask learning (Caruana, 1997), allowing us to learn from more data. We use the 2015 SemEval shared task on Broad-Coverage Semantic Dependency Parsing (SDP; Oepen et al., 2015) as our testbed.
[Figure 1 example sentence: "Last week, shareholders took their money and ran."]
The shared task provides an English-language corpus with parallel annotations for three semantic graph representations, described in §2. Though the shared task was designed in part to encourage comparison between the formalisms, we are the first to treat SDP as a multitask learning problem. As a strong baseline, we introduce a new system that parses each formalism separately (§3). It uses a bidirectional-LSTM composed with a multi-layer perceptron to score arcs and predicates, and has efficient, nearly arc-factored inference. Experiments show it significantly improves on state-of-the-art methods (§3.4).
We then present two multitask extensions (§4.2 and §4.3), with a parameterization and factorization that implicitly models the relationship between multiple formalisms. Experiments show that both techniques improve over our basic model, with an additional (but smaller) improvement when they are combined (§4.5). Our analysis shows that the improvement in unlabeled F1 is greater for the two formalisms that are more structurally similar, and suggests directions for future work. Finally, we survey related work (§5), and summarize our contributions and findings (§6).
Table 1: Graph statistics for in-domain (WSJ, "id") and out-of-domain (Brown corpus, "ood") data. Numbers taken from Oepen et al. (2015).

Broad-Coverage Semantic Dependency Parsing (SDP)
First defined in a SemEval 2014 shared task (Oepen et al., 2014), and then extended by Oepen et al. (2015), the broad-coverage semantic dependency parsing (SDP) task is centered around three semantic formalisms whose annotations have been converted into bilexical dependencies. See Figure 1 for an example. The formalisms come from varied linguistic traditions, but all three aim to capture predicate-argument relations between content-bearing words in a sentence. While at first glance similar to syntactic dependencies, semantic dependencies have distinct goals and characteristics, more akin to semantic role labeling (SRL; Gildea and Jurafsky, 2002) or the abstract meaning representation (AMR; Banarescu et al., 2013). They abstract over different syntactic realizations of the same or similar meaning (e.g., "She gave me the ball." vs. "She gave the ball to me."). Conversely, they attempt to distinguish between different senses even when realized in similar syntactic forms (e.g., "I baked in the kitchen." vs. "I baked in the sun.").
Structurally, they are labeled directed graphs whose vertices are tokens in the sentence. This is in contrast to AMR, whose vertices are abstract concepts with no explicit alignment to tokens, which makes parsing more difficult. Their arc labels encode broadly-applicable semantic relations rather than being tailored to any specific downstream application or ontology. They are not necessarily trees, because a token may be an argument of more than one predicate (e.g., in "John wants to eat," John is both the wanter and the would-be eater). Their analyses may optionally leave out non-content-bearing tokens, such as punctuation or the infinitival "to," or prepositions that simply mark the type of relation holding between other words. But when restricted to content-bearing tokens (including adjectives, adverbs, etc.), the subgraph is connected. In this sense, SDP provides a whole-sentence analysis. This is in contrast to PropBank-style SRL, which gives an analysis of only verbal and nominal predicates. Semantic dependency graphs also tend to have higher levels of nonprojectivity than syntactic trees (Oepen et al., 2014). Sentences with graphs containing cycles have been removed from the dataset by the organizers, so all remaining graphs are directed acyclic graphs. Table 1 summarizes some of the dataset's high-level statistics.
Formalisms. Following the SemEval shared tasks, we consider three formalisms.
The DM (DELPH-IN MRS) representation comes from DeepBank (Flickinger et al., 2012), which are manually-corrected parses from the LinGO English Resource Grammar (Copestake and Flickinger, 2000). LinGO is a head-driven phrase structure grammar (HPSG; Pollard and Sag, 1994) with minimal recursion semantics (Copestake et al., 2005). The PAS (Predicate-Argument Structures) representation is extracted from the Enju Treebank, which consists of automatic parses from the Enju HPSG parser (Miyao, 2006). PAS annotations are also available for the Penn Chinese Treebank (Xue et al., 2005). The PSD (Prague Semantic Dependencies) representation is extracted from the tectogrammatical layer of the Prague Czech-English Dependency Treebank (Hajič et al., 2012). PSD annotations are also available for a Czech translation of the WSJ Corpus. In this work, we train and evaluate only on English annotations.
Of the three, PAS follows syntax most closely, and prior work has found it the easiest to predict. PSD has the largest set of labels, and parsers have significantly lower performance on it (Oepen et al., 2015).

Single-Task SDP
Here we introduce our basic model, in which training and prediction for each formalism are kept completely separate. We also lay out basic notation, which will be reused for our multitask extensions.

Problem Formulation
The output of semantic dependency parsing is a labeled directed graph (see Figure 1). Each arc has a label from a predefined set L, indicating the semantic relation of the child to the head. Given input sentence x, let Y(x) be the set of possible semantic graphs over x. The graph we seek maximizes a score function S:
$$\hat{y} = \arg\max_{y \in Y(x)} S(x, y). \tag{1}$$
We decompose S into a sum of local scores s for local structures (or "parts") p in the graph:
$$S(x, y) = \sum_{p \in y} s(p). \tag{2}$$
For notational simplicity, we omit the dependence of s on x. See Figure 2a for examples of local structures. s is a parameterized function, whose parameters (denoted Θ and suppressed here for clarity) will be learned from the training data ( §3.3). Since we search over every possible labeled graph (i.e., considering each labeled arc for each pair of words), our approach can be considered a graph-based (or all-pairs) method. The models presented in this work all share this common graph-based approach, differing only in the set of structures they score and in the parameterization of the scoring function s. This approach also underlies state-of-the-art approaches to SDP (Martins and Almeida, 2014).

Basic Model
Our basic model is inspired by recent successes in neural arc-factored graph-based dependency parsing (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017; Kuncoro et al., 2016). It borrows heavily from the neural arc-scoring architectures in those works, but decodes with a different algorithm under slightly different constraints.

Basic Structures
Our basic model factors over three types of structures (p in Equation 2):
• predicate, indicating a predicate word, denoted i→·;
• unlabeled arc, representing the existence of an arc from a predicate to an argument, denoted i→j;
• labeled arc, an arc labeled with a semantic role, denoted $i \xrightarrow{\ell} j$.
Here i and j are word indices in a given sentence, and ℓ indicates the arc label. This list corresponds to the most basic structures used by Martins and Almeida (2014). Selecting an output y corresponds precisely to selecting which instantiations of these structures are included.
To ensure the internal consistency of predictions, the following constraints are enforced during decoding:
• i→· if and only if there exists at least one j such that i→j;
• If i→j, then there must be exactly one label ℓ such that $i \xrightarrow{\ell} j$. Conversely, if not i→j, then there must not exist any $i \xrightarrow{\ell} j$.
We also enforce a determinism constraint: certain labels must not appear on more than one arc emanating from the same token. The set of deterministic labels is decided based on their appearance in the training set. Notably, we do not enforce that the predicted graph is connected or spanning. If not for the predicate and determinism constraints, our model would be arc-factored, and decoding could be done for each i, j pair independently. Our structures do overlap though, and we employ AD³ (Martins et al., 2011) to find the highest-scoring internally consistent semantic graph. AD³ is an approximate discrete optimization algorithm based on dual decomposition. It can be used to decode factor graphs over discrete variables when scored structures overlap, as is the case here.
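To make the decoding constraints concrete, the following is a minimal sketch of a checker that tests whether a candidate set of parts is internally consistent. It is an illustration only, not the decoder itself (which uses AD³); the function name, the set-based data layout, and the label names in the usage example are assumptions introduced here.

```python
from collections import Counter

def is_consistent(predicates, arcs, labeled_arcs, deterministic_labels):
    """Check the consistency constraints on one candidate graph.

    predicates: set of head indices i (the predicate parts)
    arcs: set of (i, j) pairs (the unlabeled-arc parts)
    labeled_arcs: set of (i, j, label) triples (the labeled-arc parts)
    deterministic_labels: labels allowed on at most one arc per head
    """
    # Predicate constraint: i is a predicate iff it heads at least one arc.
    if {i for i, _ in arcs} != set(predicates):
        return False
    # Label constraint: each unlabeled arc carries exactly one label, and
    # no labeled arc exists without its unlabeled arc.
    labels_per_arc = Counter((i, j) for i, j, _ in labeled_arcs)
    if set(labels_per_arc) != set(arcs):
        return False
    if any(n != 1 for n in labels_per_arc.values()):
        return False
    # Determinism constraint: a deterministic label appears on at most
    # one arc emanating from the same token.
    per_head = Counter((i, lab) for i, _, lab in labeled_arcs
                       if lab in deterministic_labels)
    return all(n <= 1 for n in per_head.values())

# "John wants to eat": 0=John, 1=wants, 2=to, 3=eat (labels are hypothetical).
arcs = {(1, 0), (1, 3), (3, 0)}
labeled = {(1, 0, "ARG1"), (1, 3, "ARG2"), (3, 0, "ARG1")}
print(is_consistent({1, 3}, arcs, labeled, {"ARG1", "ARG2"}))
```

Because "John" is an argument of both "wants" and "eat", the example graph is not a tree, yet it satisfies every constraint above.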

Basic Scoring
Similarly to Kiperwasser and Goldberg (2016), our model learns representations of tokens in a sentence using a bi-directional LSTM (BiLSTM). Each different type of structure (predicate, unlabeled arc, labeled arc) then shares these same BiLSTM representations, feeding them into a multilayer perceptron (MLP) which is specific to the structure type. We present the architecture slightly differently from prior work, to make the transition to the multitask scenario (§4) smoother. In our presentation, we separate the model into a function φ that represents the input (corresponding to the BiLSTM and the initial layers of the MLPs), and a function ψ that represents the output (corresponding to the final layers of the MLPs), with the scores given by their inner product.
Distributed input representations. Long short-term memory networks (LSTMs) are a variant of recurrent neural networks (RNNs) designed to alleviate the vanishing gradient problem in RNNs (Hochreiter and Schmidhuber, 1997). A bi-directional LSTM (BiLSTM) runs over the sequence in both directions (Schuster and Paliwal, 1997; Graves, 2012).
Given an input sentence x and its corresponding part-of-speech tag sequence, each token is mapped to a concatenation of its word embedding vector and POS tag vector. Two LSTMs are then run in opposite directions over the input vector sequence, outputting the concatenation of the two hidden vectors at each position i: $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ (we omit $h_i$'s dependence on x and its own parameters). $h_i$ can be thought of as an encoder that contextualizes each token conditioning on all of its context, without any Markov assumption. h's parameters are learned jointly with the rest of the model (§3.3); we refer the readers to Cho (2015) for technical details.
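The input representation φ of a predicate structure depends on the representation of one word:
$$\phi(i \to \cdot) = \tanh\left(C_{\text{pred}} h_i + b_{\text{pred}}\right). \tag{3a}$$
For unlabeled arc and labeled arc structures, it depends on both the head and the modifier (but not the label, which is captured in the distributed output representation):
$$\phi(i \to j) = \tanh\left(C_{\text{UA}} [h_i; h_j] + b_{\text{UA}}\right), \tag{3b}$$
$$\phi(i \xrightarrow{\ell} j) = \tanh\left(C_{\text{LA}} [h_i; h_j] + b_{\text{LA}}\right). \tag{3c}$$
Distributed output representations. NLP researchers have found that embedding discrete output labels into a low dimensional real space is an effective way to capture commonalities among them (Srikumar and Manning, 2014; Hermann et al., 2014; FitzGerald et al., 2015, inter alia).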
In neural language models (Bengio et al., 2003;Mnih and Hinton, 2007, inter alia) the weights of the output layer could also be regarded as an output embedding.
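We associate each first-order structure p with a d-dimensional real vector ψ(p) which does not depend on particular words in p. Predicates and unlabeled arcs are each mapped to a single vector:
$$\psi(i \to \cdot) = \psi_{\text{pred}}, \qquad \psi(i \to j) = \psi_{\text{UA}}, \tag{4a}$$
and each label gets a vector:
$$\psi(i \xrightarrow{\ell} j) = \psi_{\text{LA}}(\ell). \tag{4b}$$
Scoring. Finally, we use an inner product to score first-order structures:
$$s(p) = \phi(p) \cdot \psi(p). \tag{5}$$
Figure 3 illustrates our basic model's architecture.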
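As an illustration of this scoring scheme, the sketch below scores a single labeled arc as the inner product of an input representation and a label embedding. It assumes that φ is a single tanh layer over the concatenated head and modifier encoder states; the encoder states, weights, and label names are toy values invented for the example, and the BiLSTM that would normally produce h is not shown.

```python
import math

def affine_tanh(C, b, v):
    """One hidden layer: tanh(C v + b), with C given as a list of rows."""
    return [math.tanh(sum(c * x for c, x in zip(row, v)) + bias)
            for row, bias in zip(C, b)]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def score_labeled_arc(h, i, j, label, C_la, b_la, psi_la):
    """s(p) = phi(p) . psi(p) for the labeled arc from head i to modifier j."""
    phi = affine_tanh(C_la, b_la, h[i] + h[j])  # list concatenation = [h_i; h_j]
    return dot(phi, psi_la[label])

# Toy values (hypothetical): 2-dim encoder states, 2-dim phi and psi.
h = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
C_la = [[0.2, -0.1, 0.4, 0.3], [-0.3, 0.5, 0.1, -0.2]]
b_la = [0.0, 0.1]
psi_la = {"ARG1": [1.0, -1.0], "ARG2": [0.5, 0.5]}
print(score_labeled_arc(h, 1, 0, "ARG1", C_la, b_la, psi_la))
```

Note that the label enters only through the output embedding, so the input representation for a head-modifier pair can be computed once and reused for every label.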

Learning
The parameters of the model are learned using a max-margin objective. Informally, the goal is to learn parameters for the score function so that the gold parse is scored above every incorrect parse by a margin proportional to the cost of the incorrect parse. More formally, let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be the training set consisting of N pairs of sentence $x_i$ and its gold parse $y_i$. Training is then the following ℓ2-regularized empirical risk minimization problem:
$$\min_{\Theta} \; \frac{\lambda}{2} \|\Theta\|^2 + \frac{1}{N} \sum_{i=1}^{N} L(x_i, y_i; \Theta), \tag{6}$$
where Θ is all parameters in the model, and L is the structured hinge loss:
$$L(x_i, y_i; \Theta) = \max_{y \in Y(x_i)} \left\{ S(x_i, y) + c(y, y_i) \right\} - S(x_i, y_i). \tag{7}$$
c is a weighted Hamming distance that trades off between precision and recall (Taskar et al., 2004). Following Martins and Almeida (2014), we encourage recall over precision by using the costs 0.6 for false negative arc predictions and 0.4 for false positives.
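The cost function and the structured hinge loss can be sketched as follows. This is a simplified illustration under stated assumptions: graphs are represented as sets of unlabeled arcs, the cost-augmented maximization is done by brute-force enumeration over a small candidate list rather than with AD³, and the toy arc weights are invented for the example.

```python
def hamming_cost(pred_arcs, gold_arcs, fn_cost=0.6, fp_cost=0.4):
    """Weighted Hamming distance: missed arcs cost more than spurious ones."""
    pred, gold = set(pred_arcs), set(gold_arcs)
    false_negatives = len(gold - pred)
    false_positives = len(pred - gold)
    return fn_cost * false_negatives + fp_cost * false_positives

def structured_hinge(score, candidates, gold):
    """max over candidates of [S(y) + c(y, gold)], minus S(gold)."""
    augmented = max(score(y) + hamming_cost(y, gold) for y in candidates)
    return augmented - score(gold)

# Toy arc-factored scorer (hypothetical weights).
weights = {(1, 0): 1.0, (2, 0): -0.1}
score = lambda y: sum(weights[a] for a in y)

gold = frozenset({(1, 0)})
candidates = [frozenset(), gold, frozenset({(1, 0), (2, 0)})]
print(structured_hinge(score, candidates, gold))
```

In this toy case the loss is positive even though the gold graph has the highest raw score, because the candidate with one spurious arc is not separated from the gold graph by a margin of at least its cost.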

Experiments
We evaluate our basic model on the English dataset from SemEval 2015 Task 18.
Empirical results. As our model uses no explicit syntactic information, the most comparable models to ours are two state-of-the-art closed track systems due to Du et al. (2015) and Almeida and Martins (2015). Du et al. (2015) rely on graph-tree transformation techniques proposed by Du et al. (2014), and apply a voting ensemble to well-studied tree-oriented parsers. Closely related to ours is Almeida and Martins (2015), who used rich, hand-engineered second-order features and AD³ for inference.

Multitask SDP
We introduce two extensions to our single-task model, both of which use training data for all three formalisms to improve performance on each formalism's parsing task. We describe a first-order model, where representation functions are enhanced by parameter sharing while inference is kept separate for each task (§4.2). We then introduce a model with cross-task higher-order structures that uses joint inference across different tasks (§4.3). Both multitask models use AD³ for decoding, and are trained with the same margin-based objective, as in our single-task model.

Problem Formulation
We will use an additional superscript t ∈ T to distinguish the three tasks (e.g., $y^{(t)}$, $\phi^{(t)}$), where T = {DM, PAS, PSD}. Our task is now to predict three graphs $\{y^{(t)}\}_{t \in T}$ for a given input sentence x. Multitask SDP can also be understood as parsing x into a single unified multigraph $y = \bigcup_{t \in T} y^{(t)}$. Similarly to Equations 1-2, we decompose y's score S(x, y) into a sum of local scores for local structures in y, and we seek a multigraph $\hat{y}$ that maximizes S(x, y).

Multitask SDP with Parameter Sharing
A common approach when using BiLSTMs for multitask learning is to share the BiLSTM part of the model across tasks, while training specialized classifiers for each task (Søgaard and Goldberg, 2016). In this spirit, we let each task keep its own specialized MLPs, and explore two variants of our model that share parameters at the BiLSTM level.
The first variant consists of a set of task-specific BiLSTM encoders as well as a common one that is shared across all tasks. We denote it FREDA. FREDA uses a neural generalization of "frustratingly easy" domain adaptation (Daumé III, 2007; Kim et al., 2016), where one augments domain-specific features with a shared set of features to capture global patterns. Formally, let $\{h^{(t)}\}_{t \in T}$ denote the three task-specific encoders. We introduce another encoder $\tilde{h}$ that is shared across all tasks. Then a new set of input functions $\{\phi^{(t)}\}_{t \in T}$ can be defined as in Equations 3a-3c, for example:
$$\phi^{(t)}(i \xrightarrow{\ell} j) = \tanh\left(C^{(t)}_{\text{LA}} \left[h^{(t)}_i; h^{(t)}_j; \tilde{h}_i; \tilde{h}_j\right] + b^{(t)}_{\text{LA}}\right). \tag{8}$$
The predicate and unlabeled arc versions are analogous. The output representations $\{\psi^{(t)}\}$ remain task-specific, and the score is still the inner product between the input representation and the output representation.
The second variant, which we call SHARED, uses only the shared encoder $\tilde{h}$, and does not use task-specific encoders $\{h^{(t)}\}$. It can be understood as a special case of FREDA where the dimensions of the task-specific encoders are 0.
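The relationship between the two variants can be sketched in a few lines. The snippet below only builds the vector that would be fed to a task's MLP for an arc from i to j; the function names and the toy encoder states are assumptions made for illustration, and the encoders themselves are not shown.

```python
def freda_input(h_task, h_shared, i, j):
    """FREDA: concatenate task-specific and shared states of head i and modifier j."""
    return h_task[i] + h_task[j] + h_shared[i] + h_shared[j]

def shared_input(h_shared, i, j):
    """SHARED: use only the shared encoder states."""
    return h_shared[i] + h_shared[j]

# Toy encoder outputs for a 3-token sentence (hypothetical values).
h_dm = [[0.1], [0.2], [0.3]]                         # task-specific, 1-dim
h_all = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]        # shared, 2-dim

print(freda_input(h_dm, h_all, 1, 0))    # 6-dim: 2 task-specific + 4 shared
print(shared_input(h_all, 1, 0))         # 4-dim: shared only

# With 0-dimensional task-specific states, FREDA reduces to SHARED.
h_empty = [[], [], []]
print(freda_input(h_empty, h_all, 1, 0) == shared_input(h_all, 1, 0))
```

Under this view, the task-specific part of the input is free to capture formalism-specific patterns, while the shared part receives training signal from all three tasks.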

Multitask SDP with Cross-Task Structures
In syntactic parsing, higher-order structures have commonly been used to model interactions between multiple adjacent arcs in the same dependency tree (Carreras, 2007; Smith and Eisner, 2008; Martins et al., 2009; Zhang et al., 2014, inter alia). Lluís et al. (2013), in contrast, used second-order structures to jointly model syntactic dependencies and semantic roles. Similarly, we use higher-order structures across tasks instead of within tasks. In this work, we look at interactions between arcs that share the same head and modifier. See Figures 2b and 2c for examples of higher-order cross-task structures.
Higher-order structure scoring. Borrowing from prior work, we introduce a low-rank tensor scoring strategy that, given a higher-order structure p, models interactions between the first-order structures (i.e., arcs) that p is made up of. This approach builds on and extends the parameter sharing techniques in §4.2. It can follow either FREDA or SHARED to get the input representations for first-order structures.
We first introduce basic tensor notation. The order of a tensor is the number of its dimensions. The outer product of two vectors forms a second-order tensor (a matrix), with $[u \otimes v]_{i,j} = u_i v_j$. We denote the inner product of two tensors of the same dimensions by ⟨·, ·⟩, which first takes their element-wise product, then sums all the elements in the resulting tensor.
For example, let p be a labeled third-order structure, including one labeled arc from each of the three different tasks: $p = \{p^{(t)}\}_{t \in T}$. Intuitively, s(p) should capture every pairwise interaction between the three input and three output representations of p. Formally, we want the score function to include a parameter for each term in the outer product of the representation vectors:
$$s(p) = \left\langle W, \bigotimes_{t \in T} \left( \phi^{(t)}(p^{(t)}) \otimes \psi^{(t)}(p^{(t)}) \right) \right\rangle, \tag{9}$$
where W is a sixth-order tensor of parameters. With typical dimensions of representation vectors, this leads to an unreasonably large number of parameters. Following prior work, we upper-bound the rank of W by r to limit the number of parameters (r is a hyperparameter, decided empirically). Using the fact that a tensor of rank at most r can be decomposed into a sum of r rank-1 tensors (Hitchcock, 1927), we reparameterize W to enforce the low-rank constraint by construction:
$$W = \sum_{j=1}^{r} \bigotimes_{t \in T} \left( [U^{(t)}_{\text{LA}}]_{j,:} \otimes [V^{(t)}_{\text{LA}}]_{j,:} \right), \tag{10}$$
where $U^{(t)}_{\text{LA}}, V^{(t)}_{\text{LA}} \in \mathbb{R}^{r \times d}$ are now our parameters. $[\cdot]_{j,:}$ denotes the jth row of a matrix. Substituting this back into Equation 9 and rearranging, the score function s(p) can then be rewritten as:
$$s(p) = \sum_{j=1}^{r} \prod_{t \in T} \left[ U^{(t)}_{\text{LA}} \phi^{(t)}(p^{(t)}) \right]_j \left[ V^{(t)}_{\text{LA}} \psi^{(t)}(p^{(t)}) \right]_j. \tag{11}$$
We refer readers to Kolda and Bader (2009) for mathematical details.
For labeled higher-order structures, our parameters consist of the set of six matrices $\{U^{(t)}_{\text{LA}}\}_{t \in T} \cup \{V^{(t)}_{\text{LA}}\}_{t \in T}$. These matrices are learned directly, and W is never explicitly instantiated.
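The equivalence between the explicit tensor score and its low-rank factored form can be checked numerically on a tiny instance. The sketch below is illustrative only: it uses random toy matrices and vectors (d = 2, r = 3), treats the six matrices and six representation vectors as generic lists, and builds the sixth-order tensor entry by entry purely for verification, which the actual model never does.

```python
import itertools
import random

def low_rank_score(mats, vecs):
    """Factored score: sum over j of the product over k of (row j of mats[k]) . vecs[k]."""
    rank = len(mats[0])
    total = 0.0
    for j in range(rank):
        term = 1.0
        for M, v in zip(mats, vecs):
            term *= sum(m * x for m, x in zip(M[j], v))
        total += term
    return total

def explicit_score(mats, vecs):
    """Inner product of W with the outer product of vecs, W built from rank-1 terms."""
    rank, dim = len(mats[0]), len(vecs[0])
    total = 0.0
    for idx in itertools.product(range(dim), repeat=len(vecs)):
        w_entry = 0.0                      # one entry of the explicit tensor W
        for j in range(rank):
            prod = 1.0
            for M, a in zip(mats, idx):
                prod *= M[j][a]
            w_entry += prod
        x_entry = 1.0                      # matching entry of the outer product
        for v, a in zip(vecs, idx):
            x_entry *= v[a]
        total += w_entry * x_entry
    return total

rng = random.Random(0)
d, r = 2, 3
# Six r x d matrices (U and V for each of three tasks) and six d-dim vectors
# (phi and psi for each of three tasks); all values are random toy data.
mats = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(r)] for _ in range(6)]
vecs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]
print(abs(low_rank_score(mats, vecs) - explicit_score(mats, vecs)) < 1e-9)
```

The factored form needs only 6·r·d multiplications per structure, whereas the explicit tensor has d⁶ entries, which is why W is never materialized.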
Inference and learning. Given a sentence, we use AD³ to jointly decode all three formalisms.⁷ The training objective used for learning is the sum of the losses for individual tasks.

Implementation Details
Each input token is mapped to a concatenation of three real vectors: a pre-trained word vector; a randomly-initialized word vector; and a randomly-initialized POS tag vector.⁸ All three are updated during training. We use 100-dimensional GloVe (Pennington et al., 2014) vectors trained over Wikipedia and Gigaword as pre-trained word embeddings. To deal with out-of-vocabulary words, we apply word dropout (Iyyer et al., 2015) and randomly replace a word w with a special unk symbol with probability $\frac{\alpha}{1 + \#(w)}$, where #(w) is the count of w in the training set.
Models are trained for up to 30 epochs with Adam (Kingma and Ba, 2015), with $\beta_1 = \beta_2 = 0.9$, and initial learning rate $\eta_0 = 10^{-3}$. The learning rate η is annealed at a rate of 0.5 every 10 epochs (Dozat and Manning, 2017). We apply early-stopping based on the labeled F1 score on the development set.⁹ We set the maximum number of iterations of AD³ to 500 and round decisions when it does not converge. We clip the ℓ2 norm of gradients to 1 (Graves, 2013; Sutskever et al., 2014), and we do not use mini-batches. Randomly initialized parameters are sampled from a uniform distribution over $\left[-\sqrt{6/(d_r + d_c)}, \sqrt{6/(d_r + d_c)}\right]$, where $d_r$ and $d_c$ are the number of rows and columns in the matrix, respectively. An ℓ2 penalty of $\lambda = 10^{-6}$ is applied to all weights. Other hyperparameters are summarized in Table 3.
We use the same pruner as Martins and Almeida (2014), where a first-order feature-rich unlabeled pruning model is trained for each task, and arcs with posterior probability below $10^{-4}$ are discarded. We further prune labeled structures that appear fewer than 30 times in the training set. In the development set, about 10% of the arcs remain after pruning, with a recall of around 99%.
Table 3: Hyperparameters used in the experiments.
⁷ Joint inference comes at a cost; our third-order model is able to decode roughly 5.2 sentences (i.e., 15.5 task-specific graphs) per second on a single Xeon E5-2690 2.60GHz CPU.
⁸ There are minor differences in the part-of-speech data provided with the three formalisms. For the basic models, we use the POS tags provided with the respective dataset; for the multitask models, we use the (automatic) POS tags provided with DM.
⁹ Micro-averaged labeled F1 for the multitask models.
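Three of the training settings in this section are simple enough to state as code. The following is a minimal sketch, not the released implementation: the function names are invented here, the value of α used in the usage line is a placeholder (its actual value is a hyperparameter), epochs are assumed to be counted from 0, and the initialization bound is assumed to be the square root of 6/(d_r + d_c).

```python
import math
import random

def unk_probability(count, alpha):
    """Word dropout: replace w by the unk symbol with probability alpha / (1 + #(w))."""
    return alpha / (1.0 + count)

def learning_rate(epoch, eta0=1e-3, decay=0.5, every=10):
    """Step annealing: multiply the rate by `decay` once every `every` epochs."""
    return eta0 * decay ** (epoch // every)

def uniform_init(rows, cols, rng=random):
    """Sample a rows x cols matrix from U(-b, b) with b = sqrt(6 / (rows + cols))."""
    bound = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-bound, bound) for _ in range(cols)]
            for _ in range(rows)]

# Rare words are replaced far more often than frequent ones (alpha is a placeholder).
print(unk_probability(0, alpha=0.25), unk_probability(99, alpha=0.25))
# Learning rate at the start of each 10-epoch block of a 30-epoch run.
print([learning_rate(e) for e in (0, 10, 20)])
```

Because the replacement probability decays with the training count, the unk embedding is trained mostly on rare words, which are the ones most similar to unseen words at test time.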

Experiments
Experimental settings. We compare four multitask variants to the basic model, as well as the two baseline systems introduced in §3.4.
• SHARED1 is a first-order model. It uses a single shared BiLSTM encoder, and keeps the inference separate for each task.
• FREDA1 is a first-order model based on "frustratingly easy" parameter sharing. It uses a shared encoder as well as task-specific ones. The inference is kept separate for each task.
• SHARED3 is a third-order model. It follows SHARED1 and uses a single shared BiLSTM encoder, but additionally employs cross-task structures and inference.
• FREDA3 is also a third-order model. It combines FREDA1 and SHARED3 by using both "frustratingly easy" parameter sharing and cross-task structures and inference.
In addition, we also examine the effects of syntax by comparing our models to the state-of-the-art open track system (Almeida and Martins, 2015).¹⁰
Main results overview. Table 4a compares our models to the best published results (labeled F1 score) on the SemEval 2015 Task 18 in-domain test set. Our basic model improves over all closed track entries in all formalisms. It is even with the best open track system for DM and PSD, but improves on PAS and on average, without making use of any syntax. Three of our four multitask variants further improve over our basic model; SHARED1's differences are not statistically significant. Our best models (SHARED3, FREDA3) outperform the previous state-of-the-art closed track system by 1.7% absolute F1, and the best open track system by 0.9%, without the use of syntax.
We observe similar trends on the out-of-domain test set (Table 4b), with the exception that, on PSD, our best-performing model's improvement over the open-track system of Almeida and Martins (2015) is not statistically significant.
The extent to which we might benefit from syntactic information remains unclear. With automatically generated syntactic parses, Almeida and Martins (2015) manage to obtain more than 1% absolute improvements over their closed track entry, which is consistent with extensive evaluation in prior work, but we leave the incorporation of syntactic trees to future work. Syntactic parsing could be treated as yet another output task, as explored in Lluís et al. (2013) and in the transition-based frameworks of Henderson et al. (2013) and Swayamdipta et al. (2016).
¹⁰ Kanerva et al. (2015) was the winner of the gold track, which overall saw higher performance than the closed and open tracks. Since gold-standard syntactic analyses are not available in most realistic scenarios, we do not include it in this comparison.
Effects of structural overlap. We hypothesized that the overlap between formalisms would enable multitask learning to be effective; in this section we investigate in more detail how structural overlap affected performance. By looking at undirected overlap between unlabeled arcs, we discover that modeling only arcs in the same direction may have been a design mistake.
DM and PAS are more structurally similar to each other than either is to PSD. Table 5 compares the three formalisms in unlabeled F1 score (each formalism's gold-standard unlabeled graph is used as a prediction of each other formalism's gold-standard unlabeled graph). All three formalisms have more than 50% overlap when ignoring arcs' directions, but considering direction, PSD is clearly different; PSD reverses the direction about half of the time it shares an edge with another formalism. A concrete example can be found in Figure 1, where DM and PAS both have an arc from "Last" to "week," while PSD has an arc from "week" to "Last."
We can compare FREDA3 to FREDA1 to isolate the effect of modeling higher-order structures. Table 6 shows performance on the development data in both unlabeled and labeled F1. We can see that FREDA3's unlabeled performance improves on DM and PAS, but degrades on PSD. This supports our hypothesis, and suggests that in future work, a more careful selection of structures to model might lead to further improvements.
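The overlap measure used in this analysis can be sketched as follows: one formalism's arc set is treated as a prediction of another's, and unlabeled F1 is computed with or without regard to direction. The function name and the token-pair representation are assumptions made for illustration; the usage example contains only the single "Last"/"week" arc mentioned above, not the full graphs from Figure 1.

```python
def unlabeled_f1(pred_arcs, gold_arcs, directed=True):
    """Unlabeled F1 of one arc set treated as a prediction of another."""
    if not directed:
        # Ignore direction by mapping each (head, modifier) pair to an unordered pair.
        pred_arcs = {frozenset(a) for a in pred_arcs}
        gold_arcs = {frozenset(a) for a in gold_arcs}
    pred, gold = set(pred_arcs), set(gold_arcs)
    correct = len(pred & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# The one arc discussed in the text: DM has Last -> week, PSD has week -> Last.
dm = {("Last", "week")}
psd = {("week", "Last")}
print(unlabeled_f1(dm, psd, directed=True))
print(unlabeled_f1(dm, psd, directed=False))
```

An arc that is reversed between two formalisms counts as a complete mismatch in the directed comparison but as a match in the undirected one, which is the gap the analysis above attributes to PSD.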

Related Work
We note two important strands of related work.
Graph-based parsing. Graph-based parsing was originally invented to handle non-projective syntax (McDonald et al., 2005; Koo et al., 2010; Martins et al., 2013, inter alia), but has been adapted to semantic parsing (Martins and Almeida, 2014; Kuhlmann, 2014, inter alia). Local structure scoring was traditionally done with linear models over hand-engineered features, but lately, various forms of representation learning have been explored to learn feature combinations (Taub-Tabib et al., 2015; Pei et al., 2015, inter alia). Our work is perhaps closest to those who used BiLSTMs to encode inputs (Kiperwasser and Goldberg, 2016; Kuncoro et al., 2016; Wang and Chang, 2016; Dozat and Manning, 2017; Ma and Hovy, 2016).
Multitask learning in NLP. There have been many efforts in NLP to use joint learning to replace pipelines, motivated by concerns about cascading errors. Collobert and Weston (2008) proposed sharing the same word representation while solving multiple NLP tasks. Zhang and Weiss (2016) use a continuous stacking model for POS tagging and parsing. Ammar et al. (2016) and Guo et al. (2016) explored parameter sharing for multilingual parsing. Johansson (2013) and Kshirsagar et al. (2015) applied ideas from domain adaptation to multitask learning. Successes in multitask learning have been enabled by advances in representation learning as well as earlier explorations of parameter sharing (Ando and Zhang, 2005;Blitzer et al., 2006;Daumé III, 2007).

Conclusion
We showed two orthogonal ways to apply deep multitask learning to graph-based parsing. The first shares parameters when encoding tokens in the input with recurrent neural networks, and the second introduces interactions between output structures across formalisms. Without using syntactic parsing, these approaches outperform even state-of-the-art semantic dependency parsing systems that use syntax. Because our techniques apply to labeled directed graphs in general, they can easily be extended to incorporate more formalisms, semantic or otherwise. In future work we hope to explore cross-task scoring and inference for tasks where parallel annotations are not available. Our code is open-source and available at https://github.com/Noahs-ARK/NeurboParser.