Simpler but More Accurate Semantic Dependency Parsing

While syntactic dependency annotations concentrate on the surface or functional structure of a sentence, semantic dependency annotations aim to capture between-word relationships that are more closely related to the meaning of a sentence, using graph-structured representations. We extend the LSTM-based syntactic parser of Dozat and Manning (2017) to train on and generate these graph structures. The resulting system on its own achieves state-of-the-art performance, beating the previous, substantially more complex state-of-the-art system by 0.6% labeled F1. Adding linguistically richer input representations pushes the margin even higher, allowing us to beat it by 1.9% labeled F1.


Introduction
Syntactic dependency parsing is arguably the most popular method for automatically extracting the low-level relationships between words in a sentence for use in natural language understanding tasks. However, typical syntactic dependency frameworks are restricted to tree structures, which limits the number and types of relationships that they can capture. For example, in the sentence Mary wants to buy a book, the word Mary is the subject of both want and buy. Either or both relationships could be useful in a downstream task, but a tree-structured representation of this sentence (as in Figure 1a) can only represent one of them.
The 2014 SemEval shared task on Broad-Coverage Semantic Dependency Parsing (Oepen et al., 2014) introduced three new dependency representations that do away with the assumption of strict tree structure in favor of a richer graph-structured representation, allowing them to capture more linguistic information about a sentence. This opens up the possibility of providing more useful information to downstream tasks (Reddy et al., 2017; Schuster et al., 2017), but increases the difficulty of automatically extracting that information, since most previous work on parsing has focused on generating trees.
Dozat and Manning (2017) developed a successful syntactic dependency parsing system with few task-specific sources of complexity. In this paper, we extend that system so that it can train on and produce the graph-structured data of semantic dependency schemes. We also consider simple extensions of the system that are likely to improve on this baseline, including giving the system access to lemma embeddings and building in a character-level word embedding model. Finally, we briefly examine some of the design choices of that architecture, in order to assess which components are necessary for achieving the highest accuracy and which have little impact on final performance.

Semantic dependencies
The 2014 SemEval shared task (Oepen et al., 2014, 2015) introduced three new semantic dependency formalisms, applied to the Penn Treebank (shown in Figure 1, compared to Universal Dependencies (Nivre et al., 2016)): DELPH-IN MRS, or DM (Flickinger et al., 2012; Oepen and Lønning, 2006); Predicate-Argument Structures, or PAS (Miyao and Tsujii, 2004); and Prague Semantic Dependencies, or PSD (Hajic et al., 2012). Whereas syntactic dependencies generally annotate functional relationships between words (such as subject and object), semantic dependencies aim to reflect semantic relationships (such as agent and patient; cf. semantic role labeling (Gildea and Jurafsky, 2002)). Finally, the SemEval semantic dependency schemes are directed acyclic graphs (DAGs) instead of trees, allowing them to annotate function words as being heads without lengthening paths between content words (as in Figure 1b).
Figure 1: Comparison between syntactic and semantic dependency schemes.

Related work
Our approach to semantic dependency parsing is primarily inspired by the success of recent BiLSTM-based parsers such as Dozat and Manning (2017) at syntactic dependency parsing and Peng et al. (2017) at semantic dependency parsing. In the syntactic parsers, parsing involves first running a multilayer bidirectional LSTM over word and part-of-speech tag embeddings. Parsing is then done using directly-optimized self-attention over recurrent states to attend to each word's head (or heads), and labeling is done with an analogous multi-class classifier.
Peng et al.'s (2017) system uses a max-margin classifier on top of a BiLSTM, with the score for each graph coming from several sources. First, it scores each word as either taking dependents or not. Second, for each ordered pair of words, it scores the arc from the first word to the second. Lastly, it scores each possible labeled arc between the two words. The graph that maximizes these scores may not be consistent (it may contain an edge coming from a non-predicate, for example), so they enforce hard constraints in order to prune away invalid semantic graphs. Decisions are not independent, so in order to find the highest-scoring graph that follows these constraints, they use the AD$^3$ decoding algorithm (Martins et al., 2011).
Dozat and Manning's (2017) approach to syntactic dependency parsing is similar, but avoids the possibility of generating invalid trees by fully factorizing the system. Rather than summing the scores from multiple modules and then finding the valid structure that maximizes that sum, the system makes parsing and labeling decisions sequentially, choosing the labels for each edge only after the edges in the tree have been finalized by an MST algorithm.
Wang et al. (2018) take a different approach in their recent work, using a transition-based parser built on stack-LSTMs (Dyer et al., 2015). They extend Choi and McCallum's (2013) transition system for producing non-projective trees so that it can produce arbitrary DAGs, and they modify the stack-LSTM architecture slightly to make the network more powerful.

Basic approach
We can formulate the semantic dependency parsing task as labeling each edge in a directed graph, with null being the label given to pairs with no edge between them. Using only one module that labels each edge in this way would be an unfactorized approach. We can, however, factorize it into two modules: one that predicts whether or not a directed edge $(w_j, w_i)$ exists between two words, and another that predicts the best label for each potential edge.
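As a minimal sketch of the two formulations, the unfactorized view can be stored as one label per ordered word pair, with a null label wherever no edge exists, while the factorized view separates edge existence from edge labels. The helper names, the `<null>` string, and the label names below are illustrative assumptions and are not taken from any of the SemEval datasets.

```python
NULL = "<null>"  # hypothetical name for the "no edge" label


def unfactorized(n_words, edges):
    """One label per ordered (head, dependent) pair; NULL where no edge exists."""
    table = [[NULL] * n_words for _ in range(n_words)]
    for head, dep, label in edges:
        table[head][dep] = label
    return table


def factorized(n_words, edges):
    """Split the same graph into a boolean edge matrix plus a label lookup."""
    has_edge = [[False] * n_words for _ in range(n_words)]
    labels = {}
    for head, dep, label in edges:
        has_edge[head][dep] = True
        labels[(head, dep)] = label
    return has_edge, labels


# "Mary wants to buy a book": Mary (index 0) depends on both wants (1)
# and buy (3), so the structure is a graph rather than a tree.
# The label names are made up for illustration.
words = ["Mary", "wants", "to", "buy", "a", "book"]
edges = [(1, 0, "ARG1"), (3, 0, "ARG1"), (1, 3, "ARG2"), (3, 5, "ARG2")]

table = unfactorized(len(words), edges)
has_edge, labels = factorized(len(words), edges)
heads_of_mary = [h for h in range(len(words)) if has_edge[h][0]]
```

In the factorized view the label lookup is only consulted for pairs where `has_edge` is true, which mirrors the split into an edge module and a label module.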
Our approach closely follows that of Dozat and Manning (2017). As with many successful recent parsers, we concatenate word and POS tag embeddings, and feed them into a multilayer bidirectional LSTM to get richer representations. For each of the two modules, we use single-layer feedforward networks (FNN) to split the top recurrent states into two parts: a head representation, as in Eqs. (5, 6), and a dependent representation, as in Eqs. (7, 8). This allows us to reduce the recurrent size to avoid overfitting in the classifier without weakening the LSTM's capacity. We can then use bilinear or biaffine classifiers in Eqs. (3, 4), which are generalizations of linear classifiers that include multiplicative interactions between two vectors, to predict edges and labels. The tensor U can optionally be diagonal (such that $u_{i,k,j} = 0$ wherever $i \neq j$) to conserve parameters.
The unlabeled parser (trained with sigmoid cross-entropy) scores every edge between pairs of words in the sentence; these scores can be decoded into a graph by keeping only edges that received a positive score. The labeler (trained with softmax cross-entropy) scores every label for each pair of words, so we simply assign each predicted edge its highest-scoring label. We can train the system by summing the losses, backpropagating error to the labeler only through gold edges. This system is shown graphically in Figure 2. We find that sometimes the loss for one module overwhelms the loss for the other, causing the system to underfit. Thus we add a tunable interpolation constant $\lambda \in (0, 1)$ to even out the two losses.
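The scoring and decoding steps described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the system's implementation: the random matrix `R` stands in for the top BiLSTM states, all weights are random instead of trained, the toy sizes are arbitrary, and the biaffine form (a bilinear term plus a linear term over both vectors plus a bias) is an assumed reading of the equations referenced in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_rnn, d_hid, n_labels = 5, 16, 8, 4   # toy sizes, chosen arbitrarily

R = rng.normal(size=(n, d_rnn))           # stand-in for the top BiLSTM states


def project(x, d_out):
    """Single-layer feedforward projection with random weights, no nonlinearity."""
    W = rng.normal(size=(x.shape[1], d_out)) / np.sqrt(x.shape[1])
    return x @ W


def biaffine(dep, head, n_out):
    """Score every (dependent i, head j) pair: bilinear + linear + bias."""
    d = dep.shape[1]
    U = rng.normal(size=(n_out, d, d)) / d
    W = rng.normal(size=(n_out, 2 * d)) / np.sqrt(2 * d)
    b = np.zeros(n_out)
    bilinear = np.einsum("id,odk,jk->ijo", dep, U, head)
    linear = (dep @ W[:, :d].T)[:, None, :] + (head @ W[:, d:].T)[None, :, :]
    return bilinear + linear + b          # shape (n, n, n_out)


# Each module gets its own head and dependent projections of the same states.
edge_scores = biaffine(project(R, d_hid), project(R, d_hid), 1)[..., 0]
label_scores = biaffine(project(R, d_hid), project(R, d_hid), n_labels)

edges = edge_scores > 0                   # keep edges with a positive score
labels = label_scores.argmax(axis=-1)     # best label for every word pair
# Each entry is (head j, dependent i, label id) for a predicted edge.
parse = [(int(j), int(i), int(labels[i, j])) for i, j in zip(*np.nonzero(edges))]
```

A diagonal `U` would correspond to replacing the bilinear term with `np.einsum("id,od,jd->ijo", dep, U_diag, head)`, where `U_diag` has shape `(n_out, d)`. Because every pair is scored independently and no tree constraint is enforced, a word may end up with zero, one, or several heads.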
It is worth noting that removing the maximum spanning tree algorithm and changing from softmax cross-entropy to sigmoid cross-entropy in the unlabeled parser are the only changes needed to allow the original syntactic parser to generate fully graph-structured semantic dependency output. Note also that this system is general enough that it could be used for any graph-structured dependency scheme, including the enhanced dependencies of the Universal Dependencies formalism (which allows cyclic graphs).
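The training objective described in this section can be sketched as below. Sigmoid cross-entropy treats every word pair as an independent yes/no decision, so a word can receive any number of heads, and the label loss is computed only where a gold edge exists. The function names are hypothetical, and the exact interpolation (here, $\lambda$ weighting the label loss and $1 - \lambda$ the edge loss) is an assumption, since the text only states that a constant $\lambda \in (0, 1)$ balances the two losses.

```python
import numpy as np


def sigmoid_xent(scores, gold_edges):
    """Mean binary cross-entropy over all word pairs (numerically stable form)."""
    y = gold_edges.astype(float)
    losses = np.maximum(scores, 0) - scores * y + np.log1p(np.exp(-np.abs(scores)))
    return losses.mean()


def softmax_xent(label_scores, gold_labels, gold_edges):
    """Mean label cross-entropy, counted only where a gold edge exists."""
    shifted = label_scores - label_scores.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, gold_labels[..., None], axis=-1)[..., 0]
    return -(picked * gold_edges).sum() / max(gold_edges.sum(), 1)


def joint_loss(edge_scores, label_scores, gold_edges, gold_labels, lam=0.5):
    """Interpolate the two losses; this particular weighting is an assumption."""
    return (lam * softmax_xent(label_scores, gold_labels, gold_edges)
            + (1 - lam) * sigmoid_xent(edge_scores, gold_edges))


# Toy two-word example with a single gold edge at position (0, 1), label id 2.
gold_edges = np.array([[0, 1], [0, 0]], dtype=bool)
gold_labels = np.array([[0, 2], [0, 0]])
edge_scores = np.array([[-3.0, 4.0], [-2.0, -5.0]])
label_scores = np.zeros((2, 2, 3))
label_scores[0, 1, 2] = 5.0
loss = joint_loss(edge_scores, label_scores, gold_edges, gold_labels, lam=0.5)
```

Masking the label loss with `gold_edges` means that label scores for pairs without a gold edge never contribute to the loss, which matches backpropagating error to the labeler only through gold edges.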

Augmentations
Ballesteros et al. (2016), Dozat et al. (2017), and Ma et al. (2018) find that character-level word embedding models improve performance for syntactic dependency parsing, so we also want to explore the impact they have on semantic dependency parsing. Earlier work on the syntactic parser that we extend confirms that it performs better with POS tags, which leads us to examine whether word lemmas, another form of low-level lexical information, might also improve dependency parsing performance.

Hyperparameters
We tuned the hyperparameters for our basic system (with no character embeddings or lemmas) fairly extensively on the DM development data. The hyperparameter configuration for our final system is given in Table 2. All input embeddings (word, pretrained, POS, etc.) were concatenated. We used 100-dimensional pretrained GloVe embeddings (Pennington et al., 2014), but linearly transformed them to be 125-dimensional. Only words or lemmas that occurred 7 times or more were included in the word and lemma embedding matrices; including less frequent words appeared to encourage overfitting. Character-level word embeddings were generated using a one-layer unidirectional LSTM that convolved over three character embeddings at a time, whose end state was linearly transformed to be 100-dimensional. The core BiLSTM was three layers deep. The different types of word embeddings (word, GloVe, and character-level) were dropped simultaneously, but independently from POS and lemma embeddings (which were dropped independently from each other). Dropped embeddings were replaced with learned <DROP> tokens. LSTMs used same-mask recurrent dropout (Gal and Ghahramani, 2016). The systems were trained with batch sizes of 3000 tokens for up to 75,000 training steps, terminating early after 10,000 steps passed with no improvement in validation accuracy. We use biaffine classifiers, with no nonlinearities, and a diagonal tensor in the label classifier but not the edge classifier. The system trains at a speed of about 300 sequences/second on an Nvidia Titan X and parses about 1,000 sequences/second.

Performance
Du et al. (2015) and Almeida and Martins (2015) are the systems that won the 2015 shared task (closed track). PTS17: Basic represents the single-task versions of Peng et al. (2017), which they make multitask across the three datasets in Freda3 by adding frustratingly easy domain adaptation (Daumé III, 2007; Kim et al., 2016) and a third-order decoding mechanism. WCGL18 is Wang et al.'s (2018) transition-based system. Table 1 compares our performance with these systems.
Our fully factorized basic system already substantially outperforms Peng et al.'s single-task baseline and also beats their much more complex multi-task approach. Simply adding in either a character-level word embedding model (similar to Dozat et al.'s (2017)) or a lemma embedding matrix likewise improves performance quite a bit, and including both together generally pushes performance even higher. Many infrequent words were excluded from the frequent token embedding matrix, so it makes sense that the system should improve when provided with more lexical information that is harder to overfit on.
Figure 3: Performance of architecture variations: our basic system; unfactorized (labeler-only); omitting the hidden layers (Eqs. 5-8); with bilinear classifiers (Eq. 3); with nondiagonal tensors in the labeler or diagonal tensors in the parser; with the ReLU nonlinearity.
Surprisingly, the PAS dataset seems not to benefit substantially from lemma or character embeddings. It has been noted that PAS is the easiest of the three datasets to achieve good performance for, so one possible explanation is that 94% LF1 may simply be near the ceiling of what can be achieved for the dataset. Alternatively, the main difference between PAS and DM/PSD is that PAS includes semantically vacuous function words in its representation. Because function words are extremely frequent, it is possible that they are being disproportionately represented in the loss or LF1 score. Using a hinge loss (like Peng et al. (2017)) instead of a cross-entropy loss might help, since the system would stop focusing on potentially "easy" functional predicates once it learned to predict their argument structures confidently, allowing it to put more resources into modeling more challenging phenomena.
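The intuition behind the hinge-loss suggestion can be illustrated with a per-decision comparison. This is a simplification: a max-margin parser such as the one discussed above optimizes a structured objective over whole graphs, whereas the sketch below looks at a single correct decision and assumes a target margin of 1. The function names are hypothetical.

```python
import math


def logistic_loss(margin):
    """Sigmoid cross-entropy for a correct decision scored with the given margin."""
    return math.log1p(math.exp(-margin))


def hinge_loss(margin, target=1.0):
    """Hinge loss: exactly zero once the margin reaches the target."""
    return max(0.0, target - margin)


for margin in (0.0, 1.0, 2.0, 5.0):
    print(f"margin={margin:>4}: "
          f"logistic={logistic_loss(margin):.4f}  hinge={hinge_loss(margin):.4f}")
```

The cross-entropy term shrinks as the margin grows but never reaches zero, so confidently predicted edges keep contributing gradient. The hinge term vanishes past the target margin, so very frequent but easy decisions would stop influencing training.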

Variations
We also consider the impact that slight variations on the basic architecture have on final performance in Figure 3. We train twenty models on the DM treebank for each variation we consider, reducing the number of training steps but keeping all other hyperparameters constant. Rank-sum tests (Lehmann et al., 1975) reveal that the basic system outperforms variants with no hidden layers in the edge classifier (W = 339; p < .001) or the label classifier (W = 307; p < .01). Using a diagonal tensor U in the unlabeled parser also significantly hurts performance (W = 388; p < .001), likely because it leaves the parser underpowered. While the other variations (especially the unfactorized and ReLU systems) appeared to make a difference during hyperparameter tuning, they were not significant here.
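As a rough sanity check on the reported statistics, the p-values can be approximated from W alone. The sketch below rests on two assumptions that the text does not state: that W follows the Mann-Whitney convention (the rank sum of one sample minus its minimum possible value), and that each comparison involves twenty runs of the basic system against twenty runs of the variant. It uses the normal approximation without tie or continuity corrections, so it will not exactly match an exact test. The function name is hypothetical.

```python
import math


def rank_sum_p_value(w, n_a, n_b):
    """Two-sided p-value for a rank-sum statistic via the normal approximation.

    Assumes w is in the Mann-Whitney convention, i.e. the rank sum of one
    sample minus n_a * (n_a + 1) / 2, and ignores tie/continuity corrections.
    """
    mean = n_a * n_b / 2
    std = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean) / std
    return math.erfc(abs(z) / math.sqrt(2))


for w in (339, 307, 388, 269):
    print(f"W = {w}: p ~ {rank_sum_p_value(w, 20, 20):.4f}")
```

Under these assumptions the approximate p-values fall in the same ranges as the reported significance levels, which is consistent with W being a Mann-Whitney-style statistic over two groups of twenty runs.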
The improved performance of deeper systems (replicating Dozat and Manning (2017)) likely justifies the added complexity. On the other hand, the choice between biaffine and bilinear classifiers comes down largely to aesthetics. This is perhaps unsurprising since the change from biaffine to bilinear represents only a small decrease in overall power. Unusually, using no nonlinearity in the hidden layers in Eqs. (5-8) works as well as ReLU; in fact, using ReLU in the unlabeled parser marginally reduced performance (W = 269; p = .063). Overall, the parser displayed considerable invariance to architecture changes. Since our system is significantly larger and more heavily regularized than the systems we compare against, this suggests that unglamorous, low-level hyperparameters (such as hidden sizes and dropout rates) are more critical to system performance than high-level architecture enhancements.

Discussion
We minimally extended a simple syntactic dependency parser to produce graph-structured dependencies. Without any further augmentations, our carefully-tuned system achieves state-of-the-art performance, highlighting the importance of finding the best hyperparameter configuration (and by extension, building fast systems that can be trained quickly). Additionally, we can see that a multitask system relying on a complex decoding algorithm to prune away invalid graph structures isn't necessary for achieving the level of parsing performance a simple system can achieve (though it could push performance even higher). We also find easier or independently motivated ways to improve accuracy: taking advantage of provided lemma or subtoken information provides a boost comparable to one found by drastically increasing system complexity.
Further, we observe that a high-performing graph-based parser can be adapted to different types of dependency graphs (projective tree, non-projective tree, directed graph) with only small changes and without obviously hurting accuracy. By contrast, transition-based parsers, which were originally designed for parsing projective constituency trees (Nivre, 2003; Aho and Ullman, 1972), require whole new transition sets or even data structures to generate arbitrary graphs. We feel that this points to graph-based parsers being the most natural way to produce dependency graphs with different structural restrictions.