Second-Order Semantic Dependency Parsing with End-to-End Neural Networks

Semantic dependency parsing aims to identify semantic relationships between words in a sentence that form a graph. In this paper, we propose a second-order semantic dependency parser, which takes into consideration not only individual dependency edges but also interactions between pairs of edges. We show that second-order parsing can be approximated using mean field (MF) variational inference or loopy belief propagation (LBP). We can unfold both algorithms as recurrent layers of a neural network and therefore can train the parser in an end-to-end manner. Our experiments show that our approach achieves state-of-the-art performance.


Introduction
Semantic dependency parsing (Oepen et al.) aims to produce graph-structured semantic dependency representations of sentences instead of tree-structured syntactic dependency parses. Existing approaches to semantic dependency parsing can be classified into graph-based and transition-based approaches. In this paper, we investigate graph-based approaches, which score each possible parse of a sentence by factorizing it into parts and search for the highest-scoring parse.
Previous work in graph-based syntactic dependency parsing has shown that higher-order parsing generally outperforms first-order parsing (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Ma and Zhao, 2012). While a first-order parser scores dependency edges independently, a higher-order parser takes relationships between two or more edges into consideration. However, most previous algorithms for higher-order syntactic dependency tree parsing are not applicable to semantic dependency graph parsing, and designing efficient algorithms for higher-order semantic dependency graph parsing is nontrivial. In addition, it has become common practice to use neural networks to compute features and scores of parse graph components, which ideally requires backpropagating parsing errors through the higher-order parsing algorithm, adding to the difficulty of designing such an algorithm.
In this paper, we propose a novel graph-based second-order semantic dependency parser. Given an input sentence, we use a neural network to compute scores for both first- and second-order parts of parse graphs and then apply either mean field variational inference or loopy belief propagation to approximately find the highest-scoring parse graph. Both are iterative inference algorithms, and we show that they can be unfolded as recurrent layers of a neural network, with each layer representing the computation of one iteration of the algorithm. In this way, we can construct an end-to-end neural network that takes in a sentence and outputs the approximate marginal probability of every possible dependency edge.
During training, we maximize the probability of the gold parses using standard gradient-based methods. Our experiments show that our approach achieves state-of-the-art performance in semantic dependency parsing, outperforming our baseline by 0.3% and 0.4% labeled F1 and the previous state-of-the-art model by 1.3% and 1.4% labeled F1 on the in-domain and out-of-domain test sets respectively. Our approach shows a larger advantage over the baseline when less training data is available and when parsing longer sentences.

Semantic Dependency Parsing
We evaluate on the task of Broad-Coverage Semantic Dependency Parsing (Oepen et al., 2015) with an additional out-of-domain test set (the Brown corpus). A semantic dependency parse differs from a syntactic dependency parse in that the dependency edges are annotated with semantic relations (e.g., agent and patient) and form a directed acyclic graph instead of a tree. The task provides three different formalisms: DM, PAS and PSD. Previous work has found that PAS is the easiest to learn and PSD the most difficult, as it has the largest label set.

Approach
Our model architecture (shown in Figure 1) follows that of Dozat and Manning (2018). Given an input sentence, we first compute word representations using a BiLSTM, which are then fed into two parallel modules: one predicts the existence of every edge and the other predicts the label of every edge. The label-prediction module makes its prediction for each edge independently and hence is a first-order decoder. The edge-prediction module is where our approach differs from that of Dozat and Manning (2018): it scores both first- and second-order parts and then runs multiple recurrent inference layers to predict edge existence.

Part Scoring
Given a sentence with n words [w_1, w_2, ..., w_n], we feed their word embeddings and POS tag embeddings into a BiLSTM:

o_i = e_i^(word) ⊕ e_i^(pos)        R = BiLSTM(O)

where o_i is the concatenation (⊕) of the word embedding and POS tag embedding of word w_i, O represents [o_1, ..., o_n], and R = [r_1, ..., r_n] represents the output of the BiLSTM.
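As a minimal sketch of the input construction (toy dimensions; the embedding values here are random placeholders standing in for trained embedding lookups):

```python
import numpy as np

rng = np.random.default_rng(0)
d_word, d_pos = 100, 25                    # assumed embedding sizes, for illustration
word_emb = rng.standard_normal(d_word)     # e_i^(word): word embedding of w_i
pos_emb = rng.standard_normal(d_pos)       # e_i^(pos): POS tag embedding of w_i

# o_i is the concatenation (⊕) of the word and POS tag embeddings
o_i = np.concatenate([word_emb, pos_emb])  # shape (d_word + d_pos,)
```

The sequence [o_1, ..., o_n] would then be fed to the BiLSTM to produce R.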
To score first-order parts (edges) in both the edge-prediction module and the label-prediction module, we use two single-layer feedforward networks (FNNs) to compute a head representation and a dependent representation for each word and then apply a biaffine function to compute the scores of edges and labels.
In Eq. 2, the tensor U in the biaffine function is (d × c × d)-dimensional, where d is the hidden size and c is the number of labels. In Eq. 1, the tensor U in the biaffine function is (d × 1 × d)-dimensional.
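The biaffine scoring can be sketched as follows (toy dimensions; the tensor contraction is the standard biaffine form, while the variable names are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 8, 5                                # hidden size and number of labels (toy values)
h_head = rng.standard_normal(d)            # head representation from one FNN
h_dep = rng.standard_normal(d)             # dependent representation from another FNN

def biaffine(head, U, dep):
    """Contract head (d,) with U (d, k, d) and dep (d,) into k scores."""
    return np.einsum('a,akb,b->k', head, U, dep)

U_edge = rng.standard_normal((d, 1, d))    # (d x 1 x d) tensor for edge scores (Eq. 1)
U_label = rng.standard_normal((d, c, d))   # (d x c x d) tensor for label scores (Eq. 2)

s_edge = biaffine(h_head, U_edge, h_dep)   # a single edge-existence score
s_label = biaffine(h_head, U_label, h_dep) # one score per label
```

In practice the contraction is batched over all word pairs (i, j), giving an n × n score matrix for edges and an n × n × c tensor for labels.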
In the edge-prediction module, we further score second-order parts. We consider three types of second-order parts: siblings (sib), co-parents (cop) and grandparents (gp) (Martins and Almeida, 2014), as shown in Figure 2. For a specific type of second-order part, we use single-layer FNNs to compute a head representation and a dependent representation for each word. For grandparent parts, we additionally compute a head_dep representation for each word.
Here, type ranges over {sib, cop, gp}. We then apply a trilinear function to compute the scores of second-order parts. The trilinear function is defined as

trilin(v_1, v_2, v_3) = Σ_{a,b,c} U_{a,b,c} (v_1)_a (v_2)_b (v_3)_c
where U is a (d × d × d)-dimensional tensor. To reduce the computational cost, we assume that U has rank d and can be represented as the product of three (d × d)-dimensional matrices U_1, U_2 and U_3, i.e., U_{a,b,c} = Σ_r (U_1)_{a,r} (U_2)_{b,r} (U_3)_{c,r}. We can then compute second-order part scores as follows:

trilin(v_1, v_2, v_3) = Σ_r (U_1^T v_1)_r (U_2^T v_2)_r (U_3^T v_3)_r
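Under the rank-d assumption, the (d × d × d) contraction reduces to three matrix-vector products and an elementwise product; a sketch with toy sizes (names illustrative), including a check against the full tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
v1, v2, v3 = (rng.standard_normal(d) for _ in range(3))
U1, U2, U3 = (rng.standard_normal((d, d)) for _ in range(3))

def trilinear_lowrank(v1, v2, v3, U1, U2, U3):
    """score = sum_{a,b,c} U[a,b,c] v1[a] v2[b] v3[c]
    with the rank-d factorization U[a,b,c] = sum_r U1[a,r] U2[b,r] U3[c,r],
    computed without materializing the d x d x d tensor."""
    return float(np.sum((v1 @ U1) * (v2 @ U2) * (v3 @ U3)))

# sanity check against the explicit (d x d x d) tensor contraction
U_full = np.einsum('ar,br,cr->abc', U1, U2, U3)
direct = float(np.einsum('abc,a,b,c->', U_full, v1, v2, v3))
```

The low-rank form costs O(d^2) per part instead of O(d^3), which is what makes scoring all second-order parts tractable.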

Inference
In the label-prediction module, s^(label) is fed into a softmax layer that outputs the probability of each label for edge (i, j). In the edge-prediction module, computing the edge probabilities can be seen as posterior inference on a conditional random field (CRF). The corresponding factor graph is shown in Figure 3. Each Boolean variable X_ij in the CRF indicates whether the directed edge (i, j) exists. We use Eq. 1, which scores individual edges, to define a unary potential φ_u(X_ij) for each variable X_ij, and Eqs. 4-6 to define binary potentials φ_p.
For each pair of edges (i, j) and (k, l) that form a second-order part of a specific type, we define a binary potential φ p (X ij , X kl ).
Exact inference on this CRF is intractable. We resort to the iterative approximate inference algorithms described below, which produce a posterior distribution Q_ij(X_ij) for each edge (i, j). We then predict the parse graph by including every edge (i, j) such that Q_ij(1) > 0.5. Edge labels are predicted by maximizing the label probabilities computed by the label-prediction module.

Mean Field Variational Inference
Mean field variational inference approximates the true posterior distribution with a factorized variational distribution and iteratively minimizes the KL divergence between the two. This yields the following update of the distribution Q_ij at iteration t:

Q_ij^(t)(x) ∝ φ_u(x) exp{ Σ_{(k,l)} Σ_{x'} Q_kl^(t-1)(x') log φ_p(x, x') }

where (k, l) ranges over the edges that form a second-order part with (i, j). The initial distribution Q_ij^(0) is set by normalizing the unary potential φ_u(X_ij). We iteratively update the distributions for T steps and then output Q_ij^(T), where T is a hyperparameter.
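A sketch of T unfolded mean-field iterations over m binary edge variables, working with log-potentials for numerical convenience (the dense pairwise table is illustrative; in the parser the pairwise terms come from the sib/cop/gp scores and are zero for pairs that form no second-order part, including self-pairs):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_field(log_unary, log_pairwise, T=3):
    """log_unary: (m, 2) log phi_u per edge variable;
    log_pairwise: (m, m, 2, 2) log phi_p indexed [i, k, x_i, x_k],
    zero where edges i and k form no second-order part.
    Returns Q: (m, 2) approximate marginals after T unfolded iterations."""
    Q = softmax(log_unary, axis=-1)          # Q^(0): normalized unary potentials
    for _ in range(T):
        # expected log pairwise potential under the neighbors' current Q
        msg = np.einsum('kb,ikab->ia', Q, log_pairwise)
        Q = softmax(log_unary + msg, axis=-1)
    return Q

m = 4
rng = np.random.default_rng(0)
log_unary = rng.standard_normal((m, 2))
log_pairwise = 0.1 * rng.standard_normal((m, m, 2, 2))
Q = mean_field(log_unary, log_pairwise, T=3)  # each row sums to 1
```

Because every step is differentiable, the T iterations behave like T recurrent layers through which gradients can flow.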

Loopy Belief Propagation
Loopy belief propagation iteratively passes messages between variables and potential functions (factors). Because our CRF contains only unary and binary potentials, we can merge each variable-to-factor message and its subsequent factor-to-variable message into a single variable-to-variable message M_kl→ij, representing the message from edge (k, l) to edge (i, j). The messages are updated in each iteration as

M_kl→ij^(t)(x) ∝ Σ_{x'} φ_p(x, x') φ_u(x') Π_{(k',l') ≠ (i,j)} M_{k'l'→kl}^(t-1)(x')

We initialize the messages with M_kl→ij^(0) = 1, iteratively update the messages for T steps, and then output

Q_ij(x) ∝ φ_u(x) Π_{(k,l)} M_kl→ij^(T)(x)


Inference as Recurrent Layers

Zheng et al. (2015) proposed that a fixed number of iterations of mean field variational inference can be seen as a recurrent neural network parameterized by the potential functions. We follow this idea and unfold both mean field variational inference and loopy belief propagation as recurrent neural network layers parameterized by the part scores.
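The merged variable-to-variable message passing of loopy belief propagation can be sketched in log space as follows (a dense toy version with illustrative names; messages are normalized at each step for numerical stability):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def logsumexp(x, axis=-1):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True)), axis=axis)

def loopy_bp(log_unary, log_pairwise, T=3):
    """log_unary: (m, 2); log_pairwise: (m, m, 2, 2) indexed [i, k, x_i, x_k],
    zero where edges i and k form no second-order part (including i == k).
    Returns approximate marginals Q: (m, 2)."""
    m = log_unary.shape[0]
    logM = np.zeros((m, m, 2))                  # logM[k, i, x_i]: message k -> i; M^(0) = 1
    for _ in range(T):
        total_in = logM.sum(axis=0)             # (m, 2): all incoming messages per variable
        new = np.empty_like(logM)
        for k in range(m):
            for i in range(m):
                # belief at k, excluding the message previously sent from i to k
                b = log_unary[k] + total_in[k] - logM[i, k]
                new[k, i] = logsumexp(log_pairwise[i, k] + b[None, :], axis=-1)
        new -= logsumexp(new, axis=-1)[..., None]   # normalize each message
        logM = new
    return softmax(log_unary + logM.sum(axis=0), axis=-1)

rng = np.random.default_rng(1)
log_unary = rng.standard_normal((5, 2))
log_pairwise = 0.1 * rng.standard_normal((5, 5, 2, 2))
Q = loopy_bp(log_unary, log_pairwise, T=3)
```

As with mean field, every operation is differentiable, so the T message-passing steps can be stacked as recurrent layers and trained end to end.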
The time complexity of our inference procedure is O(n^3), which is lower than the O(n^4) complexity of the exact quasi-second-order inference of Cao et al. (2017) and on par with the complexity of the approximate second-order inference of Martins and Almeida (2014).

Learning
Given a gold parse graph y of sentence w, the model defines a conditional distribution over each possible edge y_ij^(edge) (through Q_ij) and each corresponding label y_ij^(label) (through the label softmax). We define the following cross-entropy losses:

L^(edge)(θ) = − Σ_{i,j} [ 1(y_ij^(edge)) log Q_ij(1) + (1 − 1(y_ij^(edge))) log Q_ij(0) ]

L^(label)(θ) = − Σ_{i,j} 1(y_ij^(edge)) log P_θ(y_ij^(label) | w)

where θ denotes the parameters of our model, 1(y_ij^(edge)) is an indicator function that equals 1 when edge (i, j) exists in the gold parse and 0 otherwise, and i, j range over all the words in the sentence. We optimize the weighted average of the two losses:

L(θ) = λ L^(label)(θ) + (1 − λ) L^(edge)(θ)
where λ is a hyperparameter.
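A sketch of this training objective (the exact placement of λ between the two losses is our assumption, and the array names are illustrative):

```python
import numpy as np

def parser_loss(q_edge, p_label, gold_edge, gold_label, lam=0.5):
    """q_edge: (n, n) predicted edge probabilities Q_ij(1);
    p_label: (n, n, c) label distributions from the label-prediction module;
    gold_edge: (n, n) 0/1 indicator 1(y_ij^(edge));
    gold_label: (n, n) integer gold label ids (read only where an edge exists);
    lam: interpolation hyperparameter (assumed weighting)."""
    eps = 1e-12  # numerical guard for log(0)
    # binary cross-entropy over edge existence
    ce_edge = -(gold_edge * np.log(q_edge + eps)
                + (1 - gold_edge) * np.log(1 - q_edge + eps)).sum()
    # cross-entropy over labels, only where a gold edge exists
    heads, deps = np.nonzero(gold_edge)
    ce_label = -np.log(p_label[heads, deps, gold_label[heads, deps]] + eps).sum()
    return lam * ce_label + (1 - lam) * ce_edge
```

With perfect predictions both terms vanish, so the loss approaches zero.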

Hyperparameters
We tuned the hyperparameters of our baseline model from Dozat and Manning (2018) and of our second-order model on the DM development set. Following Dozat and Manning (2018), we use 100-dimensional pretrained GloVe embeddings (Pennington et al., 2014) and transform them to be 125-dimensional. Words and lemmas that appear fewer than 7 times are replaced with a special unknown token. We use the same dataset split as previous approaches (Martins and Almeida, 2014; Du et al., 2015), with 33,964 sentences in the training set and 1,692 sentences in the development set.

Main Results
We compare our model with previous state-of-the-art approaches in Table 2. PTS17, proposed by Peng et al. (2017), includes Basic, a single-task parser, and Freda3, a multitask parser across the three formalisms. WCGL18 (Wang et al., 2018) is a neural transition-based model. D&M (Dozat and Manning, 2018) is a graph-based model, and Baseline is the first-order model of Dozat and Manning (2018) trained by ourselves. For our model, we run mean field variational inference or loopy belief propagation for 3 iterations.
In the basic setting, our model outperforms the best previous one by 1.3% on average on the in-domain test set and 1.3% on the out-of-domain test set. With lemma and character-based embeddings, our model leads to an average improvement of 0.3% and 0.6% over previous models. Our model also outperforms the baseline by 0.2%-0.5% on average across different settings and test sets. Dozat and Manning (2018) found that on the PAS dataset their model did not benefit from lemma and character-based embeddings and hence speculated that they may have approached the ceiling of the PAS F1 score. In our experiments on the PAS dataset, our model does not benefit from lemma and character-based embeddings either, but it obtains higher F1 scores, which suggests that the ceiling has not yet been reached.
Note that while we do not force our parser to predict a directed acyclic graph, we found that only 0.7% of the test sentences have cycles in their parses.

Analysis Small Training Data
To evaluate the performance of our model on smaller training data, we repeated our experiments with randomly sampled 70%, 40% and 10% of the training set. Table 3 shows the F1 scores averaged over 5 runs (each with a newly sampled training subset). The advantage of our model over the baseline increases significantly as the training data becomes smaller. We offer the following speculative explanation. The BiLSTM layer in both the baseline and our model is capable of capturing high-order information to some extent. However, without prior knowledge of high-order parts, it may require more training data to learn this capability than a high-order decoder does. So with small training data, the baseline loses the ability to utilize high-order information, while our model can still rely on its decoder for high-order parsing.

Performance on Different Sentence Lengths
We study the impact of sentence length on first-order parsing and on our second-order parsing. We split the test sets of all the formalisms into five subsets with different sentence length ranges and evaluate our model and the baseline on each of them.
Figure 4 shows that our model gains a larger advantage over the baseline as sentences get longer, especially beyond 40 words. One possible explanation is that the BiLSTM has difficulty capturing long-range dependencies in long sentences, which lowers the performance of the first-order baseline, whereas such long-range dependencies can still be captured by second-order parsing. On long sentences, our model's advantage over the baseline is also larger on the out-of-domain test set than on the in-domain test set, which suggests that our model generalizes better, especially on long sentences.

Mean Field vs. Loopy Belief Propagation
We compare mean field variational inference and loopy belief propagation algorithms in Table 4.
We tuned the hyperparameters of our model for each algorithm and iteration number separately. We find that mean field variational inference generally performs very similarly to loopy belief propagation. In addition, with more iterations, the performance of mean field variational inference steadily increases, while that of loopy belief propagation peaks at the second iteration.

Ablation Study
We study how the different types of second-order parts defined in Section 3.1 affect the performance of our parser. We trained our model with each type of second-order parts, without the other two types, on the DM dataset using mean field variational inference; the results are shown in Table 5. While all three types of second-order parts improve parsing performance over the baseline, the sibling parts lead to the largest performance gain on both the in-domain and the out-of-domain test sets.

Case Study
We provide a parsing example in Figure 5 to show how our second-order parser works with 3 iterations of mean field variational inference. Before the first iteration, the marginal distribution Q_ij of each edge is initialized from the unary potentials and thus is exactly what a first-order parser would produce. In the subsequent iterations, the distributions are updated with the binary potentials taken into account. For each version of the distributions, we can extract a parse graph by collecting the edges with probabilities larger than 0.5. Figure 5 shows that erroneous edges are gradually fixed over the iterations. Edge (were, Poles) sends a strong negative co-parent message to edge (<TOP>, Poles) in the first iteration, so the latter has a lower probability in subsequent iterations. Edge (were, Poles) also sends a strong positive grandparent message to edge (<TOP>, were) to raise its probability, and the latter sends an increasingly positive message back to the former in subsequent iterations. In the second and third iterations, (were, Poles) sends positive sibling messages and (<TOP>, were) sends positive grandparent messages that raise the probabilities of edges (were, They) and (were, not), which finally leads to the correct parse.

Speed

We compare the training and parsing speed on an Nvidia Tesla P40 server. The results are shown in Table 6. Mean field variational inference slows down training and parsing by 35% and 20% respectively compared with the baseline, while loopy belief propagation slows down training and parsing by 65% and 67% respectively.

Significance Test
We trained 25 basic models of our approach and of the baseline with the same hyperparameters as in Table 1.

Semantic Dependency Parsing

Peng et al. (2018) further proposed to learn from different corpora. Dozat and Manning (2018) proposed a simple but powerful graph-based neural network for semantic dependency parsing, using a bilinear or biaffine (Dozat and Manning, 2016) layer to encode the interaction between words.
Most of these approaches are first-order dependency parsers, while Martins and Almeida (2014) encoded higher-order parts with hand-crafted features and introduced a novel co-parent part for semantic dependency parsing. They used the discrete optimization algorithm alternating directions dual decomposition (AD3) as their decoder. Cao et al. (2017) also proposed a quasi-second-order semantic dependency parser with dynamic programming. Our model incorporates second-order information compared with the first-order approaches and benefits from end-to-end training compared with other second-order approaches.

Higher-Order Dependency Parsing
Higher-order parsing has been extensively studied in the literature on syntactic dependency parsing. Much of this work is based on the first-order maximum spanning tree (MST) parser of McDonald et al. (2005), which factorizes a dependency tree into individual edges and maximizes the sum of the scores of all the edges in a tree. McDonald and Pereira (2006) introduced a second-order MST parser that factorizes a dependency tree into not only edges but also second-order sibling parts, allowing interactions between adjacent sibling words. Carreras (2007) defined second-order grandparent parts representing grandparental relations. Koo and Collins (2010) introduced third-order grand-sibling and tri-sibling parts: a grand-sibling part represents a grandparent with two grandchildren and a tri-sibling part represents a parent with three children. Ma and Zhao (2012) defined grand-tri-sibling parts for fourth-order dependency parsing.
Many previous approaches to higher-order dependency parsing perform exact decoding based on dynamic programming, but there is also research on approximate higher-order parsing. Martins et al. (2011) proposed the alternating directions dual decomposition (AD3) algorithm, which splits the original problem into several local subproblems and solves them iteratively; they employed AD3 for second-order dependency parsing to speed up decoding. Smith and Eisner (2008) and Gormley et al. (2015) proposed to use belief propagation for approximate higher-order parsing, which is closely related to our work.
While higher-order parsing has been shown to improve syntactic dependency parsing accuracy, it has received less attention in semantic dependency parsing. Martins and Almeida (2014) proposed second-order semantic dependency parsing and employed AD3 for approximate decoding. Cao et al. (2017) proposed a quasi-second-order parser and used dynamic programming for decoding with a time complexity of O(n^4).

CRF as Recurrent Neural Networks
Zheng et al. (2015) were probably the first to propose unfolding iterative inference algorithms on CRFs as a stack of recurrent neural network layers. They unfolded mean field variational inference in a neural network designed for semantic segmentation. Much subsequent work employs this technique, especially in computer vision. For example, Zhu et al. (2017) proposed a structured attention model for visual question answering with a CRF over image regions and unfolded both the mean field variational inference and loopy belief propagation algorithms as recurrent layers.

Conclusion
We proposed a novel graph-based second-order parser for semantic dependency parsing. We constructed an end-to-end neural network that uses a trilinear function to score second-order parts and finds the highest-scoring parse graph with either mean field variational inference or loopy belief propagation unfolded as recurrent neural network layers. Our experimental results show that our model outperforms the previous state-of-the-art models and achieves higher accuracy especially on out-of-domain data and long sentences. Our code is publicly available at https://github.com/wangxinyu0922/Second_Order_SDP

Figure 4 :
Figure 4: Relative improvements over our baseline in different sentence length intervals with different training data sizes. We report the average F1 score improvement over all the formalisms, with 5 runs for each.

Figure 5 :
Figure 5: An example of message passing (left) and the corresponding parse graphs (right) in our second-order parser with mean field variational inference. We regard terms in Eq. 7 as messages sent from other arcs. Blue arcs and red arcs on the left represent positive messages (which encourage the target edge to exist) and negative messages (which discourage the target edge to exist) respectively. The lightness of an arc's color represents the message intensity. The blackness of each node represents the probability of edge existence; a node with a double circle means the corresponding edge is predicted to exist. Messages with low intensities are omitted from the graph. Dotted arcs and red arcs on the right represent missed predictions and wrong predictions compared to the gold parse. The period in the sentence is omitted for simplicity.

Table 1 :
Hyperparameters for the baseline and second-order models in our experiments.

Table 2 :
Du et al. (2015) is a hybrid model. A&M is from Almeida and Martins.

Table 2 :
Comparison of labeled F1 scores achieved by our model and previous state-of-the-art models. The F1 scores of the baseline and our models are averaged over 5 runs. ID denotes the in-domain (WSJ) test set and OOD denotes the out-of-domain (Brown) test set. +Char and +Lemma mean augmenting the token embeddings with character-level and lemma embeddings respectively.

Table 3 :
Comparison of labeled F1 scores achieved by our model and our baseline on 10%, 40% and 70% of the training data. We report the average F1 score over 5 runs, each with a different randomly sampled training subset.

Table 5 :
Performance comparison among the three types of second-order parts on the DM dataset.

Table 6 :
Training and parsing speed (sentences/second) of the baseline and our model (3 iterations for our second-order parser). long denotes parsing speed on sentences longer than 40 words and short denotes parsing speed on sentences no longer than 10 words.