Multi-Task Semantic Dependency Parsing with Policy Gradient for Learning Easy-First Strategies

In Semantic Dependency Parsing (SDP), semantic relations form directed acyclic graphs, rather than trees. We propose a new iterative predicate selection (IPS) algorithm for SDP. Our IPS algorithm combines the graph-based and transition-based parsing approaches in order to handle multiple semantic head words. We train the IPS model using a combination of multi-task learning and task-specific policy gradient training. Trained this way, IPS achieves a new state of the art on the SemEval 2015 Task 18 datasets. Furthermore, we observe that policy gradient training learns an easy-first strategy.


Introduction
Dependency parsers assign syntactic structures to sentences in the form of trees. Semantic dependency parsing (SDP), first introduced in the SemEval 2014 shared task (Oepen et al., 2014), in contrast, is the task of assigning semantic structures in the form of directed acyclic graphs to sentences. SDP graphs consist of binary semantic relations, connecting semantic predicates and their arguments. A notable feature of SDP is that words can be the semantic arguments of multiple predicates. For example, in the English sentence "The man went back and spoke to the desk clerk", the word "man" is the subject of the two predicates "went back" and "spoke". SDP formalisms typically express this by two directed arcs, from the two predicates to the argument. This yields a directed acyclic graph that expresses various relations among words. However, the fact that SDP structures are directed acyclic graphs means that we cannot apply standard dependency parsing algorithms to SDP.
Standard dependency parsing algorithms are often said to come in two flavors: transition-based parsers score transitions between states, and gradually build up dependency graphs on the side. Graph-based parsers, in contrast, score all candidate edges directly and apply tree decoding algorithms to the resulting score table. The two types of parsing algorithms have different advantages (McDonald and Nivre, 2007), with transition-based parsers often having more problems with error propagation and, as a result, with long-distance dependencies. This paper presents a compromise between transition-based and graph-based parsing, called iterative predicate selection (IPS), inspired by head selection algorithms for dependency parsing (Zhang et al., 2017), and shows that error propagation, for this algorithm, can be reduced by a combination of multi-task and reinforcement learning.
Multi-task learning is motivated by the fact that there are several linguistic formalisms for SDP. Fig. 1 shows the three formalisms used in the shared task. The DELPH-IN MRS (DM) formalism derives from DeepBank (Flickinger et al., 2012) and minimal recursion semantics (Copestake et al., 2005). Predicate-Argument Structure (PAS) is a formalism based on the Enju HPSG parser (Miyao et al., 2004) and is generally considered slightly more syntactic in nature than the other formalisms. Prague Semantic Dependencies (PSD) are extracted from the Czech-English Dependency Treebank (Hajič et al., 2012). There are several overlaps between these linguistic formalisms, and we show below that parsers, using multi-task learning strategies, can take advantage of these overlaps or synergies during training. Specifically, we follow Peng et al. (2017) in using multi-task learning to learn representations of parser states that generalize better, but we go beyond their work, using a new parsing algorithm and showing that we can subsequently use reinforcement learning to prevent error propagation and tailor these representations to specific linguistic formalisms.

arXiv:1906.01239v1 [cs.CL] 4 Jun 2019
Contributions In this paper, (i) we propose a new parsing algorithm for semantic dependency parsing (SDP) that combines transition-based and graph-based approaches; (ii) we show that multi-task learning of state representations for this parsing algorithm is superior to single-task training; (iii) we improve this model by task-specific policy gradient fine-tuning; (iv) we achieve a new state-of-the-art result across three linguistic formalisms; finally, (v) we show that policy gradient fine-tuning learns an easy-first strategy, which reduces error propagation.

Related Work
There are generally two kinds of dependency parsing algorithms, namely transition-based parsing algorithms (McDonald and Nivre, 2007; Kiperwasser and Goldberg, 2016; Ballesteros et al., 2015) and graph-based ones (McDonald and Pereira, 2006; Zhang and Clark, 2008; Galley and Manning, 2009; Zhang et al., 2017). In graph-based parsing, a model is trained to score all possible dependency arcs between words, and decoding algorithms are subsequently applied to find the most likely dependency graph. The Eisner algorithm (Eisner, 1996) and the Chu-Liu-Edmonds algorithm are often used for finding the most likely dependency trees, whereas the AD³ algorithm (Martins et al., 2011) is used for finding SDP graphs that form DAGs in Peng et al. (2017) and Peng et al. (2018). During training, the loss is computed after decoding, leading the models to reflect a structured loss. The advantage of graph-based algorithms is that there is no real error propagation, to the extent that the decoding algorithms are global inference algorithms, but this also means that reinforcement learning is not obviously applicable to graph-based parsing. In transition-based parsing, the model is typically taught to follow a gold transition path to obtain a perfect dependency graph during training. This training paradigm has the limitation that the model only ever gets to see states that are on gold transition paths, and error propagation is therefore likely to happen when the parser predicts wrong transitions leading to unseen states (McDonald and Nivre, 2007; Goldberg and Nivre, 2013).
There have been several attempts to train transition-based parsers with reinforcement learning: Zhang and Chan (2009) applied SARSA (Baird III, 1999) to an Arc-Standard model, using SARSA updates to fine-tune a model that was pre-trained using a feed-forward neural network. Fried and Klein (2018), more recently, presented experiments with applying policy gradient training to several constituency parsers, including the RNNG transition-based parser (Dyer et al., 2016).
In their experiments, however, the models trained with policy gradient did not always perform better than the models trained with supervised learning. We hypothesize this is due to credit assignment being difficult in transition-based parsing. Iterative refinement approaches have been proposed in the context of sentence generation (Lee et al., 2018). Our proposed model explores multiple transition paths at once and avoids making risky decisions in the initial transitions, in part inspired by such iterative refinement techniques. We also pre-train our model with supervised learning to avoid sampling from irrelevant states at the early stages of policy gradient training.
Several models have been presented for DAG parsing (Sagae and Tsujii, 2008; Ribeyre et al., 2014; Tokgöz and Gülsen, 2015; Hershcovich et al., 2017). Wang et al. (2018) proposed a similar transition-based parsing model for SDP; they modified the possible transitions of the Arc-Eager algorithm (Nivre and Scholz, 2004b) to create multi-headed graphs. We are, to the best of our knowledge, the first to explore reinforcement learning for DAG parsing.

Iterative Predicate Selection
We propose a new semantic dependency parsing algorithm based on the head-selection algorithm for syntactic dependency parsing (Zhang et al., 2017). Head selection iterates over sentences, fixing the head of a word w in each iteration, ignoring w in future iterations. This is possible for dependency parsing because each word has a unique head word, including the root of the sentence, which is attached to an artificial root symbol. However, in SDP, words may attach to multiple head words or semantic predicates, whereas other words may not attach to any semantic predicates. Thus, we propose an iterative predicate selection (IPS) parsing algorithm, as a generalization of head selection in SDP.
The proposed algorithm is formalized as follows. First, we define transition operations for all words in a sentence. For the $i$-th word $w_i$ in a sentence, the model selects one transition $t_i^\tau$ from the set of possible transitions $T_i^\tau$ for each transition time step $\tau$. Generally, the possible transitions $T_i$ for the $i$-th word are expressed as follows:

$$T_i = \{\mathrm{NULL}, \mathrm{ARC}_{i,\mathrm{ROOT}}, \mathrm{ARC}_{i,1}, \dots, \mathrm{ARC}_{i,n}\}$$

where $\mathrm{ARC}_{i,j}$ is a transition to create an arc from the $j$-th word to the $i$-th word, encoding that the semantic predicate $w_j$ takes $w_i$ as a semantic argument. NULL is a special transition that does not create an arc. The set of possible transitions $T_i^\tau$ for the $i$-th word at time step $\tau$ is the subset of possible transitions $T_i$ that satisfy two constraints: (i) no arcs can be reflexive, i.e., $w_i$ cannot be an argument of itself, and (ii) the new arc must not be a member of the set of arcs $A^\tau$ comprising the partial parse graph $y^\tau$ constructed at time step $\tau$. Therefore, we obtain:

$$T_i^\tau = T_i \setminus \left(\{\mathrm{ARC}_{i,i}\} \cup A^\tau\right)$$

The model then creates semantic dependency arcs by iterating over the sentence as follows:

1. For each word $w_i$, select a head arc from $T_i^\tau$.
2. Update the partial semantic dependency graph.
3. If all words select NULL, the parser halts. Otherwise, go to 1.

This algorithm can introduce cycles. However, cycles were extremely rare in our experiments, and can be avoided by simple heuristics during decoding. We discuss this issue in the Supplementary Material, §A.1.

Fig. 2 shows the transitions of the IPS algorithm during the DM parsing of the sentence "The man went back and spoke to the desk clerk." In this case, there are several paths from the initial state to the final parsing state, depending on the order in which the arcs are created. This is known as the non-deterministic oracle problem (Goldberg and Nivre, 2013). In IPS parsing, some arcs are easy to predict; others are very hard to predict. Long-distance arcs are generally difficult to predict, but they are very important for downstream applications, including reordering for machine translation (Xu et al., 2009). Since long-distance arcs are harder to predict, and transition-based parsers are prone to error propagation, several easy-first strategies have been introduced, both in supervised (Goldberg and Elhadad, 2010; Ma et al., 2013) and unsupervised dependency parsing (Spitkovsky et al., 2011), to prefer some paths over others in the face of the non-deterministic oracle problem. Easy-first principles have also proven effective with sequence taggers (Tsuruoka and Tsujii, 2005; Martins and Kreutzer, 2017). In this paper, we take an arguably more principled approach, learning a strategy for choosing some transition paths over others using reinforcement learning. We observe, however, that the learned strategies exhibit a clear easy-first preference.
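The three-step loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `choose` is a hypothetical callback standing in for the neural scorer, `gold_oracle` is a toy oracle that attaches gold heads in index order, index 0 is used for ROOT, and `max_steps` is an illustrative cap rather than a value from the paper.

```python
NULL = None   # the NULL transition: create no arc
ROOT = 0      # index 0 stands for the artificial ROOT; words are 1..n

def possible_transitions(i, n, arcs):
    """T_i^tau: NULL plus every head j whose arc to word i is neither
    reflexive (j == i) nor already part of the partial graph."""
    heads = [j for j in range(ROOT, n + 1) if j != i and (j, i) not in arcs]
    return [NULL] + heads

def ips_parse(n, choose, max_steps=10):
    """Run the IPS loop on a sentence of n words.

    choose(i, candidates, arcs) is a hypothetical callback returning one
    element of `candidates`; the neural model plays this role in the paper.
    """
    arcs = set()                          # arcs are (head, argument) pairs
    for _ in range(max_steps):
        # Step 1: every word picks one transition from its current set.
        picks = {i: choose(i, possible_transitions(i, n, arcs), arcs)
                 for i in range(1, n + 1)}
        # Step 3: halt when every word selects NULL.
        if all(head is NULL for head in picks.values()):
            break
        # Step 2: add the newly selected arcs to the partial graph.
        arcs |= {(head, i) for i, head in picks.items() if head is not NULL}
    return arcs

def gold_oracle(gold):
    """Toy oracle: attach the lowest-indexed gold head that is still missing."""
    def choose(i, candidates, arcs):
        remaining = [h for h in candidates if h is not NULL and (h, i) in gold]
        return remaining[0] if remaining else NULL
    return choose
```

With an invented gold graph in which word 2 has two heads, the loop needs two arc-creating iterations before every word selects NULL, which is the multi-head behaviour that plain head selection cannot express.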

Neural Model
Fig. 3 shows the overall neural network. It consists of an encoder for input sentences and partial SDP graphs, as well as a multi-layered perceptron (MLP) for the semantic head selection of each word.
Sentence encoder We employ bidirectional long short-term memory (BiLSTM) layers for encoding words in sentences. A BiLSTM consists of two LSTMs that read the sentence forward and backward, and concatenates their outputs before passing them on. For a sequence of tokens $[w_1, \dots, w_n]$, the inputs for the encoder are words, POS tags and lemmas. Each of them is mapped to a $p$-dimensional embedding vector in a look-up table. These vectors are then concatenated to form $3p$-dimensional vectors and used as the input of the BiLSTMs. We denote the mapping function of tokens into $3p$-dimensional vectors as $u(w_*)$ for later use. Finally, we obtain the hidden representations of all words $[h(w_1), \dots, h(w_n)]$ from the three-layer stacked BiLSTMs. We also use special embeddings $h_{\mathrm{NULL}}$ for the NULL transition and $h_{\mathrm{ROOT}}$ for the ROOT of the sentence.

Encoder of partial SDP graphs
The model updates the partial SDP graph at each time step of the parsing procedure. The SDP graph $y^\tau$ at time step $\tau$ is stored in a semantic dependency matrix $G^\tau \in \{0, 1\}^{n \times (n+1)}$ for a sentence of $n$ words. The rows of the matrix $G$ represent arguments and the columns represent head candidates, including the ROOT of the sentence, which is represented by the first column of the matrix. For each transition for a word, the model fills in one cell in a row, if the transition is not NULL. In the initial state, all cells in $G$ are 0. A cell $G[i, j]$ is updated to 1 when the model predicts that the $(i-1)$-th word is an argument of the $j$-th word, or of ROOT when $j = 0$. We convert the semantic dependency matrix $G$ into a rank-three tensor $G' \in \mathbb{R}^{n \times (n+1) \times p}$ by replacing its elements with embeddings of tokens $u(w_*)$, where $g_{ij} \in G$ and $g'_{ij} \in G'$. $g'_{i*}$ contains the representations of the semantic predicates for the $i$-th word in the partial SDP graph. We use a single-layer BiLSTM to encode the semantic predicates $g'_{i*}$ of each word; see Fig. 3 (b). Finally, we concatenate the hidden representation of the NULL transition and obtain the partial SDP graph representation $G^\tau$ of time step $\tau$. We also employ dependency flags that directly encode the semantic dependency matrix and indicate whether the corresponding arcs have already been created or not. Flag representations $F$ are also rank-three tensors, consisting of two hidden representations: $f_{\mathrm{ARC}}$ for $g_{i,j} = 1$ and $f_{\mathrm{NOARC}}$ for $g_{i,j} = 0$, depending on $G$. $f_{\mathrm{ARC}}$ and $f_{\mathrm{NOARC}}$ are $q$-dimensional vectors. Then we concatenate the hidden representation of the NULL transition and obtain the flag representation $F^\tau$. We do not use BiLSTMs to encode these flags. These flags also reflect the current state of the semantic dependency matrix.
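The bookkeeping for the semantic dependency matrix and the dependency flags can be illustrated with plain Python lists. The sketch below assumes 0-based rows for argument words, column 0 for ROOT, and 1-based word positions in the function arguments; these indexing conventions and the function names are assumptions made for the example, and flag embeddings are represented by their names rather than by vectors.

```python
def empty_matrix(n):
    """G in {0,1}^{n x (n+1)}: one row per argument word and one column per
    head candidate, with column 0 reserved for ROOT. All cells start at 0."""
    return [[0] * (n + 1) for _ in range(n)]

def apply_transition(G, word, head):
    """Record that `head` (0 = ROOT, 1..n = words) takes the 1-based `word`
    as an argument. A NULL transition (head is None) leaves G unchanged."""
    if head is not None:
        G[word - 1][head] = 1

def flags(G):
    """Name the flag embedding that each cell would select."""
    return [["f_ARC" if cell else "f_NOARC" for cell in row] for row in G]
```

Because each transition writes at most one cell of a word's row, the matrix after step $\tau$ records exactly the arcs in $A^\tau$, and the flags can be derived from it at any time.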

Predicate selection model
The semantic predicate selection model comprises an MLP with inputs from the encoder of the sentence and the partial semantic dependency graph: the sentence representation $H$, the SDP representation $G^\tau$, and the dependency flag $F^\tau$. They are rank-three tensors and are concatenated along the third axis. From this concatenated input, the MLP computes the score $s_{ij}$ of the $i$-th word and the $j$-th transition.
For the MLP, we use a concatenation of outputs from three different networks: a three-layer MLP, a two-layer MLP and a matrix multiplication with bias terms.
The transition probabilities $p_i(t_j)$ of selecting a semantic head word $w_j$ are obtained by applying a softmax to these scores, and are defined for each word $w_i$ in a sentence.
For supervised learning, we employ a cross entropy loss over these transition probabilities for the partial SDP graph $G^\tau$ at time step $\tau$.
In this loss, $l_i$ is a gold transition label for the $i$-th word and $\theta$ represents all trainable parameters. Note that this supervised training regime, as mentioned above, does not have a principled answer to the non-deterministic oracle problem (Goldberg and Nivre, 2013), and samples transition paths randomly from those consistent with the gold annotations to create transition labels.
Labeling model We also develop a semantic dependency labeling neural network. This neural network consists of three-layer stacked BiLSTMs and an MLP for predicting a semantic dependency label between words and their predicates. We use an MLP that is a sum of the outputs from a three-layer MLP, a two-layer MLP and a matrix multiplication. Note that the output dimension of this MLP is the number of semantic dependency labels. The input of this MLP is the hidden representations of a word $i$ and its predicate $j$: $[h_i, h_j]$, extracted from the stacked BiLSTMs. From this input, the MLP predicts the score $s_{ij}(l)$ of the label $l$ for the arc from predicate $j$ to word $i$.
We minimize the softmax cross entropy loss using supervised learning.
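To make the relation between scores, transition probabilities and the supervised loss concrete, here is a small pure-Python sketch. It assumes that transitions outside $T_i^\tau$ are masked out before normalization and that the loss sums the negative log-probability of each word's gold transition; both are assumptions about details the text does not spell out, and the function names are illustrative.

```python
import math

def transition_probs(scores, allowed):
    """Softmax over the scores of the transitions that are still allowed.
    Disallowed transitions receive probability 0 (assumed masking)."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(all_scores, all_allowed, gold_labels):
    """Sum over words of -log p_i(gold transition of word i)."""
    loss = 0.0
    for scores, allowed, gold in zip(all_scores, all_allowed, gold_labels):
        loss -= math.log(transition_probs(scores, allowed)[gold])
    return loss
```

Subtracting the maximum score before exponentiating is the usual numerical-stability trick and does not change the resulting probabilities.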

Reinforcement Learning
Policy gradient Reinforcement learning is a method for learning to iteratively act according to a dynamic environment in order to optimize future rewards. In our context, the agent corresponds to the neural network model predicting the transition probabilities $p_i(t_j^\tau)$ that are used in the parsing algorithm. The environment includes the partial SDP graph $y^\tau$, and the rewards $r^\tau$ are computed by comparing the predicted parse graph to the gold parse graph $y^g$.
We adapt a variation of the policy gradient method (Williams, 1992) for IPS parsing. Our objective function is to maximize the rewards

$$J(\theta) = \mathbb{E}_\pi\left[r_i^\tau\right] \qquad (8)$$

and the transition policy for the $i$-th word is given by the probability of the transitions, $\pi \sim p_i(t_j^\tau \mid y^\tau)$. The gradient of Eq. 8 is given as follows:

$$\nabla J(\theta) = \mathbb{E}_\pi\left[r_i^\tau \, \nabla_\theta \log p_i(t_j^\tau \mid y^\tau)\right]$$

When we compute this gradient, given a policy $\pi$, we approximate the expectation $\mathbb{E}_\pi$ for any transition sequence with a single transition path $\mathbf{t}$ that is sampled from policy $\pi$:

$$\nabla J(\theta) \approx \sum_{t_j^\tau \in \mathbf{t}} r_i^\tau \, \nabla_\theta \log p_i(t_j^\tau \mid y^\tau)$$

We summarize our policy gradient learning algorithm for SDP in Algorithm 1. For time step $\tau$, the model samples one transition $t_j^\tau$, selecting the $j$-th word as a semantic head word of the $i$-th word, from the set of possible transitions $T_i$, following the transition probability of $\pi$. After sampling $t_j^\tau$, the model updates the SDP graph to $y^{\tau+1}$ and computes the reward $r_i^\tau$. When NULL becomes the most likely transition for all words, or the time step exceeds the maximum number of time steps allowed, we stop. For each time step, we then update the parameters of our model with the gradients computed from the sampled transitions and their rewards. Note how the cross entropy loss and the policy gradient loss are similar, if we do not sample from the policy $\pi$, and rewards are non-negative. However, these are the important differences between supervised learning and reinforcement learning: (1) Reinforcement learning uses sampling of transitions. This allows our model to explore transition paths that supervised models would never follow. (2) In supervised learning, decisions are independent of the current time step $\tau$, while in reinforcement learning, decisions depend on $\tau$. This means that the $\theta$ parameters are updated after the parser finishes parsing the input sentence. (3) Loss must be non-negative in supervised learning, while rewards can be negative in reinforcement learning.

Table 1: Rewards in SDP policy gradient.

Reward | Transitions
positive | (1) The model creates a new correct arc from a semantic predicate to the $i$-th word.
positive | (2) The first time the model chooses the NULL transition after all gold arcs to the $i$-th word have been created, and no wrong arcs to the $i$-th word have been created.
negative | (3) The model creates a wrong arc from a semantic predicate candidate to the $i$-th word.
$r_i^\tau = 0$ | (4) All other transitions.
In general, the cross entropy loss is able to optimize for choosing good transitions given a parser configuration, while the policy gradient objective function is able to optimize the entire sequence of transitions drawn according to the current policy. We demonstrate the usefulness of reinforcement learning in our experiments below.
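The contrast between the two objectives can be made concrete at the level of a single word's transition scores. The sketch below samples a transition from a softmax policy and computes the gradient of `reward * log p(action)` with respect to the scores, using the standard identity that this gradient equals `reward * (one_hot(action) - softmax(scores))`. It is a toy illustration with hypothetical function names, not the training code of the paper, and it stops at the scores instead of back-propagating into network parameters.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_transition(probs, rng):
    """Draw one transition index according to the policy probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def score_gradient(scores, action, reward):
    """Gradient of reward * log p(action) with respect to the scores:
    reward * (one_hot(action) - softmax(scores))."""
    probs = softmax(scores)
    return [reward * ((1.0 if j == action else 0.0) - p)
            for j, p in enumerate(probs)]
```

With a reward of 1 and the gold transition as the action, this is the negated cross entropy gradient; a negative reward flips the sign and pushes probability mass away from the sampled transition.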
Rewards for SDP We also introduce intermediate rewards, given during parsing, at different time steps. The reward $r_i^\tau$ of the $i$-th word is determined as shown in Table 1. The model gets a positive reward for creating a new correct arc to the $i$-th word, or if the model for the first time chooses a NULL transition after all arcs to the $i$-th word are correctly created. The model gets a negative reward when the model creates wrong arcs. When our model chooses NULL transitions for the $i$-th word before all gold arcs are created, the reward $r_i^\tau$ becomes 0.
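The reward scheme described here can be written as a small function. In this sketch the magnitudes `pos=1.0` and `neg=-1.0` are illustrative assumptions (the text only states that the rewards are positive and negative), and the argument names are hypothetical.

```python
def reward(head, gold_heads, created, null_rewarded, pos=1.0, neg=-1.0):
    """Reward r_i^tau for one transition of a word.

    head:          chosen head index, or None for the NULL transition
    gold_heads:    set of gold heads of the word
    created:       set of heads attached to the word before this transition
    null_rewarded: True if the NULL bonus was already given for this word
    pos, neg:      assumed reward magnitudes, for illustration only
    """
    if head is None:
        # Case (2): first NULL after all gold arcs and no wrong arcs.
        complete = created == gold_heads
        return pos if complete and not null_rewarded else 0.0   # else case (4)
    if head in gold_heads and head not in created:
        return pos                                              # case (1)
    return neg                                                  # case (3)
```

Choosing NULL too early yields 0 rather than a penalty, so the policy is not punished for deferring a hard arc to a later transition.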

Implementation Details
This section includes details of our implementation. We use 100-dimensional, pre-trained GloVe (Pennington et al., 2014) word vectors. Words or lemmas in the training corpora that do not appear in the pre-trained embeddings are associated with randomly initialized vector representations. Embeddings of POS tags and other special symbols are also randomly initialized. We apply Adam as our optimizer. Preliminary experiments showed that mini-batching led to a degradation in performance.
When we apply policy gradient, we pre-train our model using supervised learning. We then use policy gradient for task-specific fine-tuning of our model. We find that updating the parameters of the BiLSTM and word embeddings during policy gradient makes training quite unstable. Therefore we fix the BiLSTM parameters during policy gradient. In our multi-task learning set-up, we apply multi-task learning of the shared stacked BiLSTMs (Søgaard and Goldberg, 2016; Hashimoto et al., 2017) in supervised learning. We use task-specific MLPs for the three different linguistic formalisms: DM, PAS and PSD. We train the shared BiLSTM using multi-task learning beforehand, and then we fine-tune the task-specific MLPs with policy gradient. We summarize the rest of our hyper-parameters in Table 2.

Experiments
We use the SemEval 2015 Task 18 (Oepen et al., 2015) SDP dataset for evaluating our model. The training corpus contains 33,964 sentences from the WSJ corpus; the development and in-domain test sets were taken from the same corpus and consist of 1,692 and 1,410 sentences, respectively. The out-of-domain test set of 1,849 sentences is drawn from the Brown corpus. All sentences are annotated with three semantic formalisms: DM, PAS and PSD. We use the standard splits of the datasets (Almeida and Martins, 2015; Du et al., 2015). Following standard evaluation practice in semantic dependency parsing, all scores are micro-averaged F-measures (Peng et al., 2017; Wang et al., 2018) with labeled attachment scores (LAS). The system we propose is the IPS parser trained with a multi-task objective and fine-tuned using reinforcement learning. This is referred to as IPS+ML+RL in the results tables. To highlight the contributions of the various components of our architecture, we also report ablation scores for the IPS parser with neither multi-task training nor reinforcement learning (IPS), with multi-task training (IPS+ML) and with reinforcement learning (IPS+RL). At inference time, we apply heuristics to avoid predicting cycles during decoding (Camerini et al., 1980); see Supplementary Material, §A.1. This improves scores by 0.1% or less, since predicted cycles are extremely rare. We compare our proposed system with three state-of-the-art SDP parsers: Freda3 of Peng et al. (2017), the ensemble model in Wang et al. (2018) and Peng et al. (2018). Peng et al. (2018) use syntactic dependency trees, while we do not use them in our models.
The results of our experiments on the in-domain dataset are shown in Table 3. We observe that our basic IPS model achieves competitive scores in DM and PAS parsing. Multi-task learning of the shared BiLSTM (IPS+ML) leads to small improvements across the board, which is consistent with the results of Peng et al. (2017). The model trained with reinforcement learning (IPS+RL) performs better than the model trained by supervised learning (IPS). These differences are significant ($p < 10^{-3}$). Most importantly, the combination of multi-task learning and policy gradient-based reinforcement learning (IPS+ML+RL) achieves the best results among all IPS models and the previous state-of-the-art models, by some margin. We also obtain similar results for the out-of-domain datasets, as shown in Table 4. All improvements with reinforcement learning are also statistically significant ($p < 10^{-3}$).

Table 5: Evaluation of our parser when not using lemma embeddings (for a more direct comparison with Freda3), on in-domain test datasets. ‡ on +RL models indicates that the scores are statistically significant at $p < 10^{-3}$ relative to their non-RL counterparts.

Clause excerpts shown in Figure 5:
"the position of chief financial officer, who will be hired from within the agency."
"Within weeks the unfolding Iran-Contra scandal took away Mr. Noriega's insurance policy."
"Morgan will help evaluate DFC's position and help determine alternatives."
"The U.S. Commerce Department reported a $10.77 billion deficit in August compared with ..."
"lead the industry with a strong sales performance in the human and animal health-products segment."
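For readers unfamiliar with the metric, a micro-averaged labeled F-measure pools arc counts over all sentences before computing precision and recall. The sketch below assumes each graph is a set of `(head, argument, label)` triples; it is a simplified illustration and not the official shared-task scorer, which may treat details such as top nodes differently.

```python
def labeled_f1(gold_graphs, predicted_graphs):
    """Micro-averaged labeled F-measure over (head, argument, label) triples."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_graphs, predicted_graphs):
        correct += len(gold & pred)   # arcs with matching head, argument, label
        n_gold += len(gold)
        n_pred += len(pred)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)
```

An arc with the right head but the wrong label counts as both a precision and a recall error, which is why labeling quality directly affects the reported scores.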
Evaluating Our Parser without Lemma Since our baseline (Peng et al., 2017) relies on neither lemma nor syntactic information, we also compare IPS+ML and IPS+ML+RL trained with word and POS embeddings, but without lemma embeddings. The results are given in Table 5. We see that our model is still better on average and achieves better performance on all three formalisms. We also notice that the lemma information does not improve the performance in the PAS formalism.
Effect of Reinforcement Learning Fig. 4 shows the distributions of the length of the created arcs in the first, second, third and fourth transitions for all words, in the various IPS models in the development corpus. These distributions show the length of the arcs the models tend to create in the first and later transitions. Since long arcs are harder to predict, an easy-first strategy would typically amount to creating short arcs first.
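Distributions of this kind can be computed from logged derivations. The sketch below assumes each derivation is a list of `(step, head, argument)` triples, with `step` the 1-based transition in which the arc was created, and measures length as the absolute difference of word positions; the data layout is a hypothetical choice, and arcs from ROOT would need separate handling that is not shown.

```python
from collections import Counter, defaultdict

def arc_length_distributions(derivations, max_step=4):
    """Return {step: {arc length: relative frequency}} for steps 1..max_step."""
    counts = defaultdict(Counter)
    for derivation in derivations:
        for step, head, arg in derivation:
            if step <= max_step:
                counts[step][abs(head - arg)] += 1
    return {step: {length: c / sum(counter.values())
                   for length, c in counter.items()}
            for step, counter in counts.items()}
```

Under an easy-first strategy, the distribution for step 1 should concentrate on length 1, with longer arcs appearing mainly in later steps.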
In supervised learning (IPS+ML), there is a slight tendency to create shorter arcs first, but while the ordering is relatively consistent, the differences are small. This is in sharp contrast with the distributions we see for our policy gradient parser (IPS+ML+RL). Here, across the board, it is very likely that the first transition connects neighboring words, and very unlikely that neighboring words are connected at later stages. This suggests that reinforcement learning learns an easy-first strategy of predicting short arcs first. Note that unlike easy-first algorithms in syntactic parsing (Goldberg and Nivre, 2013), we do not hardwire an easy-first strategy into our parser; rather, we learn it from the data, because it optimizes our long-term rewards. We present further analyses, including analyses on WSJ syntactic dependency trees, in Appendix A.2.
Fig. 5 shows five sentence excerpts from the development corpus, and the order in which arcs are created. We again compare the model trained with supervised learning (IPS+ML, notated as SL here) to the model with reinforcement learning (IPS+ML+RL, notated as RL here). In examples (a) and (b), the RL model creates arcs inside noun phrases first and then creates arcs to the verb. The SL model, in contrast, creates arcs in inconsistent orders. There are many similar examples in the development data. In clause (c), for example, it seems that the RL model follows a grammatical ordering, while the SL model does not. In clause (d), it seems that the RL model first resolves arcs from modifiers, in "chief financial officer", then creates an arc from the adjective phrase ", who will be hired", and finally creates an arc from the external phrase "the position of". Note that both the SL and RL models make an arc from "of" instead of the annotated word "position" in the phrase "the position of". In clause (e), the RL model resolves the arcs in the noun phrase "a strong sales performance" and then resolves arcs from the following prepositional phrase. Finally, the RL model resolves the arc from the word "with", which is the head word in the syntactic dependency tree. In examples (d) and (e), the RL model closely follows a syntactic order that is not given at any stage of training or parsing.

Conclusion
We propose a novel iterative predicate selection (IPS) parsing model for semantic dependency parsing. We apply multi-task learning to learn general representations of parser configurations, and use reinforcement learning for task-specific fine-tuning. In our experiments, our multi-task reinforcement IPS model achieves a new state of the art for three SDP formalisms. Moreover, we show that fine-tuning with reinforcement learning learns an easy-first strategy and some syntactic features.

Table 6: The number of graphs and the percentages of graphs with cycles when not using heuristics to avoid cycles. Results on the development sets.

We also note that cycles are mostly small and local, so they do not affect other arc structures.

A.1 DAG formalism
Our proposed parsing algorithm can introduce cycles in the resulting graphs. However, they are few in our experiments. Table 6 shows the number of graphs with cycles and their relative frequency in our predictions for the development sets. Here we propose additional decoding rules to strictly prevent these cycles. Our decoding is iterative. Therefore, when the new arcs $A$ are added to the existing partial SDP graph $y^\tau$, the union $y^\tau \cup A$ must not contain cycles. If it does, we swap arcs in $A$ with other arcs that do not lead to cycles, in the following way:
1. Search for arcs that make cycles $C$ in $y^\tau \cup A$.
2. If there are no cycles (the most common case), return $y^\tau \cup A$ as the partial SDP graph and go to the next transition.
3. From the arcs in $C \cap A \setminus B$, choose the arc $a$ that has the lowest softmax probability and add it to the buffer $B$.
4. For the removed arc $a$, choose another arc (including NULL) that has the next-largest probability in the softmax output from which $a$ was chosen, and add it to the buffer $B'$.
5. Check whether $y^\tau \cup A \setminus B$ has cycles. If it has none, add $B'$ to $A$ and go to 1. Otherwise, go to 3.
The resulting graphs do not contain cycles and form DAGs. This operation slightly improves the performance, but does not affect the results much, because cycles are rare.
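The idea of falling back to the next most probable transition can be illustrated with a simplified greedy variant. Unlike the buffer-based procedure above, this sketch adds new arcs one word at a time, most confident word first, and lets a word fall back to its next-ranked transition whenever its choice would close a cycle. The function names and the `ranked` data layout are assumptions made for the example.

```python
def creates_cycle(arcs, head, arg):
    """True if adding the arc head -> arg closes a directed cycle,
    i.e. if `head` is already reachable from `arg`."""
    stack, seen = [arg], set()
    while stack:
        node = stack.pop()
        if node == head:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(a for h, a in arcs if h == node)
    return False

def add_arcs_without_cycles(arcs, ranked):
    """ranked: {argument: [(probability, head or None), ...]} sorted by
    decreasing probability, where None is the NULL transition.
    Returns the extended arc set, which stays acyclic if `arcs` was acyclic."""
    arcs = set(arcs)
    order = sorted(ranked, key=lambda i: -ranked[i][0][0])
    for arg in order:
        for _, head in ranked[arg]:
            if head is None:          # NULL: this word adds no arc
                break
            if not creates_cycle(arcs, head, arg):
                arcs.add((head, arg))
                break
    return arcs
```

Because every arc is checked against the graph built so far, the result is a DAG by construction, at the cost of depending on the order in which words are processed.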

A.2 Arc Length Analysis on WSJ Syntactic Trees
As a supplementary experiment, we wanted to explore the relation between the learned easy-first strategies and syntactic dependency lengths. The intuition was that attachments of length $n$ that are consistent with syntactic dependencies may be easier than attachments of length $n$ that are not consistent with syntactic dependencies, regardless of $n$. We therefore provide a quantitative analysis of the SDP parsing order using syntactic dependency trees of the WSJ corpus as a reference. The syntactic dependency trees are extracted from WSJ constituency trees with the LTH Penn Converter. For this, we consider the subset of undirected SDP arcs that match syntactic dependencies, i.e., where the semantic predicate words and the semantic argument words are in ascendant or descendant relations in the syntactic dependency trees. The directed SDP arcs that agree with the syntactic dependencies in direction, i.e., such that the syntactic head is also the semantic predicate, are said to have positive distance, as in the analysis above, while SDP arcs that are opposite have negative distance, because they go against the syntactic order. We show the distributions of length of arcs to semantic head words that are created from the 1st to the 4th transitions from semantic argument words. We consider words that have four or more semantic head words; this is because the models finish transitions for words that have fewer semantic head words in the early transitions.
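The signed distance used in this analysis can be sketched as follows. The example assumes the syntactic tree is given as a list `heads`, where `heads[k]` is the head of word `k`, index 0 is ROOT and `heads[0]` is unused; arcs whose endpoints are not in an ascendant or descendant relation fall outside the analysed subset and yield `None`. The representation and function names are illustrative.

```python
def is_ancestor(heads, anc, node):
    """True if `anc` lies on the path from `node` up to ROOT (index 0)."""
    while node != 0:
        node = heads[node]
        if node == anc:
            return True
    return False

def signed_length(heads, predicate, argument):
    """Arc length with a sign relative to the syntactic tree: positive if the
    semantic predicate is a syntactic ancestor of the argument, negative if
    it is a descendant, None if the two words are unrelated in the tree."""
    length = abs(predicate - argument)
    if is_ancestor(heads, predicate, argument):
        return length
    if is_ancestor(heads, argument, predicate):
        return -length
    return None
```

Histograms of these signed lengths per transition step give plots of the kind described for Fig. 6, with the positive side collecting arcs that agree with the syntactic direction.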
Fig. 6 presents the distributions of arc length from the 1st to the 4th transitions. These graphs are similar to Fig. 4, but only represent a subset of arcs, and now with a distinction between positive and negative arcs, depending on whether they agree with the syntactic analysis. We present the DM and PAS graphs, because we had too few examples for PSD. The graph (a) of supervised learning shows the same tendency from the 1st to the 4th transitions. The graph (b) of reinforcement learning shows a different picture: for PAS, the IPS+ML+RL model tends to resolve the short arcs that are consistent with the syntactic ones first (the blue line is higher in the positive span). In DM, however, the model tends to resolve the short arcs that are inconsistent with syntactic dependencies first.
Fig. 7 presents the averaged arc lengths at each transition step. The left column (a) is the average length of the created arcs at transition steps one to four. The right column (b) is the same relative to syntactic trees, using the same subsample as above and encoding disagreeing arcs with negative length values. In supervised learning, the averaged arc length does not vary much across transition steps, neither with respect to sentence positions nor with respect to syntactic trees. However, in reinforcement learning, the averaged arc length varies a lot across transition steps, relative to both sentence positions and agreement with syntactic headedness. The graphs in the (a) column suggest that the reinforcement learning model has a strong tendency to resolve adjacent arcs first. In the (b) column, we note that for reinforcement learning, the trajectories for DM and PAS are opposite. For PAS, early arcs are in line with syntactic dependencies, whereas for DM the opposite picture emerges.
Fig. 8 presents two example clauses with gold syntactic trees and partial SDP graphs, decorated with the parsing orders of our supervised baseline model (SL) and our reinforcement learning model (RL). RL resolves the adjacent words first, and the parsing orders of clauses (a) and (b) are consistent. The two PAS graphs and syntactic trees are relatively similar, but the arc directions are different, and the arcs from ROOT go to different words. The RL model prefers to create arcs that agree with the syntactic arcs first. We also note that the parsing orders of the SL models seem inconsistent across (a) and (b).

Figure 1: Semantic dependency parsing arcs of DM, PAS and PSD formalisms.

Figure 2: Construction of semantic dependency arcs (DM) in the IPS parsing algorithm. Parsing begins from the initial state and proceeds to the final state following one of several paths. In the left path, the model resolves adjacent arcs first. In contrast, in the right path, distant arcs that rely on the global structure are resolved first.

Figure 3: Our network architecture: (a) The encoder of the sentence into the hidden representations $h_i$ and $h_j$, and the MLP for the transition probabilities. (b) The encoder of the semantic dependency matrix for the representation $h^d_{ij}$. The MLP also takes the arc flag representation $f_{ij}$ (see text for explanation).

Figure 4: Arc length distributions: (a) Supervised learning (IPS+ML). (b) Reinforcement learning (IPS+ML+RL). The four lines correspond to the first to fourth transitions in the derivations.

Figure 5: Examples of clauses parsed with the DM formalism. The underlined words are the semantic predicates of the argument words in rectangles in the annotation. The superscript numbers (SL) give the order in which IPS+ML creates the arcs, and the subscript numbers (RL) give the order for IPS+ML+RL. For clause (a), we show a partial SDP graph to visualize the SDP arcs.

Figure 6: Arc length distributions on the WSJ syntactic dependency trees. (a) The length distribution of arcs with supervised learning (IPS+ML). (b) The length distribution of arcs with reinforcement learning (IPS+ML+RL). The four lines correspond to the first four transitions in the derivations. The horizontal axis corresponds to the length of the created arcs. The right side of the horizontal axis corresponds to arcs from ascendants, while the left side corresponds to arcs from descendants. The black arrows in the bottom figures illustrate how the distributions change from the 1st transition to later ones. They point in opposite directions for DM and PAS.

Figure 7: The average arc length comparisons of the supervised learning model (IPS+ML) and the reinforcement learning model (IPS+ML+RL). (a) Average arc length in sentences. The horizontal axis is the length of arcs in terms of words, as in Fig. 4. The vertical axis corresponds to the first four transitions of the models. (b) Average arc length in dependency trees. The horizontal axis is the same as in Fig. 6.
Algorithm 1: Policy gradient learning for the IPS algorithm
Input: Sentence $x$ with an empty parsing tree $y^0$.
Let a time step $\tau = 0$ and finish flags $f_* = 0$.
for $0 \le \tau <$ the number of maximum iterations do
    Compute $\pi^\tau$ and argmax transitions $t_i = \arg\max \pi^\tau_i$.
    Update the parsing tree $y^\tau$ to $y^{\tau+1}$.
    Compute a new reward $r^\tau_i$ from $y^\tau$, $y^{\tau+1}$ and $y^g$.
end for
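The loop listed in Algorithm 1 can be sketched in Python as below. This covers only the rollout steps that appear in the listing (compute the policy, take argmax transitions, update the graph, compute rewards); it does not perform any gradient update. Everything concrete here is an assumption for illustration: the `policy` callback signature, the use of `None` for a NULL transition that sets a word's finish flag, the early exit once all words are finished, and the placeholder reward of +1 for an arc found in the gold graph and -1 otherwise. The reward actually used in the paper is defined in the main text and may differ.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Arc = Tuple[int, int]  # (predicate index, argument index)
Policy = Callable[[FrozenSet[Arc], int], Dict[Optional[int], float]]


def greedy_rollout(
    n_words: int,
    policy: Policy,
    gold: Set[Arc],
    max_iterations: int,
) -> Tuple[Set[Arc], List[Tuple[int, Dict[int, float]]]]:
    """Run argmax transitions and record per-word rewards at each step."""
    graph: Set[Arc] = set()          # parsing graph, empty at the start
    finished = [False] * n_words     # finish flags, all unset at the start
    trajectory: List[Tuple[int, Dict[int, float]]] = []

    for step in range(max_iterations):
        if all(finished):            # assumed stopping condition
            break
        snapshot = frozenset(graph)  # every word sees the same current graph
        new_arcs: Set[Arc] = set()
        rewards: Dict[int, float] = {}
        for i in range(n_words):
            if finished[i]:
                continue
            probs = policy(snapshot, i)
            head = max(probs, key=probs.get)  # argmax transition
            if head is None:                  # NULL transition (assumed)
                finished[i] = True
                rewards[i] = 0.0
            else:
                arc = (head, i)
                new_arcs.add(arc)
                # Placeholder reward, not the paper's definition.
                rewards[i] = 1.0 if arc in gold else -1.0
        graph |= new_arcs            # move from the current graph to the next
        trajectory.append((step, rewards))
    return graph, trajectory
```

In a training setup, the recorded rewards would then weight the log-probabilities of the chosen transitions in a policy gradient update; that part is left out because the listing above does not specify it.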

Table 2: Hyper-parameters in our experiments.

Table 3: Labeled parsing performance on in-domain test data. Avg. is the micro-averaged score of the three formalisms. ‡ on a +RL model indicates that its score differs from that of its non-RL counterpart with statistical significance at $p < 10^{-3}$.

Table 4: Labeled parsing performance on out-of-domain test data. Avg. is the micro-averaged score of the three formalisms. ‡ on a +RL model indicates that its score differs from that of its non-RL counterpart with statistical significance at $p < 10^{-3}$.
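For readers unfamiliar with micro-averaging, the sketch below shows the usual way such an Avg. column is computed: counts are pooled across formalisms before precision, recall and F1 are calculated, so formalisms with more arcs weigh more. It assumes the reported scores are labeled F1 values; the function name and the count triples are hypothetical, and the numbers are made up rather than taken from the tables.

```python
from typing import Dict, Tuple

# (correct arcs, predicted arcs, gold arcs) per formalism
Counts = Tuple[int, int, int]


def micro_f1(counts: Dict[str, Counts]) -> float:
    """Micro-averaged F1: pool the counts first, then compute one F1."""
    correct = sum(c for c, _, _ in counts.values())
    predicted = sum(p for _, p, _ in counts.values())
    gold = sum(g for _, _, g in counts.values())
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / gold
    return 2 * precision * recall / (precision + recall)


# Made-up counts, for illustration only.
example_counts = {
    "DM": (90, 100, 100),
    "PAS": (45, 50, 60),
    "PSD": (30, 50, 40),
}
example_score = micro_f1(example_counts)
```

A macro-average would instead compute one F1 per formalism and take their unweighted mean, which generally gives a different value when the formalisms differ in size.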