Transition-based Semantic Dependency Parsing with Pointer Networks

Transition-based parsers implemented with Pointer Networks have become the new state of the art in dependency parsing, excelling in producing labelled syntactic trees and outperforming graph-based models in this task. In order to further test the capabilities of these powerful neural networks on a harder NLP problem, we propose a transition system that, thanks to Pointer Networks, can straightforwardly produce labelled directed acyclic graphs and perform semantic dependency parsing. In addition, we enhance our approach with deep contextualized word embeddings extracted from BERT. The resulting system not only outperforms all existing transition-based models, but also matches the best fully-supervised accuracy to date on the SemEval 2015 Task 18 datasets, previously achieved by state-of-the-art graph-based parsers.


Introduction
In dependency parsing, the syntactic structure of a sentence is represented by means of a labelled tree, where each word is forced to be attached exclusively to another word that acts as its head. In contrast, semantic dependency parsing (SDP) (Oepen et al., 2014) aims to represent binary predicate-argument relations between the words of a sentence, which requires producing a labelled directed acyclic graph (DAG): not only can semantic predicates have multiple or zero arguments, but words from the sentence can be attached as arguments to more than one head word (predicate), or they can be outside the SDP graph altogether (being neither a predicate nor an argument), as shown in the examples in Figure 1. Since existing dependency parsers cannot be directly applied, most SDP research has focused on adapting them to deal with the absence of single-head and connectedness constraints and to produce an SDP graph instead. As in dependency parsing, we can find two main families of approaches to efficiently generate accurate SDP graphs. On the one hand, graph-based algorithms have drawn more attention, since adapting them to this task is relatively straightforward. In particular, these globally optimized methods independently score arcs (or sets of arcs) and then search for a high-scoring graph by combining these scores. From one of the first graph-based DAG parsers, proposed by McDonald and Pereira (2006), to the current state-of-the-art models (He and Choi, 2019), different graph-based SDP approaches have been presented, providing accuracies above their main competitors: transition-based DAG algorithms.
A transition-based parser generates a sequence of actions to incrementally build a valid graph (usually from left to right). This is typically done by local, greedy prediction and can efficiently parse a sentence in a linear or quadratic number of actions (transitions); however, the lack of global inference makes these parsers more prone to suffer from error propagation: i.e., since transitions are sequentially and locally predicted, an erroneous action can affect future predictions, having a significant impact on long sentences and making them, to date, less appealing for SDP. In fact, in recent years only a few contributions, such as the system developed by Wang et al. (2018), present a purely transition-based SDP parser. It is more common to find hybrid systems that combine transition-based approaches with graph-based techniques to alleviate the impact of error propagation on accuracy (Du et al., 2015), but this penalizes the efficiency provided by transition-based algorithms.
Away from the current mainstream, we present a purely transition-based parser that directly generates SDP graphs without the need for any additional techniques. We rely on Pointer Networks (Vinyals et al., 2015) to predict transitions that can attach multiple heads to the same word and incrementally build a labelled DAG. This kind of neural network provides an encoder-decoder architecture that is capable of capturing information from the whole sentence and previously created arcs, alleviating the impact of error propagation and already showing remarkable results in transition-based dependency parsing (Ma et al., 2018). We further enhance our neural network with deep contextualized word embeddings extracted from the pre-trained language model BERT (Devlin et al., 2019).
The proposed SDP parser can process sentences in SDP treebanks (where structures are sparse DAGs with a low in-degree) in O(n^2 log n) time, or O(n^2) without cycle detection. This is more efficient than the current fully-supervised state-of-the-art system (O(n^3) without cycle detection), while matching its accuracy on the SemEval 2015 Task 18 datasets (Oepen et al., 2015). In addition, we also show that our novel transition-based model provides promising accuracies in the semi-supervised scenario, achieving some state-of-the-art results.

Related Work
An early approach to DAG parsing was implemented as a modification to a graph-based parser by McDonald and Pereira (2006). This produced DAGs using approximate inference by first finding a dependency tree, and then adding extra edges that would increase the graph's overall score. A few years later, this attempt was outperformed by the first transition-based DAG parser by Sagae and Tsujii (2008). They extended the existing transition system by Nivre (2003) to allow multiple heads per token. The resulting algorithm was not able to produce DAGs with crossing dependencies, requiring the pseudo-projective transformation by Nivre and Nilsson (2005) (plus a cycle removal procedure) as a post-processing stage.
More recently, there has been a predominance of purely graph-based DAG models since the SemEval 2015 Task 18 (Oepen et al., 2015). Almeida and Martins (2015) adapted the pre-deep-learning dependency parser by Martins et al. (2013) to produce SDP graphs. This graph-based parser encodes higher-order information with hand-crafted features and employs the AD3 algorithm (Martins et al., 2011) to find valid DAGs during decoding. This approach was later extended with BiLSTM-based feature extraction and multi-task learning: the three formalisms considered in the shared task were jointly learned to improve final accuracy.
Following the success of biaffine attention in graph-based dependency parsing, Dozat and Manning (2018) proposed minor adaptations to use this biaffine neural architecture to produce SDP graphs. To that end, they removed the maximum spanning tree algorithm (Chu and Liu, 1965; Edmonds, 1967) necessary for decoding well-formed dependency trees and simply kept those edges with a positive score. In addition, they trained the unlabelled parser with a sigmoid cross-entropy loss (instead of the original softmax one) in order to accept multiple heads.
The parser by Dozat and Manning (2018) was recently improved by two contributions. Firstly, second-order information was added to the score computation, applying either mean field variational inference or loopy belief propagation to decode the highest-scoring SDP graph. While significantly boosting parsing accuracy, the original O(n^2) runtime complexity increases to O(n^3) in the resulting SDP system. Secondly, He and Choi (2019) significantly improve the original parser's accuracy by not only using contextualized word embeddings extracted from BERT (Devlin et al., 2019), but also introducing contextual string embeddings (called Flair) (Akbik et al., 2018), a novel type of word vector representation based on character-level language modeling. These two extensions are currently the state of the art on the SemEval 2015 Task 18 in the fully-supervised and semi-supervised scenarios, respectively. Kurita and Søgaard (2019) have also recently proposed a complex approach that iteratively applies the syntactic dependency parser by Zhang et al. (2017), sequentially building a DAG structure. At each iteration, the graph-based parser selects the highest-scoring arcs, keeping the single-head constraint. The process ends when no arcs are added in the last iteration. The combination of partial parses results in an SDP graph. Since the graph is built in a sequential process, they use reinforcement learning to guide the model through more optimal paths. Multi-task learning is also added to boost final accuracy.
On the other hand, the use of transition-based algorithms in the SDP task had been less explored until very recently. Du et al. (2015) presented a voting-based ensemble of fourteen graph- and transition-based parsers. In their work, they noticed that individual graph-based models outperform transition-based algorithms, and therefore assigned higher weights to the former during voting. Among the transition systems used, we can find the one developed by Titov et al. (2009), which is not able to cover all SDP graphs.
It was not until the work by Wang et al. (2018) that a purely transition-based SDP parser (enhanced with a simple model ensemble technique) achieved competitive results. They simply modified the preconditions of the complex transition system by Choi and McCallum (2013) to produce unrestricted DAG structures. In addition, their system was implemented by means of stack-LSTMs, enhanced with BiLSTMs and Tree-LSTMs for feature extraction.
We are, to the best of our knowledge, the first to explore DAG parsing with Pointer Networks, proposing a purely transition-based algorithm that can be a competitive alternative to graph-based SDP models.
Finally, during the reviewing process of this work, the proceedings of the CoNLL 2019 shared task (Oepen et al., 2019) were released. In that event, SDP parsers were evaluated on updated versions of the SemEval 2015 Task 18 datasets, as well as on datasets in other semantic formalisms such as Abstract Meaning Representation (AMR) (Banarescu et al., 2013) and Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013). Although graph-based parsers achieved better accuracy in the SDP track, several BERT-enhanced transition-based approaches were proposed. Among them we can find an extension (Che et al., 2019) of the system by Wang et al. (2018), several adaptations for SDP (Hershcovich and Arviv, 2019; Bai and Zhao, 2019) of the transition-based UCCA parser by Hershcovich et al. (2017), as well as an SDP variant (Lai et al., 2019) of the constituent transition system introduced by Fernández-González and Gómez-Rodríguez. Also in parallel to the development of this research, Zhang et al. (2019) proposed a transition-based parser that, while it can be applied to SDP, was specifically designed for AMR and UCCA parsing (where graph nodes do not correspond to words and must be generated during the parsing process). In particular, this approach incrementally builds a graph by predicting at each step a semantic relation composed of the target and source nodes plus the arc label. While this can be seen as an extension of our approach for those tasks where nodes must be generated, its complexity penalizes accuracy in the SDP task.

Multi-head Transition System
We design a novel transition system that is able to straightforwardly attach multiple heads to each word in a single pass, incrementally building, from left to right, a valid SDP graph: a labelled DAG.
To implement it, we use Pointer Networks (Vinyals et al., 2015). These neural networks are able to learn the conditional probability of a sequence of discrete numbers that correspond to positions in an input sequence and, at decoding time, perform as a pointer that selects a position from the input. In other words, we can train this neural network to, given a word, point to the position in the sentence where its head or its dependent words (Ma et al., 2018) are located, depending on the interpretation used during training. In particular, pointing to heads proved to be more suitable for dependency parsing than pointing to dependents (Ma et al., 2018), since it requires half as many steps to produce the same dependency parse, making the parser not only faster, but also more accurate (as fewer steps mitigate the impact of error propagation).
Inspired by Fernández-González and Gómez-Rodríguez (2019), we train a Pointer Network to point to the head of a given word and propose an algorithm that does not use any kind of data structure (stack or buffer, required in classic transition-based parsers (Nivre, 2008)), but just a focus word pointer i for marking the word currently being processed. In more detail, given an input sentence of n words w_1, ..., w_n, the parsing process starts with i pointing at the first word w_1. At each time step, the current focus word w_i is used by the Pointer Network to return a position p from the input sentence (or 0, where the ROOT node is located). This information is used to choose between the two available transitions:

• If p ≠ i, then the pointed word w_p is considered a semantic head word (predicate) of w_i and an Attach-p transition is applied, creating the directed arc w_p → w_i. The Attach-p transition is only permissible if the resulting predicate-argument arc neither already exists nor generates a cycle in the already-built graph, in order to output a valid DAG.
• On the contrary, if p = i (i.e., the model points to the current focus word), then w_i is considered to have found all its head words, and a Shift transition is chosen to move i one position to the right to process the next word w_{i+1}.
The parsing ends when the last word of the sentence is shifted, meaning that the input has been completely processed. As stated by Ma et al. (2018) for attaching dependent words, it is necessary to fix the order in which (in our case) head words are assigned in order to define a deterministic decoding. As the sentence is parsed in a left-to-right manner, we adopt the same order for head assignments. For instance, the SDP graph in Figure 1(a) is produced by the transition sequence described in Table 1. We just need n Shift transitions to move the focus word pointer through the whole sentence and m Attach-p transitions to create the m arcs present in the SDP graph.
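The transition loop above can be sketched in a few lines of Python (a minimal illustration, not the actual implementation: `pointer` is a hypothetical stand-in for the network's prediction, and the cycle test is a naive DFS rather than the efficient method described in the Complexity section):

```python
def decode(n, pointer):
    """Parse a sentence of n words; position 0 is the ROOT node.

    `pointer` is a stand-in for the Pointer Network: given the focus
    word i (and the arcs built so far), it returns a position p.
    """
    arcs = set()
    i = 1                                # focus word pointer
    while i <= n:
        p = pointer(i, arcs)
        if p != i:                       # Attach-p: create arc w_p -> w_i
            if (p, i) not in arcs and not creates_cycle(arcs, p, i):
                arcs.add((p, i))
        else:                            # Shift: move on to the next word
            i += 1
    return arcs

def creates_cycle(arcs, head, dep):
    # Adding head -> dep closes a cycle iff dep already reaches head.
    stack, seen = [dep], set()
    while stack:
        u = stack.pop()
        if u == head:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(d for h, d in arcs if h == u)
    return False
```

With a scripted oracle for a two-word sentence, the sequence Attach-0, Shift, Attach-1, Shift produces the arcs ROOT → w_1 and w_1 → w_2.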
It is worth mentioning that we manage to significantly reduce the number of transitions necessary for generating DAGs in comparison to the complex transition systems by Choi and McCallum (2013) and Titov et al. (2009), used in the SDP systems by Wang et al. (2018) and Du et al. (2015), respectively. In addition, the described multi-head transition system is able to directly produce any DAG structure without exception, while some transition systems, such as the aforementioned ones (Sagae and Tsujii, 2008; Titov et al., 2009), are limited to a subset of DAGs. Finally, while the outcome of the proposed transition system is an SDP graph without cycles, in other research, such as (Kurita and Søgaard, 2019) and the state-of-the-art model by Dozat and Manning (2018), the parser is not forced to produce well-formed DAGs, allowing the presence of cycles.

Table 1: Transition sequence for generating the SDP graph in Figure 1(a).
Neural Network Architecture

Basic Approach

Vinyals et al. (2015) introduced an encoder-decoder architecture, called Pointer Network, that uses a mechanism of neural attention (Bahdanau et al., 2014) to select positions from the input sequence, without requiring a fixed size of the output dictionary. This allows Pointer Networks to easily address problems where the target classes considered at each step are variable and depend on the length of the input sequence. We show that implementing the previously defined transition system on this neural network results in an accurate SDP system.
We follow previous work in dependency parsing (Ma et al., 2018) to design our neural architecture:

Encoder A BiLSTM-CNN architecture (Ma and Hovy, 2016) is used to encode the input sentence w_1, ..., w_n, word by word, into a sequence of encoder hidden states h_1, ..., h_n. CNNs with max pooling are used for extracting character-level representations of words and, then, each word w_i is represented by the concatenation of its character (e^c_i), word (e^w_i), lemma (e^l_i) and POS tag (e^p_i) embeddings:

x_i = e^c_i ⊕ e^w_i ⊕ e^l_i ⊕ e^p_i

After that, the vector x_i of each word w_i is fed one-by-one into a BiLSTM that captures context information in both directions and generates a vector representation h_i:

h_i = BiLSTM(x_i)

In addition, a special vector representation h_0, denoting the ROOT node, is prepended to the sequence of encoder hidden states.
Decoder An LSTM is used to output, at each time step t, a decoder hidden state s_t. As input to the decoder, we use the encoder hidden state h_i of the current focus word w_i plus extra higher-order features. In particular, we take into account the hidden state of the last head word (h_h) attached to w_i, which will be a co-parent of any future predicate assigned to w_i. Following Ma et al. (2018), we use an element-wise sum to add this information without increasing the dimensionality of the input: the decoder input at step t is simply h_i + h_h. Note that feature information like this can be easily added in transition-based models without increasing the parser's runtime complexity, something that does not happen in graph-based models, where, for instance, adding second-order features for score computation penalizes runtime complexity. We experimented with other higher-order features, such as grandparent or sibling information of the current focus word w_i, but no significant improvements were obtained from their addition, so they were discarded for simplicity. Further feature exploration might improve parser performance, but we leave this for future work.
Once s_t is generated, the attention vector a_t, which will work as a pointer over the input, must be computed in the pointer layer. First, following the previously cited work, the score between s_t and each encoder hidden representation h_j from the input sentence is computed using the following biaffine attention scoring function:

v_t^j = f_1(s_t)^T W f_2(h_j) + U^T f_1(s_t) + V^T f_2(h_j) + b

where the parameter W is the weight matrix of the bilinear term, U and V are the weight tensors of the linear terms and b is the bias vector. In addition, f_1(·) and f_2(·) are two single-layer multilayer perceptrons (MLPs) with ELU activation, used for reducing dimensionality and minimizing overfitting. Then, a softmax is applied on the resulting score vector v_t to compute a probability distribution over the input words:

a_t = softmax(v_t)

The resulting attention vector a_t can now be used as a pointer to select the highest-scoring position p from the input. This information will be employed by the transition system to choose between the two available actions and either create a predicate-argument relation between w_p and w_i (Attach-p) or move the focus word pointer to w_{i+1} (Shift). In case the chosen Attach-p is forbidden due to the acyclicity constraint, the next highest-scoring position in a_t is considered as output instead. Figure 2 depicts the neural architecture and the decoding procedure for the SDP structure in Figure 1(a).
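As a concrete illustration, the biaffine scoring and pointing step can be sketched with NumPy (dimensions and random weights are toy placeholders; in the real model f_1 and f_2 are trained MLPs and all parameters are learned):

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_dec, d_enc, d = 6, 6, 4            # toy dimensions (placeholders)

M1 = rng.normal(size=(d, d_dec))     # f1: reduces the decoder state
M2 = rng.normal(size=(d, d_enc))     # f2: reduces the encoder states
W = rng.normal(size=(d, d))          # bilinear term
U = rng.normal(size=d)               # linear term over f1(s_t)
V = rng.normal(size=d)               # linear term over f2(h_j)
b = 0.0                              # bias

def score(s_t, h_j):
    """v_t^j = f1(s_t)^T W f2(h_j) + U^T f1(s_t) + V^T f2(h_j) + b"""
    f1, f2 = elu(M1 @ s_t), elu(M2 @ h_j)
    return f1 @ W @ f2 + U @ f1 + V @ f2 + b

s_t = rng.normal(size=d_dec)               # decoder state at step t
H = rng.normal(size=(5, d_enc))            # encoder states h_0 .. h_4
a_t = softmax(np.array([score(s_t, h) for h in H]))
p = int(np.argmax(a_t))                    # pointed position
```

The resulting a_t is a probability distribution over the five input positions, and p is the position the model "points" to.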
Label prediction We jointly train a multi-class classifier that scores every label for each pair of words. It shares the same encoder and uses the same biaffine attention function as the pointer:

s_l = g_1(s_t)^T W_l g_2(h_p) + U_l^T g_1(s_t) + V_l^T g_2(h_p) + b_l

where a distinct weight matrix W_l, weight tensors U_l and V_l and bias b_l are used for each label l, with l ∈ {1, 2, ..., L} and L the number of labels. In addition, g_1(·) and g_2(·) are two single-layer MLPs with ELU activation. The scoring function is applied over each predicted arc between the dependent word w_i (represented by s_t) and the pointed head word w_p in position p (represented by h_p) to compute the score of each possible label, and the highest-scoring label is assigned.
Training Objectives The Pointer Network is trained to minimize the negative log-likelihood (implemented as a cross-entropy loss) of producing the correct SDP graph y for a given sentence x: P_θ(y|x). Let y be a DAG for an input sentence x that is decomposed into a set of m directed arcs a_1, ..., a_m following a left-to-right order. This probability can be factorized as:

P_θ(y|x) = ∏_{k=1}^{m} P_θ(a_k | a_{<k}, x)

where a_{<k} denotes the previously predicted arcs.
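Given this factorization, the loss for one sentence is just a sum of per-transition cross-entropy terms; a minimal sketch (the per-step probabilities would come from the predicted attention vectors a_t):

```python
import math

def sequence_nll(step_probs):
    """-log P(y|x) = -sum_k log P(a_k | a_<k, x): one probability per
    gold transition, read off the predicted distribution at that step."""
    return -sum(math.log(p) for p in step_probs)
```

A perfectly confident model incurs zero loss, while less confident predictions at any step increase it.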
On the other hand, the labeler is trained with softmax cross-entropy to minimize the negative log-likelihood of assigning the correct label l, given a dependency arc with head word w_h and dependent word w_i. The whole neural model is jointly trained by summing the parser and labeler losses prior to computing the gradients. In that way, model parameters are learned to minimize the sum of the cross-entropy loss objectives over the whole corpus.

Deep Contextualized Word Embeddings Augmentation
In order to further improve the accuracy of our approach, we augment our model with deep contextualized word embeddings provided by the widely-used pre-trained language model BERT (Devlin et al., 2019). Instead of including and training the whole BERT model as the encoder of our system, we follow the common, greener and more cost-effective approach of leveraging the potential of BERT by extracting the weights of one or several layers as word-level embeddings. To that end, the pre-trained uncased BERT_BASE model is used.
Since BERT is trained on subwords (i.e., substrings of the original token), we take the 768-dimensional vector of each subword of an input token and use the average embedding as the final representation e^BERT_i. Finally, this is directly concatenated to the basic word representation before feeding the BiLSTM-based encoder:

x_i = e^c_i ⊕ e^w_i ⊕ e^l_i ⊕ e^p_i ⊕ e^BERT_i

Higher performance can be achieved by summing or concatenating (depending on the task) several layers of BERT; however, exploring these combinations is out of the scope of this paper and we simply use embeddings extracted from the second-to-last hidden layer (since the last layer is biased towards the target objectives used to train BERT's language model).
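The subword-averaging step can be sketched as follows (the WordPiece segmentation shown is only illustrative, and the vectors are dummies standing in for BERT's hidden-layer outputs):

```python
import numpy as np

def token_embedding(subword_vectors):
    """e^BERT_i: average of the 768-d vectors of a token's subwords."""
    return np.mean(np.stack(subword_vectors), axis=0)

# A token split into two WordPiece subwords, e.g. "expect" + "##ations"
# (vectors here are dummies; real ones come from BERT's hidden layer).
pieces = [np.ones(768), np.zeros(768)]
e_bert = token_embedding(pieces)
```

The resulting 768-dimensional vector is then concatenated to the basic word representation x_i.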

Data
In order to test the proposed approach, we conduct experiments on the SemEval 2015 Task 18 English datasets (Oepen et al., 2015), where all sentences are annotated with three different formalisms: DELPH-IN MRS (DM) (Flickinger et al., 2012), Predicate-Argument Structure (PAS) (Miyao and Tsujii, 2004) and Prague Semantic Dependencies (PSD) (Hajič et al., 2012). The standard split used in previous work (Almeida and Martins, 2015; Du et al., 2015) results in 33,964 training sentences from Sections 00-19 of the Wall Street Journal corpus (Marcus et al., 1993), 1,692 development sentences from Section 20, 1,410 sentences from Section 21 as the in-domain test set, and 1,849 sentences sampled from the Brown Corpus (Francis and Kucera, 1982) as out-of-domain test data. For the evaluation, we use the official script, reporting labelled F-measure scores (LF1) (including ROOT arcs) on the in-domain (ID) and out-of-domain (OOD) test sets for each formalism, as well as the macro-average over the three of them.

Settings
We use the Adam optimizer (Kingma and Ba, 2014) and follow Ma et al. (2018) for parameter optimization. We do not specifically perform hyper-parameter selection for SDP and simply adopt the values proposed by Ma et al. (2018) for syntactic dependency parsing (detailed in Table 2). For initializing word and lemma vectors, we use pre-trained structured-skipgram embeddings. POS tag and character embeddings are randomly initialized, and all embeddings (except the deep contextualized ones) are fine-tuned during training. Due to random initializations, we report average accuracy over 5 repetitions for each experiment. In addition, during a 500-epoch training, the model with the highest labelled F-score on the development set is chosen. Finally, while further beam-size exploration might improve accuracy, we use beam-search decoding with beam size 5 in all experiments.


Results and Discussion

Table 3 reports the accuracy obtained by the state-of-the-art SDP parsers described in Section 2 in comparison to our approach. To perform a fair comparison, we group SDP systems into three blocks depending on the embeddings provided to the architecture: (1) just basic pre-trained word and POS tag embeddings, (2) character and pre-trained lemma embeddings augmentation and (3) pre-trained deep contextualized embeddings augmentation. As shown by these results, our approach outperforms all existing transition-based models and the widely-used approach by Dozat and Manning (2018) with or without character and lemma embeddings, and it is on par with the best graph-based SDP parser on average in the fully-supervised scenario. In addition, our model achieves the best fully-supervised accuracy to date on the PSD formalism, considered the hardest to parse. We hypothesize that this might be explained by the fact that the PSD formalism is the most tree-oriented (as pointed out by Oepen et al. (2015)) and presents a lower ratio of arcs per sentence, making it more suitable for our transition-based approach.
In the semi-supervised scenario, BERT-based embeddings proved to be more beneficial on the out-of-domain data. In fact, while it is not a fully fair comparison since we neither include contextual string embeddings (Flair) (Akbik et al., 2018) nor explore different BERT layer combinations, our new transition-based parser manages to outperform the state-of-the-art system by He and Choi (2019) on average on the out-of-domain test set, obtaining a remarkable accuracy on the PSD formalism.

Complexity
Given a sentence of length n whose SDP graph has m arcs, the proposed transition system requires n Shift plus m Attach-p transitions to parse it. Therefore, since a DAG can have up to Θ(n^2) edges (as is also the case for general directed graphs), it could potentially need O(n^2) transitions in the worst case. However, we show that this does not happen in practice and that real sentences can be parsed with O(n) transitions instead.

Table 3: Accuracy comparison of state-of-the-art SDP parsers on the SemEval 2015 Task 18 datasets. Gb and Tb stand for graph- and transition-based models, +char and +lemma for augmentations with character-level and lemma embeddings, +Flair and +BERT_BASE|LARGE for augmentations with deep contextualized character-level and word-level embeddings, and, finally, +MT, +RL and +Ens for the application of multi-task, reinforcement learning and ensemble techniques.
The parsing complexity of a transition-based dependency parsing algorithm can be determined by the number of transitions performed with respect to the number of words in a sentence (Kübler et al., 2009). Therefore, we measure the transition sequence length predicted by the system for every sentence from the development sets of the three available formalisms and depict the relation between sequence lengths and sentence lengths. As shown in Figure 3, a linear behavior is observed in all cases, proving that the number of Attach-p transitions evaluated by the model at each step is considerably low (behaving practically like a constant). This can be explained by the fact that, on average on the training set, the ratio of predicate-argument dependencies per word in a sentence is 0.79 in DM, 0.99 in PAS and 0.70 in PSD, meaning that the transition sequence necessary for parsing a given sentence will need no more Attach-p transitions than Shift ones (of which there is one per word in the sentence). It is true that one argument can be attached to more than one predicate; however, the number of words left unattached in the resulting DAG (singletons) can be significant in some formalisms (as described graphically in Figure 1): on average on the training set, 23% of words per sentence in DM, 6% in PAS and 35% in PSD. In addition, the edge density on non-singleton words, computed by Oepen et al. (2015) on the test sets, also backs the linear behavior shown in our experiments: 0.96 in DM, 1.02 in PAS and 1.01 in PSD for the in-domain set, and 0.95 in DM, 1.02 in PAS and 0.99 in PSD for the out-of-domain data. In conclusion, we can state that, on the datasets tested, the proposed transition system executes O(n) transitions.
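In other words, the transition sequence length is exactly n + m, and since these corpora exhibit at most about one arc per word, it stays linear in n; trivially:

```python
def transition_count(n, arcs_per_word):
    """n Shift transitions plus roughly arcs_per_word * n Attach-p ones."""
    return n + round(arcs_per_word * n)
```

With the per-word arc ratios reported above (0.79 for DM, 0.99 for PAS, 0.70 for PSD), the sequence length never exceeds 2n.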
To determine the runtime complexity of the implementation of the transition system, we need to consider the following: firstly, at each transition, the attention vector a_t needs to be computed, which means that each of the O(n) transitions takes O(n) time to run. Therefore, the overall time complexity of the parser, ignoring cycle detection, is O(n^2). Note that this is in contrast to the second-order graph-based approach described in Section 2, which takes cubic time even though it does not enforce acyclicity. If we add cycle detection, needed to forbid transitions that would create cycles and therefore to enforce that the output is a DAG, then the complexity becomes O(n^2 log n). This is because an efficient implementation of cycle detection contributes an additive factor of O(n^2 log n) to the worst-case time complexity, which becomes the dominant factor. To achieve this efficient implementation, we incrementally maintain two data structures: on the one hand, we keep track of weakly connected components using path compression and union by rank, which can be done in inverse Ackermann time, as is commonly done for cycle detection in tree and forest parsers (Covington, 2001; Gómez-Rodríguez and Nivre, 2010). On the other hand, we keep a weak topological numbering of the graph using the algorithm by Bender et al. (2015), which takes overall O(n^2 log n) time over all edge insertions. With these two data structures, cycles can be checked in constant time: an arc a → b creates a cycle if the involved nodes are in the same weakly connected component and a has a greater topological number than b.
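A naive sketch of these two data structures follows (hypothetical class and method names; the union-find here uses path compression only, and the topological numbering is rebuilt from scratch with Kahn's algorithm after each insertion rather than maintained incrementally as in Bender et al. (2015), so it illustrates the logic but not the stated complexity):

```python
class IncrementalDAG:
    """Incremental cycle detection for arc insertions (naive sketch)."""

    def __init__(self, n):
        self.n = n
        self.parent = list(range(n))            # union-find (components)
        self.topo = list(range(n))              # a topological numbering
        self.succ = {u: set() for u in range(n)}

    def find(self, u):                          # with path compression
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]
            u = self.parent[u]
        return u

    def _reaches(self, src, dst):               # DFS, only run on demand
        stack, seen = [src], set()
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            if u not in seen:
                seen.add(u)
                stack.extend(self.succ[u])
        return False

    def add_arc(self, a, b):
        """Try to add a -> b; return False if it would create a cycle."""
        if a == b:
            return False
        # A path b ~> a can only exist if a and b share a weak component
        # and the current numbering puts a after b.
        if (self.find(a) == self.find(b)
                and self.topo[a] > self.topo[b]
                and self._reaches(b, a)):
            return False
        self.succ[a].add(b)
        self.parent[self.find(a)] = self.find(b)
        self._renumber()
        return True

    def _renumber(self):
        # Kahn's algorithm: rebuild a valid topological numbering.
        indeg = [0] * self.n
        for u in range(self.n):
            for v in self.succ[u]:
                indeg[v] += 1
        stack = [u for u in range(self.n) if indeg[u] == 0]
        k = 0
        while stack:
            u = stack.pop()
            self.topo[u] = k
            k += 1
            for v in self.succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
```

Note that the expensive path search is only triggered when both the component test and the numbering test fire, which is what makes the amortized constant-time check possible in the efficient version.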
Therefore, the overall expected worst-case running time of the proposed SDP system is O(n^2 log n) for the range of data attested in the experiments, and can be lowered to O(n^2) if we are willing to forgo enforcing acyclicity.

Conclusions and Future Work
Our multi-head transition system can accurately parse a sentence in quadratic worst-case runtime thanks to Pointer Networks. While being more efficient, our approach outperforms the previous state-of-the-art parser by Dozat and Manning (2018) and matches the accuracy of the best model to date, proving that, with a state-of-the-art neural architecture, transition-based SDP parsers are a competitive alternative.
By adding BERT-based embeddings, we significantly improve our model's accuracy while only marginally affecting computational cost, achieving state-of-the-art F-scores on out-of-domain test sets.
Despite the promising results, the accuracy of our approach could probably be boosted further by experimenting with new feature information and by specifically tuning hyper-parameters for the SDP task, as well as by using different enhancements such as implementing a recently presented hierarchical decoding strategy, including contextual string embeddings (Akbik et al., 2018) as done by He and Choi (2019), or applying multi-task learning across the three formalisms as in previous work.