UC Davis at SemEval-2019 Task 1: DAG Semantic Parsing with Attention-based Decoder

We present an encoder-decoder model for semantic parsing with UCCA in SemEval 2019 Task 1. The encoder is a BiLSTM and the decoder uses recursive self-attention. The proposed model reduces the challenges and feature engineering required by traditional transition-based and graph-based parsers. The resulting parser is simple and proved to be effective on the semantic parsing task.


Introduction
Semantic parsing aims to capture structural relationships between input strings and graph representations of sentence meaning, going beyond concerns of surface word order, phrases and relationships. The focus on meaning rather than surface relations often requires the use of reentrant nodes and discontinuous structures. Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013) is designed to support semantic parsing with mappings between sentences and their corresponding meanings in a framework intended to be applicable across languages.
SemEval 2019 Task 1 (Hershcovich et al., 2018b, 2019) focuses on semantic parsing of texts into graphs consisting of terminal nodes that represent words, non-terminal nodes that represent internal structure, and labeled edges representing relationships between nodes (e.g. participant, center, linker, adverbial, elaborator), according to the UCCA scheme. Annotated datasets are provided, and participants are evaluated in four settings: English with domain-specific data, English with out-of-domain data, German with domain-specific data, and French with only development and test data, but no training data. Additionally, there are open and closed tracks, where the use of additional resources is and is not allowed, respectively. Our entry in the task is limited to the closed track and the first setting, domain-specific English using the Wiki corpus, where the relatively small dataset (4113 sentences for training, 514 for development, and 515 for testing) consists of annotated sentences from English Wikipedia.
Our model follows the encoder-decoder architecture commonly used in state-of-the-art neural parsing models (Kitaev and Klein, 2018; Kiperwasser and Goldberg, 2016b; Cross and Huang, 2016; Chen and Manning, 2014). However, we propose a very simple decoder architecture that relies only on a recursive attention mechanism over the encoded latent representation. In other words, the decoder requires neither state encoding nor model-optimal inference. Our model achieved a macro-averaged F1-score of 0.753 in labeled primary edges and 0.864 in unlabeled primary edge prediction on the test set. The results confirm the suitability of our proposed model to the semantic parsing task.

Related work
Leveraging parallels between UCCA and known approaches for syntactic parsing, Hershcovich et al. (2017) proposed TUPA, a customized transition-based parser with dense feature representation. Based on this model, Hershcovich et al. (2018a) used multitask learning effectively by training a UCCA model along with similar parsing tasks where more training data is available, such as Abstract Meaning Representation (AMR) (Banarescu et al., 2013) and Universal Dependencies (UD) (Nivre et al., 2016). Due to the requirements of reentrancy, discontinuity, and non-terminals, other powerful parsers were shown to be less suitable for parsing with UCCA (Hershcovich et al., 2017).

Figure 1: Each $v_i$ represents the context embedding for word $i$ from the BiLSTM encoder. Words on edges represent category labels between nodes, where A is participant and P is process. Circles represent nodes in the graph, each with a pair of indices. Circles with 0 as the first index are terminal nodes, and circles with 1 as the first index are non-terminal nodes. (1) Dashed green lines represent the attention mechanism for the word Carey, which forms a continuous proper noun "Mariah Carey". (2) Dashed red lines represent the attention mechanism for the word down, which forms a discontinuous unit "turned ... down". (3) Dotted blue lines represent the attention mechanism for node 1.4. The darker the color, the higher the attention score.

Parsing Model
BiLSTM models are capable of providing feature representations for sequential data, and attention mechanisms (Vaswani et al., 2017) have been applied successfully to parsing tasks (Kitaev and Klein, 2018). Inspired by their success, our model uses a BiLSTM encoder and a self-attention decoder. The encoder represents each node (terminal and non-terminal) in the DAG without the need to encode features and the current parser state. The proposed decoder takes the encoded representation as the configuration and applies an attention mechanism to it. Without any additional feature extraction, it serves a role similar to that of the oracle and the transition system in transition-based parsers. We jointly train a label prediction model and a discontinuity prediction model. We predict remote edges with a different encoder. An example of the parsing model can be seen in Figure 1.

Terminal Nodes
To mitigate sparsity due to the small amount of training data available, we concatenate part-of-speech tag embeddings to word embeddings in terminal nodes. In addition, because the connections between terminal nodes and non-terminal nodes often require identification of named entities, we also added entity type and case information as additional knowledge. Given a sentence $x = x_1, ..., x_n$, the vector for each input token is thus represented as

$u_i = emb(x_i) \circ emb(pos_i) \circ emb(ent_i) \circ emb(case_i)$

where $\circ$ denotes concatenation and $case_i$ is 1 if the first character of the word is capitalized and 0 otherwise. We use pretrained word embeddings from fastText for $emb(x_i)$. POS tags and entity types are predicted using external models and are provided in the training corpus. Each word representation from the encoder is $v_i = BiLSTM(u_i)$. We assign these contextual word embeddings as vectors to terminal nodes.
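The concatenation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the lookup tables, their toy dimensions, the `<unk>` fallback, and the function names `case_flag` and `token_vector` are hypothetical and are not taken from the submitted system.

```python
def case_flag(word):
    """Return 1 if the first character of the word is capitalized, else 0."""
    return 1 if word[:1].isupper() else 0


def token_vector(word, pos, ent, word_emb, pos_emb, ent_emb, case_emb):
    """Concatenate word, POS, entity-type, and case embeddings for one token."""
    unk = word_emb["<unk>"]  # assumed fallback for out-of-vocabulary words
    return (
        list(word_emb.get(word, unk))
        + list(pos_emb[pos])
        + list(ent_emb[ent])
        + list(case_emb[case_flag(word)])
    )


# Toy lookup tables (hypothetical values; real sizes would be 300 and 20).
word_emb = {"<unk>": [0.0, 0.0], "Carey": [0.1, 0.2]}
pos_emb = {"PROPN": [1.0], "VERB": [2.0]}
ent_emb = {"PERSON": [3.0], "O": [4.0]}
case_emb = {0: [0.0], 1: [1.0]}

u_carey = token_vector("Carey", "PROPN", "PERSON",
                       word_emb, pos_emb, ent_emb, case_emb)
u_turned = token_vector("turned", "VERB", "O",
                        word_emb, pos_emb, ent_emb, case_emb)
```

In a full system, the resulting vectors would be passed through the BiLSTM encoder to obtain the contextual embeddings assigned to terminal nodes.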

Non-terminal Nodes
For non-terminal nodes with only one terminal node as the child, the representation is the same as its corresponding terminal node, i.e. a contextual word embedding from the BiLSTM encoder. For other non-terminal nodes that have more than one terminal child or non-terminal children (i.e. represent more than one word in the text), we use a span representation. Following Cross and Huang (2016), we represent the span between the words $x_i$ and $x_j$ as $v_{i,j} = [f_j - f_i] \circ [b_i - b_j]$, where $f_0, ..., f_n$ and $b_0, ..., b_n$ are the outputs of the forward and backward directions in the BiLSTM, respectively. However, using linear subtraction of the outputs of a nonlinear recurrent neural network (RNN) as a span approximation is not intuitive. Instead, we experimented with an additional BiLSTM on the target span $x_i, x_{i+1}, ..., x_j$, similar to the recursive tree representations in (Socher et al., 2013; Kiperwasser and Goldberg, 2016a) but with the feed-forward network replaced by an LSTM. In our experiments with the small dataset in the closed track of the English domain-specific setting, this method did not result in improved performance.
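As a rough illustration of the subtraction-based span representation in the style of Cross and Huang (2016), the sketch below concatenates a forward difference and a backward difference of BiLSTM outputs. The exact index offsets (here $f_j - f_i$ and $b_i - b_j$), the toy vectors, and the name `span_vector` are assumptions made for illustration.

```python
def span_vector(f, b, i, j):
    """Approximate the span (i, j) by differences of BiLSTM outputs.

    f and b are lists of forward and backward output vectors, indexed 0..n.
    The result is the concatenation [f_j - f_i] + [b_i - b_j].
    """
    forward_diff = [fj - fi for fj, fi in zip(f[j], f[i])]
    backward_diff = [bi - bj for bi, bj in zip(b[i], b[j])]
    return forward_diff + backward_diff


# Toy one-dimensional outputs for positions 0..3 (hypothetical values).
f = [[0.0], [1.0], [3.0], [6.0]]
b = [[6.0], [5.0], [3.0], [0.0]]

v_1_3 = span_vector(f, b, 1, 3)
```

A span covering a single position collapses to a zero vector under this scheme, which is one reason single-terminal nodes reuse the contextual word embedding instead.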

Attention Mechanism For Decoding
Our basic decoding model is inspired by the global attention mechanism used in machine translation. The attention averages the encoded state in each time step in the sequence with trainable weights (Luong et al., 2015). We set a maximum sequence length and calculate the attention weights (as probabilities) for the left boundary index of the span given the node representation $v_{i,j}$ ($i \le j$):

$h_{span} = MLP(v_{i,j})$
$p_{left\ boundary} = softmax(h_{span})$ (1)

where $MLP$ is a multilayer perceptron and $h_{span}$ is of size (1, max sequence length). We choose $\arg\max_i p_{left\ boundary}$ as the index of the left boundary of the predicted span. Let $j_l$ denote the index of the left most child of node $j$ (for example, in Figure 1, $j_l$ for node 1.5 is 1 and $j_l$ for node 1.6 is 6; for simplicity, word indices start at 1 in the figure). If $i \ge j_l$, then the node attends to itself to indicate that a span cannot be created yet (as is the case for node 1.6 in Figure 1). Otherwise, there is a span that forms a semantic unit and we need to create a parent node. For example, $i = 1$ for node 1.4, so we create a new node 1.5 which connects the nodes within the span [1:5], i.e. node 1.1, node 1.3, and node 1.4. We do this recursively, attending to a previous index until the node attends to itself. Then we repeat the procedure on the next word in the sequence. The procedure is illustrated in Figure 1 with dotted blue lines and presented in Algorithm 1. Here, primary_parent denotes the parent of which the current node is not a remote child (in the DAG setting, a child node may have multiple parents). We set the maximum number of recursions to 7 to prevent excessive node creation during inference.

Algorithm 1 (only the final lines of the listing are recoverable)
    ...
    $i = primary\_parent(v_{i_{attn}})_l$
8: end for

Despite its simplicity, there are two limitations to this method. One is the restriction of the maximum sequence length. The other is the distinction between the indices and the actual words in each sentence. The model may cheat during training by attending to specific indices regardless of the actual words at these indices.
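The recursive procedure described in this section can be sketched as follows. This is a simplified illustration, not the submitted implementation: `predict_left` is a hypothetical stand-in for the arg max of the attention model, nodes are reduced to their span boundaries, and stopping when there is nothing new to attach is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    left: int    # index of the left most word covered by the node
    right: int   # index of the right most word covered by the node
    children: List["Node"] = field(default_factory=list)


def decode(n_words: int,
           predict_left: Callable[[int, int], int],
           max_recur: int = 7) -> List[Node]:
    """Build spans left to right; each new node may recursively attend leftwards."""
    roots: List[Node] = []  # nodes without a parent yet, in left-to-right order
    for j in range(1, n_words + 1):
        node = Node(j, j)  # non-terminal node created for terminal j
        roots.append(node)
        for _ in range(max_recur):
            i = predict_left(node.left, node.right)
            if i >= node.left:
                break  # the node attends to itself: no span can be created yet
            inside = [r for r in roots if r.left >= i]
            if len(inside) < 2:
                break  # assumption: stop if no other node falls inside the span
            roots = [r for r in roots if r.left < i]
            node = Node(inside[0].left, node.right, inside)  # new parent node
            roots.append(node)
    return roots


# Hypothetical predictions keyed by (left, right); default is self-attention.
table = {(2, 2): 1, (4, 4): 3, (3, 4): 1}
tree = decode(4, lambda left, right: table.get((left, right), left))
```

With these toy predictions, words 1-2 and 3-4 are first grouped into two spans, which are then joined under a single node covering the whole sequence.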
Motivated by the success of biaffine attention (Dozat and Manning, 2016) and self-attention models (Vaswani et al., 2017), we replace the index attention decoder with a multiplication model where we can leverage fast optimized matrix multiplication. Similar to the left most child, let j r denote the index of the right most child of node j .
where $v$ is the output from the encoder of size (sequence length, batch size, hidden size). The scoring function is defined as:

Compared to the index attention decoder above, this decoder considers both the index and the span representation and thus is more flexible and robust to new texts. The recursion remains the same, except that lines 5 and 6 in Algorithm 1 are replaced with equations 3-6.
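To illustrate what multiplication-based scoring over encoder outputs can look like, the sketch below uses a generic bilinear score (a biaffine score without bias terms, in the spirit of Dozat and Manning, 2016) followed by a softmax over positions. It is an assumption-laden example and not necessarily the scoring function of the submitted system; the names `bilinear_scores` and `attend_left_boundary`, the weight matrix `W`, and the toy vectors are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_scores(query, keys, W):
    """Return score_k = query^T W key_k for every encoder position k."""
    n_cols = len(W[0])
    # Project the query once: qW = query^T W.
    qW = [sum(q * W[r][c] for r, q in enumerate(query)) for c in range(n_cols)]
    return [sum(a * b for a, b in zip(qW, key)) for key in keys]


def attend_left_boundary(query, keys, W):
    """Pick the position with the highest attention probability."""
    probs = softmax(bilinear_scores(query, keys, W))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy example: identity weights, three encoder positions (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 3.0]

best_index, probs = attend_left_boundary(query, keys, W)
```

Because the score depends on the vectors at each position rather than on a fixed output slot per index, a scorer of this kind is not tied to a maximum sequence length.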

Label Prediction
Contextual information is important to label prediction. For instance, in the sentence "It announced Carey returned to the studio to start ... ", the phrase "Carey returned to the studio" should be labeled as a participant (A) instead of a scene (H) according to the context. Ideally the encoder will capture the information from the whole sentence so that we only need the current span to predict its label (since the span has the context information from both sides). However, as shown in previous research with RNN models, the contextual information is lost for a relatively long sentence. Therefore, similar to the label prediction problem with dependency parsers, we use an MLP to predict the label of a span $v_{i,j}$ given its context $p = primary\_parent(v_{i,j})$:

$p_{label} = softmax(MLP_{label}(p \circ v_{i,j}))$ (7)
We also experimented with only using the span representation, as seen in constituency parsing (Gaddy et al., 2018), by replacing the concatenation $(p \circ v_{i,j})$ with $v_{i,j}$ in equation 7. Surprisingly, this increased the F1 score on the development set by 1.4 points. We conjecture that this is due to the limited amount of training data, which makes it more difficult to learn noisier representations.

Discontinuous Unit
After finding the left boundary of the current span unit as shown in section 3.3, we use two MLPs for binary classification to check (1) if the span forms a proper noun, in which case we need to combine multiple terminal nodes into one non-terminal node (as "Mariah Carey" in Figure 1) and (2) if the span forms a discontinuous unit (such as "turn ... down" in Figure 1).
If the node span attends to a node to the left and the model predicts a proper noun, we create a non-terminal node and link all the terminal nodes $i, i + 1, ..., j$ as its terminal children (shown as dashed green lines in Figure 1).
If the model predicts that the span is a discontinuous unit, instead of connecting all the terminal nodes as its children, the newly created node connects only node $i$ and node $j$, and we then perform the recurrence checks as shown in Algorithm 1 (illustrated as dashed red lines in Figure 1).
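The two cases above amount to a small decision rule on top of the predicted left boundary. The sketch below is a hypothetical illustration of that rule: the function name `terminal_children`, the boolean inputs standing in for the two MLP classifiers, and the precedence given to the discontinuous case when both flags are set are assumptions.

```python
from typing import List, Optional


def terminal_children(i: int, j: int,
                      is_proper_noun: bool,
                      is_discontinuous: bool) -> Optional[List[int]]:
    """Return the terminal indices a newly created node should connect to.

    Returns None when neither classifier fires, meaning the regular
    recursive span construction applies instead.
    """
    if is_discontinuous:
        return [i, j]                    # connect only the two end points
    if is_proper_noun:
        return list(range(i, j + 1))     # connect every terminal in the span
    return None


# Hypothetical indices for a two-word name and a three-word discontinuous span.
name_children = terminal_children(1, 2, is_proper_noun=True, is_discontinuous=False)
gap_children = terminal_children(3, 5, is_proper_noun=False, is_discontinuous=True)
```

In the discontinuous case the words between the two end points stay unattached to the new node, so they remain available to be attached elsewhere in the graph.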

Remote Edges
We predict remote edges in the same way as primary edges, using the matrix multiplication decoder. We use a different BiLSTM encoder to learn the representations, which avoids confusion between attention to primary edges and attention to remote edges.

Training and Inference
During training, node $i$ attends to the left most child of its primary parent (node $p$) recursively until node $p$ is not the left most child of node $p$'s parent. Because a span representation contains information from both left to right and right to left, node $i$ with the highest attention score contains not only the embedding of its terminal node, but also the span between index $i$ and $j$ in the text. We use cross-entropy loss to jointly train the embeddings, the BiLSTM encoder, and the decoder. For inference, we take the output of each token in the text from the BiLSTM encoder as input and create a non-terminal node for each terminal node. We create a new node when the token embedding attends to a different token outside of the current span boundary. The recurrence in Algorithm 1 is then applied to each newly created non-terminal node.

Experiments
For the encoder, we use a 2-layer, 500-dimensional BiLSTM with 0.2 dropout. The word embedding size is 300, with a feature embedding size of 20 each (POS tag, entity type, and case information). We use the Adam optimizer (Kingma and Ba, 2014) with β2 set to 0.9 as suggested by Dozat and Manning (2016). The development set is used for early stopping. Because of the small dataset (4113 training sentences), the model overfits after 4 epochs. Table 1 provides the results on the development set and Table 2 shows the results on the test set. The "official" row shows results of the model we submitted to the competition with a maximum recursion number of 5 and β2 = 0.99. We obtained higher scores by increasing the recursion limit as in section 3.3 (+ max recur = 7), using the current span only as explained in section 3.4 (+ child pred), changing β2 as shown in section 5 (+ β2 = 0.9), and fixing minor bugs (+ bug fix) incrementally. The "baseline" row shows the results of the baseline model (TUPA) from Hershcovich et al. (2017). The "final" row shows the results of the model fine-tuned on the development set mentioned in Table 1.

Table 2: F1 score on primary and remote edges reported on the test set.
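The hyperparameters reported above can be collected in a small configuration object, which also makes the encoder input size explicit: 300 + 3 × 20 = 360. This is a sketch for bookkeeping only; the class and field names are hypothetical, and treating 500 as the LSTM hidden size is an assumption, since the text does not say whether it is per direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    word_emb_dim: int = 300      # pretrained word embeddings
    feature_emb_dim: int = 20    # size of each feature embedding
    num_features: int = 3        # POS tag, entity type, case
    lstm_layers: int = 2
    lstm_hidden_dim: int = 500   # assumption: interpretation of "500 dimensional"
    dropout: float = 0.2
    adam_beta2: float = 0.9
    max_recur: int = 7           # recursion limit used during inference

    @property
    def input_dim(self) -> int:
        """Size of each concatenated token vector fed to the BiLSTM."""
        return self.word_emb_dim + self.num_features * self.feature_emb_dim


config = EncoderConfig()
submitted = EncoderConfig(adam_beta2=0.99, max_recur=5)  # settings of the submitted run
```

Keeping the submitted and improved settings as two instances of the same class makes the incremental changes reported in the tables easy to compare.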

Results
Since there are normally 0 or 1 remote edges in each sentence in the training corpus, the remote edge prediction model is not as effective. Still, the model captures some remote relations. For example, in the sentence "Additionally, Carey's newly slimmed figure began to change, as she stopped her exercise routines and gained weight", the node "gained weight" is predicted to point to "Carey", whereas the annotated remote child is "she". Discontinuous unit prediction also suffers from the problem of insufficient training samples.

Conclusion
This paper describes the system that the UC Davis team submitted to SemEval 2019 Task 1. We propose a recursive self-attention decoder with a simple architecture. Our model is effective in UCCA semantic parsing, ranking third in the closed-track in-domain task with modest fine-tuning, highlighting the suitability of our approach.