JBNU at MRP 2019: Multi-level Biaffine Attention for Semantic Dependency Parsing

This paper describes Jeonbuk National University (JBNU)'s system for the 2019 shared task on Cross-Framework Meaning Representation Parsing (MRP 2019) at the Conference on Computational Natural Language Learning. Of the five frameworks, we address only the DELPH-IN MRS Bi-Lexical Dependencies (DM), Prague Semantic Dependencies (PSD), and Universal Conceptual Cognitive Annotation (UCCA) frameworks. We propose a unified parsing model using biaffine attention (Dozat and Manning, 2017), consisting of 1) a BERT-BiLSTM encoder and 2) a biaffine attention decoder. First, the BERT-BiLSTM sentence encoder uses BERT to compose a sentence's wordpieces into word-level embeddings and subsequently applies a BiLSTM to the word-level representations. Second, the biaffine attention decoder determines the scores for an edge's existence and its labels based on biaffine attention functions between role-dependent representations. We also present multi-level biaffine attention models that combine all the role-dependent representations appearing at multiple intermediate layers.

One of the main aims of MRP 2019 is to induce a unified parsing model for different semantic frameworks such that parsing models can be trained using multi-task learning or transfer learning. To enable multi-task learning, we explicitly share common components of the neural network architecture across different frameworks. For MRP 2019, we propose a unified neural model for the DM/PSD/UCCA frameworks based on the biaffine attention of (Dozat and Manning, 2017, 2018), deploying the sentence encoder as a "shared" component across these three frameworks. Our system consists of two main components: 1. BERT-BiLSTM sentence encoder (shared across frameworks): Given a sentence, the BERT encoder (Devlin et al., 2019) encodes its wordpieces, and the encoded wordpiece-level representations are composed into word-level embeddings using a BiLSTM. Another BiLSTM layer is then applied to the resulting word-level embeddings to create the final sentence representations. We refer to this neural layer for encoding sentences as the BERT-BiLSTM sentence encoder. For multi-task learning, the BERT-BiLSTM sentence encoder is shared across all target frameworks.
2. Biaffine attention decoder (framework-specific): Role-dependent representations for each word are first induced from the sentence-level embeddings of the BERT-BiLSTM encoder using simple feed-forward layers. Biaffine attention is then performed on the resulting role-dependent representations to predict the existence of an edge and its labels. The biaffine attention decoder, however, is not shared but separately trained for each framework; thus, we have three different biaffine decoders corresponding to DM, PSD, and UCCA.
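As a structural illustration only, the shared-encoder plus framework-specific-decoder layout might be organized as sketched below; the class names, placeholder computations, and framework keys are ours for illustration, not the system's actual implementation:

```python
class SharedEncoder:
    """Stands in for the BERT-BiLSTM sentence encoder (shared)."""
    def encode(self, sentence):
        # placeholder: one fixed-size "embedding" per word
        return [[float(len(w))] for w in sentence.split()]

class BiaffineDecoder:
    """Stands in for a framework-specific biaffine attention decoder."""
    def __init__(self, framework):
        self.framework = framework
    def parse(self, states):
        # placeholder: return the framework tag and the number of words
        return (self.framework, len(states))

encoder = SharedEncoder()                       # one shared instance
decoders = {fw: BiaffineDecoder(fw) for fw in ('dm', 'psd', 'ucca')}

states = encoder.encode("Parsers share one encoder")
result = decoders['ucca'].parse(states)
```

The key point of the design is that the encoder object is instantiated once and reused by every decoder, so gradient updates from any framework reach the same shared parameters.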
In addition, our system handles the following issues specific to UCCA parsing and node property prediction: 1. UCCA parsing using biaffine attention: To handle the UCCA format with a biaffine attention model, we convert a UCCA graph to a bilexical framework using the semstr tool, which is based on the head rules of UCCA in (Hershcovich et al., 2017).^1 After the biaffine attention is performed, the parsed bilexical graph is converted back to the UCCA format.
2. BiLSTM neural models for node property prediction: In addition to predicting the existence and labels of an edge, the system is required to predict node properties (for DM and PSD). To handle node properties, we further develop property-specific BiLSTM-based neural models.^2 These property-specific neural components are designed in a framework-specific manner and are not shared across frameworks.
Furthermore, we present multi-level biaffine attention models, motivated by the multi-level architecture of FusionNet in the machine reading comprehension task.
The preliminary unofficial experiments using our own development setting show that multi-task learning is helpful in improving UCCA performance, but it does not lead to improvements on the DM and PSD frameworks.

^1 We first converted the UCCA MRP format to its XML format and then applied the converter (semstr/convert.py) in semstr to obtain its CoNLL format: https://github.com/danielhers/semstr

^2 The node properties required for DM and PSD are a POS tag and a frame. We prepared a BiLSTM neural model for predicting the frame information of a node only, whereas we used the companion data of MRP 2019 to predict POS tags.

The remainder of this paper is organized as follows: Section 2 presents our system architecture in detail, and Section 3 describes the process for training the biaffine attention models. Sections 4 and 5 provide the preliminary experimental results and the official results at MRP 2019, respectively, and our concluding remarks and a description of future work are given in Section 6.

System Architecture

Figure 1 shows the neural architecture based on biaffine attention for bilexical semantic dependency parsing. The neural architecture consists of two components: 1) the BERT-BiLSTM encoder and 2) the biaffine attention decoder. In the BERT-BiLSTM encoder, an input sentence is fed to a word representation layer using BERT, resulting in a sequence of word embedding vectors, which are then given to a BiLSTM layer to produce a sentence representation. In the biaffine attention decoder, additional feed-forward layers are applied to obtain role-dependent representations for the head and dependent roles, which are then forwarded to the biaffine attention.

Word representation layer using BERT
The word representation layer using BERT uses a BiLSTM to compose word-level embeddings from wordpiece-level embeddings, in a manner similar to prior work that instead used average pooling for the composition. Specifically, suppose that an input sentence consists of n words, i.e., $x_1 \cdots x_n$. To obtain the word representation $\mathbf{x}_i$ for $x_i$, we use BERT (Devlin et al., 2019), as shown in Figure 2. An input sentence is segmented into wordpieces, which are fed to the BERT encoder. The resulting BERT outputs for the wordpieces of the i-th word are aggregated using a BiLSTM, producing $\mathbf{w}^{bert}_i$, called the BERT word-level embedding.^3 The BERT word-level embedding is further combined with the pretrained GloVe word embedding of (Pennington et al., 2014) and a part-of-speech (POS) tag embedding to produce the final word representation, as follows:

$$\mathbf{x}_i = [\mathbf{w}^{bert}_i; \mathbf{e}^{glove}_i; \mathbf{e}^{POS}_i]$$

where $\mathbf{e}^{glove}_i$ and $\mathbf{e}^{POS}_i$ denote the pretrained GloVe word embedding and the POS tag embedding of the i-th word, respectively.
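As an illustration of the composition step, the following minimal numpy sketch groups wordpiece vectors by word index and pools each group; mean pooling is used here purely as a stand-in for the BiLSTM composition described above, and all names and values are illustrative:

```python
import numpy as np

def compose_word_embeddings(wp_vectors, wp_to_word):
    """Group wordpiece vectors by word index and pool each group.

    wp_vectors: (num_wordpieces, dim) array of encoder outputs.
    wp_to_word: list mapping each wordpiece to its word index.
    Mean pooling is a simple stand-in; the paper's model runs a
    BiLSTM over each word's wordpieces instead.
    """
    n_words = max(wp_to_word) + 1
    out = np.zeros((n_words, wp_vectors.shape[1]))
    counts = np.zeros(n_words)
    for vec, w in zip(wp_vectors, wp_to_word):
        out[w] += vec
        counts[w] += 1
    return out / counts[:, None]

# e.g. a word split into 3 wordpieces followed by a 1-wordpiece word
wp = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
words = compose_word_embeddings(wp, [0, 0, 0, 1])
```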

BiLSTM sentence encoding layer
Once word representations are obtained, we further apply a BiLSTM to $\mathbf{x}_1 \cdots \mathbf{x}_n$ to obtain the following initial hidden representation of the i-th word:

$$\mathbf{h}_i = \mathrm{BiLSTM}_i(\mathbf{x}_1, \cdots, \mathbf{x}_n)$$

where $\mathrm{BiLSTM}_i$ refers to the i-th hidden representation obtained by applying a BiLSTM to the given sequence.

Decoder: Biaffine attention
To formulate a decoder using biaffine attention, let $\mathrm{BiAff}(\mathbf{x}, \mathbf{y})$ be a biaffine function, using the notations of (Dozat and Manning, 2018) and (Socher et al., 2013), as follows:

$$\mathrm{BiAff}(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{U} \mathbf{y} + \mathbf{W}[\mathbf{x}; \mathbf{y}] + \mathbf{b}$$

where $\mathbf{U}$ is a weight tensor, $\mathbf{W}$ is a weight matrix, and $\mathbf{b}$ is a vector for the bias term. Our biaffine attention decoder is similar to that of (Dozat and Manning, 2018) and is formulated as follows:

$$\mathbf{h}^{(arc\text{-}head)}_i = \mathrm{FFN}^{(arc\text{-}head)}(\mathbf{h}_i), \quad \mathbf{h}^{(arc\text{-}dep)}_i = \mathrm{FFN}^{(arc\text{-}dep)}(\mathbf{h}_i)$$
$$s^{(arc)}_{i,j} = \mathrm{BiAff}^{(arc)}\big(\mathbf{h}^{(arc\text{-}dep)}_i, \mathbf{h}^{(arc\text{-}head)}_j\big)$$
$$\mathbf{s}^{(label)}_{i,j} = \mathrm{BiAff}^{(label)}\big(\mathbf{h}^{(label\text{-}dep)}_i, \mathbf{h}^{(label\text{-}head)}_j\big) \in \mathbb{R}^k$$
$$s^{(top)}_i = \mathrm{FFN}^{(top)}(\mathbf{h}_i)$$

where k is the number of node labels, and f is the activation function used in the feed-forward layers $\mathrm{FFN}$.^4 In contrast to the setting of (Dozat and Manning, 2018), the top score $s^{(top)}_i$ is newly introduced in our model: we exploit a simple feed-forward layer for predicting top nodes instead of using an attention method.
Using the score functions of Eq. (1), the prediction results for arcs, labels, and top nodes are formulated as follows:

$$y^{(arc)}_{i,j} = I\big(s^{(arc)}_{i,j} > 0\big), \quad y^{(label)}_{i,j} = \arg\max_{k}\, s^{(label)}_{i,j,k}, \quad y^{(top)}_i = I\big(s^{(top)}_i > 0\big)$$

where $I(expr)$ is an indicator function which gives 1 if $expr$ is true and 0 otherwise.
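Under the formulation above, the scoring and decoding steps can be sketched with numpy as follows; the dimensions, random stand-in "role-dependent" vectors, and the scalar-arc/k-label split are illustrative assumptions rather than the exact trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 4, 3                       # words, hidden dim, edge labels

h_dep = rng.normal(size=(n, d))         # stand-in for FFN^{dep}(h_i)
h_head = rng.normal(size=(n, d))        # stand-in for FFN^{head}(h_j)

# Arc scorer: BiAff(x, y) = x^T U y + W [x; y] + b with a scalar output.
U = rng.normal(size=(d, d))
W = rng.normal(size=2 * d)
b = 0.0
arc_scores = np.array(
    [[h_dep[i] @ U @ h_head[j] + W @ np.r_[h_dep[i], h_head[j]] + b
      for j in range(n)] for i in range(n)])

# Label scorer: same form, but a (k, d, d) tensor gives k scores per pair.
U_l = rng.normal(size=(k, d, d))
label_scores = np.einsum('id,kde,je->ijk', h_dep, U_l, h_head)

# Decoding: an arc (i, j) exists iff its score is positive (sigmoid > 0.5);
# the label of a predicted arc is the argmax over the k label scores.
arcs = arc_scores > 0.0
labels = label_scores.argmax(axis=-1)
```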

Multi-level Biaffine attention
We also investigated a multi-level biaffine attention, in which information flows through increasingly abstract, higher-level representations. In semantic graph parsing, predicting an arc and a label may be resolved not by a single-level representation alone but by a combination of representations at various levels; for example, predicting an arc between two deep semantic subgraphs (with high depths) may require more abstract representations for those graphs than predicting an arc between two shallow semantic subgraphs (with low depths). The multi-level biaffine attention is based on the fusion of all the role-dependent representations across levels.^5 This type of multi-level attention differs from the deep biaffine attention of (Dozat and Manning, 2018), which uses only the single role-dependent hidden representation at the final level.
To formulate the multi-level biaffine attention, we first apply a deep BiLSTM encoder of L levels to the list of word embeddings $\mathbf{x}_1, \cdots, \mathbf{x}_n$ as follows:

$$\mathbf{r}_{i,l} = \mathrm{BiLSTM}^{(l)}_i(\mathbf{r}_{1,l-1}, \cdots, \mathbf{r}_{n,l-1}), \qquad \mathbf{r}_{i,0} = \mathbf{x}_i,$$

where $\mathbf{r}_{i,l}$ is the hidden representation of the BiLSTM at the l-th layer.
The role-dependent representation for each l-th layer is formulated as follows:

$$\mathbf{h}^{(role)}_{i,l} = \mathrm{FFN}^{(role)}_l(\mathbf{r}_{i,l})$$

^5 A pair of syntactic roles is considered for the role-dependent representations: head-dependent roles (or predicate-argument roles).
To aggregate all the role-dependent representations, we use the fusion function, denoted as $\mathbf{o} = \mathrm{fusion}(\mathbf{x}, \mathbf{y})$, as defined in (Hu et al., 2018):

$$\mathbf{r} = \tanh\big(\mathbf{W}_r[\mathbf{x}; \mathbf{y}; \mathbf{x} \circ \mathbf{y}]\big), \quad \mathbf{g} = \sigma\big(\mathbf{W}_g[\mathbf{x}; \mathbf{y}; \mathbf{x} \circ \mathbf{y}]\big), \quad \mathbf{o} = \mathbf{g} \circ \mathbf{r} + (1 - \mathbf{g}) \circ \mathbf{x}$$

where $\circ$ is element-wise multiplication. For notational simplicity, we further define $\mathrm{sfu}(\mathbf{x}, \mathbf{y}, \mathbf{z})$, the fusion function that takes three arguments, as follows:

$$\mathrm{sfu}(\mathbf{x}, \mathbf{y}, \mathbf{z}) = \mathrm{fusion}(\mathrm{fusion}(\mathbf{x}, \mathbf{y}), \mathbf{z})$$

Applying the sfu function results in the compositional role-dependent representations $\mathbf{z}^{(role)}_i$, which combine the role-dependent representations across levels. Similar to the arc scores in Eq. (3), we straightforwardly define multi-level terms related to the label scores using the corresponding fused label-role representations.
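As a rough numpy sketch of a gated fusion of this kind (the exact parameterization is that of Hu et al. (2018); the input features [x; y; x∘y] and the weight values chosen here are our illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion(x, y, Wr, Wg):
    """Gated fusion o = g*r + (1-g)*x over features [x; y; x*y] (a sketch)."""
    z = np.concatenate([x, y, x * y])   # inputs plus element-wise product
    r = np.tanh(Wr @ z)                 # candidate fused vector
    g = sigmoid(Wg @ z)                 # gate: how much of r to keep vs. x
    return g * r + (1.0 - g) * x

x = np.array([1.0, -1.0, 0.5])
y = np.array([0.2, 0.3, -0.4])
Wr = np.full((3, 9), 0.1)
Wg = np.zeros((3, 9))                   # zero gate weights -> g = 0.5
o = fusion(x, y, Wr, Wg)
```

With the zeroed gate weights, the output is exactly the average of the candidate vector and the skip input x, which makes the gating behavior easy to inspect.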

Property prediction based on BiLSTM
To predict frame information, which is one of the node properties in DM and PSD, we use a simple BiLSTM architecture with a single output layer that generates a node property for each word.^6 Unlike the biaffine attention model, the property predictor does not use BERT but a simple word representation that consists of the pretrained GloVe embedding and the POS tag embedding, as follows:

$$\mathbf{x}_i = [\mathbf{e}^{glove}_i; \mathbf{e}^{POS}_i]$$

For encoding a sentence, another BiLSTM is then applied to the sequence of word representations, as follows:

$$\mathbf{h}_i = \mathrm{BiLSTM}_i(\mathbf{x}_1, \cdots, \mathbf{x}_n)$$

Here, words (or tokens) correspond to nodes in a semantic graph.

The output layer uses the following simple affine transformation:

$$\mathbf{o}_i = \mathbf{W}\mathbf{h}_i + \mathbf{b}$$

The loss function uses the cross entropy, which is formulated for a single training sentence as follows:

$$L = -\sum_{i} \log\, \mathrm{softmax}_{g(i)}(\mathbf{o}_i)$$

where g(i) is the gold property value of the i-th word and $\mathrm{softmax}_k$ denotes the k-th element of the softmax values.^7
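The loss above amounts to standard softmax cross entropy over property values, as in this small sketch (the toy scores and gold indices are illustrative):

```python
import numpy as np

def softmax(o):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(o - o.max())
    return e / e.sum()

def property_loss(outputs, gold):
    """Cross entropy: L = -sum_i log softmax_{g(i)}(o_i)."""
    return -sum(np.log(softmax(o)[g]) for o, g in zip(outputs, gold))

# two words, three candidate property values; gold values are 0 and 1
outs = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
loss = property_loss(outs, [0, 1])
```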

Preprocessing
We use word tokens and their POS tags in the companion dataset provided by MRP 2019. To perform UCCA parsing using biaffine attention, conversion between UCCA and bilexical formats is required. For the conversion, we use the semstr tool, which is based on the head rules defined in (Hershcovich et al., 2017).

Multi-task learning on a single framework
In each semantic graph framework, the biaffine attention models consist of three subtasks: edge detection, edge labeling, and top node prediction. We jointly train the neural components of all the subtasks for each framework in the multi-task learning setting, using the following combined loss function:

$$L = \lambda_1 L^{(edge)} + \lambda_2 L^{(label)} + \lambda_3 L^{(top)}$$

where $L^{(edge)}$, $L^{(label)}$, and $L^{(top)}$ are the loss functions for edge detection, edge labeling, and top node prediction, respectively, and $\lambda_i$ is the weight for each loss function. However, the property predictor of Section 2.4 is not jointly trained on a single framework, because none of its neural components are shared with the biaffine attention models.
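The combined objective is a plain weighted sum of the subtask losses, e.g. (the loss values and weights below are purely illustrative; the paper does not report its λ values):

```python
# Hypothetical per-subtask losses for one framework
losses = {'edge': 0.8, 'label': 1.2, 'top': 0.1}
# lambda_i weights for each subtask (illustrative values)
weights = {'edge': 1.0, 'label': 1.0, 'top': 1.0}

# L = lambda_1 * L_edge + lambda_2 * L_label + lambda_3 * L_top
combined = sum(weights[t] * losses[t] for t in losses)
```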

Multi-task learning across frameworks
To enable multi-task learning across frameworks, we share the BERT-BiLSTM encoder as a common neural component across the three frameworks and use framework-specific models for the biaffine attention decoder. Our approach to multi-task learning is similar to the SHARED1 model of (Peng et al., 2017).
In multi-task learning, we alternate training examples for each framework using the framework-specific loss function of Eq. (6) such that, over each epoch, all the training examples across the three frameworks are fed fairly, without bias toward a specific framework.
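One simple way to realize this alternation is to shuffle the union of framework-tagged examples once per epoch, as in this sketch (the dataset contents and the helper name are illustrative, not the system's actual scheduler):

```python
import random

def interleave_epoch(datasets, seed=0):
    """Shuffle the union of framework-tagged examples so that one epoch
    visits every example of every framework without favoring any one."""
    pool = [(fw, ex) for fw, exs in datasets.items() for ex in exs]
    random.Random(seed).shuffle(pool)
    return pool

data = {'dm': ['d1', 'd2'], 'psd': ['p1'], 'ucca': ['u1', 'u2', 'u3']}
epoch = interleave_epoch(data)
```

Each item carries its framework tag, so the trainer can route it to the matching framework-specific decoder and loss while updating the shared encoder on every step.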

Hyperparameters
We used the Adam optimizer (Kingma and Ba, 2015) to train our biaffine attention models.

Preliminary Experiments

In this section, we present the preliminary experimental results, which compare variants of our models. For the preliminary experiments, we randomly split the MRP 2019 dataset into training and development sets; Table 2 shows the statistics of the training and development sets for the three frameworks. The evaluation measures are the unlabeled dependency F1 score (UF), the labeled dependency F1 score (LF), and top node prediction accuracy (Top). We report these metrics on the development sets.

Experimental results
We evaluated the following four biaffine attention methods: 1. Biaffine: This model is the baseline biaffine attention model based on the BiLSTM sentence encoder without using BERT.
2. BERT+Biaffine: This model uses the BERT-BiLSTM encoder of Section 2.1 and the biaffine attention model of Section 2.2.
3. BERT+Multi-level Biaffine: This model uses the BERT-BiLSTM encoder of Section 2.1 and the multi-level attention method of Section 2.3.

4. BERT+Biaffine+MTL: This model is the same as BERT+Biaffine but uses the multi-task learning across frameworks described in Section 3.3.

Table 3 shows the UF, LF, and Top scores on the three semantic graph frameworks, comparing the four variants of biaffine attention models. BERT+Biaffine performs better than Biaffine, in particular obtaining increases of about 5% in UF and LF on the UCCA framework. However, BERT+Multi-level Biaffine does not achieve any further improvement, yielding performances similar to (or weaker than) those of BERT+Biaffine on the PSD and UCCA frameworks.
BERT+Biaffine+MTL achieves only small improvements on the UCCA framework, whereas no improvements are observed on the DM and PSD frameworks.
A statistically insignificant improvement for multi-task learning in BERT+Biaffine+MTL was similarly reported in the SHARED1 results of (Peng et al., 2017). These results imply that, instead of naively using only a shared encoder, more advanced multi-task learning approaches, such as adding task-specific encodings as detailed in (Peng et al., 2017), need to be considered.

Official Results
Given the preliminary results, we chose the basic biaffine model, BERT+Biaffine of Table 3, for the final submission to MRP 2019. The official results using BERT+Biaffine are summarized in Tables 4 and 5, which compare our system with ERG (Oepen and Flickinger, 2019) and TUPA (Hershcovich and Arviv, 2019), whose results were provided by the task organizers. Table 4 shows the performances under the MRP metrics on the three frameworks, whereas Table 5 presents the performances under the task-specific SDP and UCCA metrics. The SDP metrics use the unlabeled dependency precision/recall/F1 (UP/UR/UF), the labeled dependency precision/recall/F1 (LP/LR/LF), and the unlabeled/labeled exact matches (UM/LM). The UCCA metrics use the unlabeled and labeled arc precision/recall/F1 for primary, remote, and all types of arcs.^8 Overall, our system shows better performances than the baseline TUPA system, except under the UCCA metrics. Compared to ERG, the top-performing system in the MRP metric on DM, our biaffine system shows slightly improved performance in terms of UF under the SDP metric. Compared to the published MRP metrics of the best system (i.e., the MRP all metric), the performances of our system are about 1.5 percentage point (

Summary and Conclusion
In this paper, we presented Jeonbuk National University's system based on unified biaffine attention models for the DM, PSD, and UCCA frameworks of the MRP 2019 task. We investigated extensions of the original biaffine models using multi-level biaffine attention and multi-task learning. The preliminary experimental results show that the use of multi-level models and multi-task learning had no effect on MRP performance under our current settings. The statistically insignificant results of multi-task learning imply that there may be necessary conditions, beyond the default setting, that must be met before multi-task learning with parameter sharing becomes effective. In this direction, we plan to explore why multi-task learning is not effective in our current experiments, to postulate reasonable hypotheses that help clarify the effect of multi-task learning, and to examine other advanced multi-task learning methods, including the approaches of (Peng et al., 2017). In addition, we would like to examine alternative fusion functions for multi-level biaffine attention.

Table 5: The official results of task-specific metrics on the three frameworks, comparing ERG (Oepen and Flickinger, 2019), TUPA (Hershcovich and Arviv, 2019), and our system (BERT+Biaffine).