Improving AMR Parsing with Sequence-to-Sequence Pre-training

Research on abstract meaning representation (AMR) parsing has long been constrained by the size of human-curated datasets, which is critical to building an AMR parser with good performance. To alleviate this data-size restriction, pre-trained models have been drawing more and more attention in AMR parsing. However, previous pre-trained models, such as BERT, are built for general purposes and may not work as expected for the specific task of AMR parsing. In this paper, we focus on sequence-to-sequence (seq2seq) AMR parsing and propose a seq2seq pre-training approach that builds pre-trained models, either singly or jointly, on three relevant tasks: machine translation, syntactic parsing, and AMR parsing itself. Moreover, we extend the vanilla fine-tuning method to a multi-task learning fine-tuning method that optimizes for the performance of AMR parsing while endeavoring to preserve the response of the pre-trained models. Extensive experimental results on two English benchmark datasets show that both the single and joint pre-trained models significantly improve the performance (e.g., from 71.5 to 80.2 on AMR 2.0), reaching the state of the art. The result is very encouraging since we achieve this with seq2seq models rather than complex models. We make our code and model available at https://github.com/xdqkid/S2S-AMR-Parser.

incorporated into the training of an AMR parser. However, widely used pre-trained models such as ELMo and BERT (Devlin et al., 2019) may not work as expected for building a state-of-the-art seq2seq AMR parser. The reasons are two-fold. On the one hand, previous studies on both seq2seq-based AMR parsing and AMR-to-text generation demonstrate the necessity of a shared vocabulary for the source and target sides (Ge et al., 2019). Using pre-trained models like BERT as pre-trained encoders for AMR parsing, however, violates the rule of sharing a vocabulary. On the other hand, pre-trained models such as BERT are basically tuned for representing sentences rather than generating target sequences. According to Zhu et al. (2020), rather than using BERT directly as the encoder, a more reasonable approach is to use BERT as an extra feature or as an extra encoder. See Section 5.1 for a more detailed discussion of the effect of BERT on AMR parsing.
In this paper, we propose to pre-train seq2seq models that aim to capture different linguistic knowledge from input sentences. To build such pre-trained models, we explore three different yet relevant seq2seq tasks, as listed in Table 1. Here, machine translation acts as the most representative seq2seq task, which takes a bilingual dataset as the training data. According to Shi et al. (2016) and Li et al. (2017), a machine translation system with good performance requires the model to derive linguistic information well from input sentences. The other two tasks require auto-parsed syntactic parse trees and AMR graphs as the training data, respectively. It is worth noting that the pre-training task of AMR parsing is similar in spirit to self-training (Konstas et al., 2017).
In order to investigate whether various seq2seq pre-trained models are complementary to each other, in the sense that they can be learned jointly to achieve better performance, we further explore joint learning of several pre-training tasks and evaluate its effect on AMR parsing. In addition, motivated by Li and Hoiem (2018), we extend the vanilla fine-tuning method to optimize for both the performance of AMR parsing and response preservation of the pre-trained models. Detailed experimentation on two widely used English benchmarks shows that our approach substantially improves the performance and greatly advances the state of the art. This is very encouraging since we achieve the state of the art by simply making use of the generic seq2seq framework rather than designing sophisticated AMR parsing models.

Baseline: AMR Parsing as Seq2Seq Learning
Seq2Seq Modeling. The encoder in the Transformer (Vaswani et al., 2017) consists of a stack of multiple identical layers, each of which has two sub-layers: one implements the multi-head self-attention mechanism and the other is a position-wise fully connected feed-forward network. The decoder is also composed of a stack of multiple identical layers. Each layer in the decoder consists of the same sub-layers as in the encoder layers plus an additional sub-layer that performs multi-head attention over the output of the encoder stack. See Vaswani et al. (2017) for more details.
Pre-Processing: Linearize AMR Graph to Target Sequence. As in van Noord and Bos (2017), we obtain simplified AMRs by removing variables and wiki links. Variables in AMR graphs are only necessary to indicate co-referring nodes and they do not carry any semantic information by themselves. Therefore, AMR graphs are first converted into AMR trees by removing variables and duplicating the co-referring nodes. Then newlines present in an AMR tree are replaced by spaces to get a sequence. Figure 1(c) illustrates the linearization result of the AMR graph in Figure 1(b). Based on the data of sentences paired with linearized AMR graphs, we train a seq2seq model whose outputs are also linearized AMRs.
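The linearization steps above can be sketched in a few lines of Python. This is only an illustrative approximation, not the van Noord and Bos (2017) script: the function name `linearize_amr` and the regular expressions are assumptions, and edge cases (for instance a constant that happens to equal a variable name) are not handled.

```python
import re

def linearize_amr(penman: str) -> str:
    """Simplify a PENMAN-style AMR string into a one-line, variable-free sequence."""
    # Map each variable to its concept, e.g. "b" -> "boy".
    var_to_concept = dict(
        re.findall(r'\(\s*([^\s()/]+)\s*/\s*([^\s()]+)', penman)
    )
    # Drop wiki links such as :wiki "Barack_Obama" or :wiki -
    text = re.sub(r':wiki\s+(?:"[^"]*"|-)', '', penman)
    # Remove variable declarations: "(b / boy" becomes "(boy".
    text = re.sub(r'\(\s*[^\s()/]+\s*/\s*', '(', text)

    # Duplicate co-referring nodes: a bare variable after a role
    # is replaced by a copy of the concept it refers to.
    def expand(match):
        role, token = match.group(1), match.group(2)
        concept = var_to_concept.get(token)
        return f'{role} ({concept})' if concept else match.group(0)

    text = re.sub(r'(:[^\s()]+)\s+([^\s()":]+)(?=[\s)])', expand, text)
    # Replace newlines and repeated spaces with single spaces.
    return ' '.join(text.split())

example = """(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
      :ARG0 b))"""
linearized = linearize_amr(example)
```

In this toy input the re-entrant variable `b` is expanded into a second copy of `(boy)`, which is the tree conversion described above.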
Post-Processing: Recover AMR Graph from Target Sequence. The output from the Transformer is an AMR sequence without variables, wiki links, or co-referring nodes. Moreover, the output may contain brackets that do not match, resulting in incomplete concepts. To recover the full graph, post-processing should restore the information removed in pre-processing by assigning a unique variable to each concept, pruning duplicated and redundant material, performing Wikification, and restoring co-referring nodes. Meanwhile, it should fix incomplete concepts. We use the pre-processing and post-processing scripts provided by van Noord and Bos (2017). 1
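One of the repairs mentioned above, fixing brackets that do not match, can be illustrated with a small sketch. The function below is a hypothetical stand-in rather than the logic of the van Noord and Bos (2017) scripts: it simply drops closing brackets that have no opener and appends the closing brackets that are missing at the end.

```python
def balance_brackets(seq: str) -> str:
    """Return seq with unmatched ')' removed and missing ')' appended."""
    depth = 0
    kept = []
    for ch in seq:
        if ch == '(':
            depth += 1
        elif ch == ')':
            if depth == 0:
                continue  # drop a ')' that has no matching '('
            depth -= 1
        kept.append(ch)
    # Close every bracket that is still open at the end of the sequence.
    return ''.join(kept) + ')' * depth

repaired = balance_brackets("(want-01 :ARG0 (boy")
```

A real post-processor would additionally have to decide what to do with truncated concepts or roles, which this sketch leaves untouched.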

Seq2Seq Pre-training for AMR Parsing
In this section, we first present our single pre-training approach, followed by the joint pre-training approach on two or more pre-training tasks. Then we present our fine-tuning methods.

Single Pre-training
To be consistent with the seq2seq model for AMR parsing, the pre-trained models in this paper are all built on the Transformer. That is, for each pre-training task listed in Table 1, we learn a seq2seq model which will be used to initialize the seq2seq model for AMR parsing in the fine-tuning phase. When building the pre-trained models, we merge all the source and target sides of the three pre-training tasks and construct a shared vocabulary. Moreover, in all the models we share vocabulary embeddings between the source and target sides.
PTM-MT is a seq2seq neural machine translation (NMT) model which is trained on a publicly available bilingual dataset. According to findings in Goldberg (2019) and Jawahar et al. (2019), the Transformer encoder is strong in capturing syntax and semantics from source sentences, which is helpful to AMR parsing.
PTM-SynPar is a seq2seq constituent parsing model. Building such a model requires a training dataset which consists of sentences paired with constituency parse trees. To construct a silver treebank, we parse the English sentences in the bilingual data for MT by using an off-the-shelf parser. Then we linearize the automatic parse trees to get syntax sequences, as illustrated in Figure 2. Note that in the linearization, we let the output contain the words from the source sentence. The motivation here is to regard parsing as a language generation problem, similar to the idea in Choe and Charniak (2016).
PTM-SemPar is a seq2seq AMR parsing model trained on a silver corpus of auto-parsed AMR graphs. To construct such a corpus, we apply the baseline system of AMR parsing to process the English sentences in the bilingual MT corpus. Then we adopt the linearization process illustrated in Figure 1 to obtain source-target pairs. Finally, we train a seq2seq-based AMR parsing model on the silver corpus that will be used as a pre-trained model.
1 https://github.com/RikVN/AMR
Figure 3: Illustration of the joint pre-training approach.

Joint Pre-training
Intuitively, the single pre-trained models described above can capture linguistic features from different perspectives. One question is whether these models are complementary when they are properly used to initialize a seq2seq-based AMR parser. To answer this question empirically, we propose to build pre-trained models by jointly learning multiple pre-training tasks. Inspired by the zero-shot approach proposed for multi-lingual neural machine translation (Johnson et al., 2017), we add a unique preceding tag to the target side of the training data to distinguish the task of each training instance, as illustrated in Figure 3.
With such tagged training instances, multi-task learning is quite straightforward. We simply combine the training data of all the pre-training tasks under consideration and feed the combined training data to the Transformer model. The training process interleaves training data from each task. For example, we update parameters on a batch of training instances from task1, then update parameters on a batch of training instances from task2, and the process iterates. With this joint training strategy, we obtain four joint pre-trained models, i.e., PTM-MT-SynPar, PTM-MT-SemPar, PTM-SynPar-SemPar, and PTM-MT-SynPar-SemPar. The model names indicate which pre-training tasks are learned jointly.
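The tagging and batch interleaving described above might look as follows in Python. This is a minimal sketch under stated assumptions: the tag strings (`<mt>`, `<syn>`), the helper names `tag_instances` and `interleave_batches`, and the toy batches are all hypothetical, and the actual parameter update is left out.

```python
def tag_instances(task_tag, pairs):
    """Prepend the task tag to the target side of every (source, target) pair."""
    return [(src, f"{task_tag} {tgt}") for src, tgt in pairs]

def interleave_batches(task_batches):
    """Yield (task, batch) in round-robin order until every task is exhausted."""
    iterators = {name: iter(batches) for name, batches in task_batches.items()}
    while iterators:
        for name in list(iterators):
            try:
                yield name, next(iterators[name])
            except StopIteration:
                del iterators[name]  # this task has no batches left

# Toy data: each "batch" is a list of tagged (source, target) pairs.
mt_batches = [tag_instances("<mt>", [("the boy", "der Junge")]),
              tag_instances("<mt>", [("a girl", "ein Mädchen")])]
syn_batches = [tag_instances("<syn>", [("the boy", "(NP (DT the) (NN boy))")])]

schedule = [task for task, _ in
            interleave_batches({"mt": mt_batches, "syn": syn_batches})]
```

In a training loop, each yielded batch would trigger one parameter update, so consecutive updates alternate between tasks as long as both still have data.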

Fine-tuning Methods
Given a pre-trained model, we can directly fine-tune it on a gold AMR corpus to train an AMR parser. For this purpose we use two different fine-tuning methods. In the following we first present the vanilla fine-tuning method, and then extend it under the framework of multi-task learning. For simplicity, we refer to the latter method as Multi-Task Learning (MTL) fine-tuning hereafter.
Vanilla Fine-Tuning optimizes the parameters of an existing pre-trained seq2seq model for AMR parsing on a gold AMR corpus. Fine-tuning adapts the shared parameters to make them more discriminative for AMR parsing, and a low learning rate serves as an indirect mechanism to preserve some of the representational structure captured in the pre-trained models.
MTL Fine-Tuning is designed to address a potential drawback of the vanilla fine-tuning method: optimizing the model parameters solely for AMR parsing carries a risk of overfitting. Inspired by Li and Hoiem (2018), we propose to optimize for high accuracy of AMR parsing while preserving the performance on the pre-training tasks. Preserving the performance on the pre-training tasks can be regarded as a regularizer for the training of AMR parsing. To implement MTL fine-tuning, we once again adopt the generic multi-task learning framework depicted in Figure 3. The remaining question is how to obtain fine-tuning instances for the pre-training tasks. To this end, we feed the input sentences of the gold AMR corpus to the pre-trained model under consideration and use its outputs as fine-tuning instances for the pre-training tasks. Formally, given an instance $\{s, t^{(0)}\}$ of the fine-tuning task and a pre-trained model learned from $k$ pre-training tasks, we first feed the pre-trained model with input $s$ and obtain its $k$ outputs $t^{(1)}, \cdots, t^{(k)}$ for the $k$ pre-training tasks, respectively. Each input $s$ in the fine-tuning task is thus equipped with $k + 1$ outputs: one for the fine-tuning task and the other $k$ for the $k$ pre-training tasks. Each output is preceded by a unique tag that indicates the corresponding task.
Note that we do not apply MTL fine-tuning to the pre-training task of AMR parsing, because the fine-tuning task is the same as the pre-training task. For example, for the pre-trained model PTM-MT-SynPar-SemPar, MTL fine-tuning keeps only the pre-training tasks of MT and syntactic parsing.
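The construction of the k + 1 fine-tuning instances can be made concrete with a short sketch. Everything named here is an assumption for illustration: `build_mtl_instances`, the tag strings, and the `pretrained_decoders` mapping, whose callables stand in for decoding the input sentence with the pre-trained model under each pre-training task tag.

```python
def build_mtl_instances(source, gold_amr, pretrained_decoders, amr_tag="<amr>"):
    """Return k + 1 tagged (source, target) pairs for one gold AMR instance.

    pretrained_decoders maps a task tag to a callable that returns the
    pre-trained model's output for that task (a hypothetical interface).
    """
    # The fine-tuning task itself: gold linearized AMR, tagged as AMR parsing.
    instances = [(source, f"{amr_tag} {gold_amr}")]
    # One extra instance per retained pre-training task, using the
    # pre-trained model's own output as the target to be preserved.
    for tag, decode in pretrained_decoders.items():
        instances.append((source, f"{tag} {decode(source)}"))
    return instances

# Toy stand-ins for the pre-trained model's MT and syntactic-parsing outputs.
decoders = {
    "<mt>": lambda s: "der Junge will gehen",
    "<syn>": lambda s: "(S (NP the boy) (VP wants to go))",
}
instances = build_mtl_instances(
    "the boy wants to go",
    "(want-01 :ARG0 (boy) :ARG1 (go-01 :ARG0 (boy)))",
    decoders,
)
```

Following the note above, a decoder for the AMR pre-training task would simply be left out of the mapping, so a model pre-trained on MT, syntactic parsing, and AMR parsing yields three instances per sentence, as in this toy case.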

Experimentation
In this section, we report the performance of our seq2seq pre-training approach to AMR parsing.

Experimental Settings
Pre-training Dataset and Pre-trained Models. For pre-training, we use the WMT14 English-to-German dataset 2 which consists of about 3.9M training sentence pairs after filtering out long and imbalanced pairs. To obtain syntactic parse trees for the source sentences, we use the AllenNLP toolkit, whose parser is trained on the Penn Treebank (Marcus et al., 1993). To obtain AMR graphs for the source sentences, we use our baseline AMR parsing system. We then merge the English and German sentences, the linearized parse trees, and the AMR graphs, and segment all tokens into subwords by byte pair encoding (BPE) (Sennrich et al., 2016) with 20K operations.
We implement the above pre-trained models with OpenNMT-py (Klein et al., 2017). 3 For simplicity, all models use the Transformer-base configuration of Vaswani et al. (2017). The number of layers in both the encoder and the decoder is 6, and the number of heads is 8. Both the embedding size and the hidden size are 512, and the size of the feed-forward network is 2048. Moreover, we use the Adam optimizer (Kingma and Ba, 2015) with β 1 of 0.9 and β 2 of 0.998. The number of warm-up steps, learning rate, dropout rate, and label smoothing epsilon are 16000, 2.0, 0.1, and 0.1, respectively. In addition, we set the batch size to 8,192 tokens. We train the models for 300K steps and choose the model with the best performance on the WMT14 English-to-German development set as the final pre-trained model.
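A learning rate of 2.0 combined with 16000 warm-up steps is most naturally read as a scaling factor on a warm-up schedule rather than a raw step size. The text does not name the schedule, so the sketch below assumes the inverse-square-root ("Noam") schedule of Vaswani et al. (2017); the function name `noam_lr` and its default arguments are illustrative only.

```python
def noam_lr(step, factor=2.0, d_model=512, warmup=16000):
    """Assumed schedule: linear warm-up followed by inverse-square-root decay.

    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Effective learning rates at a few training steps under this assumption.
rates = {step: noam_lr(step) for step in (1000, 8000, 16000, 32000, 300000)}
```

Under this assumption the effective rate grows linearly up to step 16000 and then decays with the inverse square root of the step count, so the nominal value of 2.0 is never used directly as a step size.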

AMR Parsing Benchmarks
We evaluate AMR performance on AMR 1.0 (LDC2015E86) and AMR 2.0 (LDC2017T10). The two datasets contain 16,833 and 36,521 training AMRs, respectively, and share 1,368 development AMRs and 1,371 testing AMRs. All the source sentences and linearized AMRs are segmented into subwords by using the BPE trained for the pre-trained models.
To fine-tune the pre-trained models for AMR parsing, we follow the settings of hyper-parameters used for training pre-trained models.
Evaluation Metrics. For evaluation, we use the AMR-evaluation toolkit to measure parsing performance in Smatch and other fine-grained metrics (Damonte et al., 2017). We report results of single models that are tuned on the development set.

Experimental Results
Table 2 presents the comparison of our approach and related studies on the test sets of AMR 1.0 and AMR 2.0. From the results, we have the following observations:
• Pre-trained models on a single task (i.e., from #2 to #6) significantly improve the performance of AMR parsing, indicating that seq2seq pre-training is helpful for seq2seq-based AMR parsing. We also note that the pre-trained model of NMT achieves the best performance, followed by the pre-trained models on AMR parsing and on syntactic parsing. This indicates that seq2seq AMR parsing benefits more from pre-training tasks that require the encoder to capture the semantics of source sentences.
• Joint pre-trained models on two or more pre-training tasks further improve the performance of AMR parsing. However, in the presence of the NMT pre-training task, the benefits of joint pre-training with AMR parsing, syntactic parsing, or both shrink.
• MTL fine-tuning consistently outperforms the vanilla fine-tuning method. For example, on single pre-training tasks, MTL fine-tuning outperforms vanilla fine-tuning by 1.5 to 2.0 Smatch F1 points, whereas on joint pre-training tasks its improvements over vanilla fine-tuning are smaller.
• With roughly twice as many training sentences, performance on AMR 2.0 is overall higher than on AMR 1.0. However, the gap between the performance on AMR 2.0 and AMR 1.0 gets smaller when we move from single pre-training models to joint pre-training models. For example, based on PTM-MT-SynPar-SemPar, the performance gap is 1.1 Smatch F1 points, much smaller than the gap of 6.9 between the corresponding baselines.
• Finally, our approach achieves the best reported performance on AMR 1.0, and its performance on AMR 2.0 is higher than or close to that achieved by previous studies which use BERT. This is very encouraging considering that our seq2seq model is much simpler than the graph-based models proposed in related studies (Zhang et al., 2019a,b; Naseem et al., 2019; Cai and Lam, 2020).
Table 3 compares the performance of our best system with recently reported systems on fine-grained metrics. We obtain the best performance for Reentrancies, NER, and SRL. Compared to the systems of Z'19a, Z'19b, and C'20, we achieve lower performance for Wiki and Negations. One possible reason for our relatively lower performance on Wiki and Negations is that, unlike these three systems, we do not anonymize named entities and do not use an extra algorithm to add polarity attributes.

Analysis and Discussion
In this section, we conduct more analysis on AMR parsing with pre-trained models. In the following all the results are obtained on AMR 2.0.

Effect of BERT on Seq2Seq AMR Parsing
[Table 3 caption, partially recovered: ... Zhang et al. (2019b), C'20 for Cai and Lam (2020)]
To explore the effect of BERT on seq2seq AMR parsing, motivated by Zhu et al. (2020), we use BERT in various ways to boost the performance of AMR parsing. Given an input sentence $x = (x_1, \cdots, x_n)$ with $n$ words, the BERT tokenizer segments it into a subword sequence $x' = (x'_1, \cdots, x'_m)$ with $m$ subwords. Then BERT returns a hidden state sequence $b \in \mathbb{R}^{m \times d_{\text{BERT}}}$, where $d_{\text{BERT}}$ is the size of BERT hidden states (e.g., $d_{\text{BERT}} = 768$ in our experiment). Figure 4 illustrates the process of obtaining BERT hidden states for an input sentence. Next we use the following methods to incorporate the BERT hidden states $b$ into Transformer-based AMR parsing.
• BERT as embedding, which uses $f(bW_B)$ as the input of the Transformer encoder, where $W_B \in \mathbb{R}^{d_{\text{BERT}} \times d}$ are model parameters to be learned, $d$ is the model size for seq2seq AMR parsing, and $f$ is the ReLU activation function.
• BERT as encoder, which uses $f(bW_B)$ as the output of the Transformer encoder. That is to say, we replace the Transformer encoder with BERT.
• BERT as extra feature, which views $b$ as extra features for an input sentence $x'$. The input of the Transformer encoder is defined as $f([b, (\mathrm{Emb}(x') + \mathrm{Pos}(x'))]W_E)$, where $[\cdot, \cdot]$ represents the operation of concatenation, $\mathrm{Emb}(x')$ and $\mathrm{Pos}(x')$ return the word embeddings and position embeddings of $x'$ respectively, and $W_E \in \mathbb{R}^{(d + d_{\text{BERT}}) \times d}$ are model parameters to be learned.
• BERT as extra encoder, which adds a sub-layer, i.e., a BERT-context-attention sub-layer, in the Transformer decoder after the masked self-attention sub-layer and the context-attention sub-layer. The BERT-context-attention sub-layer works in a similar way to the context-attention sub-layer by attending to the BERT hidden states $f(bW_B)$.
Meanwhile, we also provide another Transformer-based baseline in which we segment input sentences into subwords with the BERT tokenizer. For all the above experiments, the source-side vocabulary is the set of subwords in training sentences segmented by the BERT tokenizer, while the target-side vocabulary is the set of subwords in training AMRs segmented by the BPE mentioned in Section 4.1.
Table 4 compares the performance of AMR parsing when incorporating BERT in various ways. By comparing the performance of #1 in Table 4 against the baseline #1 in Table 2, we observe a drop in Smatch F1 score from 71.5 to 70.0, indicating that it is important to share a vocabulary for seq2seq AMR parsing. Based on the baseline of not sharing a vocabulary, the four methods of incorporating BERT result in very different performance, ranging from 71.5 to 75.2 in Smatch F1 score. Among them, incorporating BERT as embedding or as extra feature achieves similar performance, which is much higher than that of incorporating BERT as either encoder or extra encoder. This suggests that rather than feeding BERT hidden states straight into a decoder, it is important to feed them into an encoder first. However, our pre-trained seq2seq models, even on a single pre-training task (i.e., #3, #5, #6), outperform using BERT, indicating the effectiveness of pre-trained seq2seq models for AMR parsing.
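The two variants that feed BERT states into the encoder input, "BERT as embedding" and "BERT as extra feature", reduce to simple matrix operations, sketched below with NumPy. This is a shape-level illustration only: random matrices stand in for the BERT hidden states, the embeddings, and the learned parameters, and the function names are assumptions rather than part of any released code.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bert_as_embedding(b, W_B):
    """Encoder input f(b W_B): project BERT states to the model size d."""
    return relu(b @ W_B)

def bert_as_extra_feature(b, emb, pos, W_E):
    """Encoder input f([b, Emb(x') + Pos(x')] W_E)."""
    features = np.concatenate([b, emb + pos], axis=-1)  # (m, d_BERT + d)
    return relu(features @ W_E)

rng = np.random.default_rng(0)
m, d_bert, d = 5, 768, 512                  # subwords, BERT size, model size
b = rng.standard_normal((m, d_bert))        # stand-in for BERT hidden states
emb = rng.standard_normal((m, d))           # stand-in for Emb(x')
pos = rng.standard_normal((m, d))           # stand-in for Pos(x')
W_B = rng.standard_normal((d_bert, d)) * 0.02
W_E = rng.standard_normal((d_bert + d, d)) * 0.02

enc_in_embedding = bert_as_embedding(b, W_B)
enc_in_feature = bert_as_extra_feature(b, emb, pos, W_E)
```

Both variants produce an m × d matrix, so the rest of the Transformer encoder can stay unchanged; the "encoder" and "extra encoder" variants would instead route $f(bW_B)$ to the decoder's attention and are not covered here.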

Effect of Training Data Sizes on Pre-training Models
In this section we investigate the impact of the size of the pre-training data to check whether AMR parsing benefits more from pre-trained models that are trained on larger datasets. To this end, we randomly sample 20%, 40%, 60%, and 80% of the full pre-training instances to train the pre-trained models, respectively. As shown in Figure 5, except for syntactic parsing (i.e., PTM-SynPar), the pre-training models on the other three kinds of pre-training tasks achieve higher AMR parsing performance as the training data size increases. Based on the learning curve, we suspect there is still much room for further improvement if we enlarge the training data of the pre-training tasks.

Effect of Different Pre-Training Components on Seq2Seq AMR Parsing
When adapting a pre-trained model to AMR parsing, we initialize the whole seq2seq Transformer model of AMR parsing with its counterpart in the pre-trained model. However, it remains unclear which part of the initialization contributes most. To this end, we decompose the whole seq2seq model into three components, i.e., the (shared) word embedding, the encoder, and the decoder. The three components account for 31.1%, 29.5%, and 39.4% of the parameters, respectively. We then conduct an ablation study in which components are cumulatively initialized from the pre-trained model while the remaining components are randomly initialized.
We use the PTM-MT-SynPar-SemPar pre-trained model as a representative (i.e., #14 in Table 2). By initializing only the word embedding from the pre-trained model, we substantially boost the performance from 71.5 in Smatch F1 score to 78.4, while also initializing the other two components with the pre-trained model leads to another 1.8 improvement in Smatch F1 score (i.e., from 78.4 to 80.2).
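The cumulative initialization used in this ablation can be sketched with plain dictionaries standing in for model parameters. The parameter naming scheme (`embedding.`, `encoder.`, `decoder.` prefixes) and the helper `init_from_pretrained` are hypothetical and do not reflect the OpenNMT-py checkpoint layout.

```python
def init_from_pretrained(random_params, pretrained_params, components):
    """Copy the listed components from the pre-trained model; keep the rest random."""
    merged = dict(random_params)
    for name, value in pretrained_params.items():
        # The component is taken to be the prefix before the first dot.
        if name.split('.')[0] in components:
            merged[name] = value
    return merged

# Toy parameters: strings mark where each value came from.
random_params = {"embedding.weight": "rand",
                 "encoder.layer0.weight": "rand",
                 "decoder.layer0.weight": "rand"}
pretrained_params = {name: "pretrained" for name in random_params}

# Accumulate components: embedding only, then encoder, then decoder.
ablation = [
    init_from_pretrained(random_params, pretrained_params, {"embedding"}),
    init_from_pretrained(random_params, pretrained_params, {"embedding", "encoder"}),
    init_from_pretrained(random_params, pretrained_params,
                         {"embedding", "encoder", "decoder"}),
]
```

Each configuration in `ablation` would then be fine-tuned on the gold AMR corpus in the same way, so that differences in Smatch can be attributed to the initialized components.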

Effect of Pre-trained Models Trained on Different Datasets
As shown in Table 2, the pre-trained model PTM-SynPar (or PTM-SemPar) significantly improves the performance of AMR parsing from 71.5 to 75.3 (or 77.9) in Smatch F1 score. However, in the presence of PTM-MT, joint pre-training with PTM-SynPar, PTM-SemPar, or both gives at most another 1.0 improvement, suggesting that complementarity among the pre-trained models does exist but is relatively limited. We suspect that this overlap is mainly due to the fact that we pre-train these models on the same source-side dataset. We conjecture that more improvement is potentially reachable if the pre-training tasks are trained on different datasets. To test the conjecture, we construct another silver dataset for both syntactic parsing and AMR parsing that is of the same size (i.e., 3.9M) as before. This is done by randomly selecting 3.9M English sentences from the WMT14 English monolingual language model training data. 4 Table 6 compares the Smatch F1 scores. From it, we observe consistent improvement when the pre-trained models are jointly trained on different datasets. For example, by replacing the pre-training dataset of AMR parsing with the newly constructed dataset, we improve AMR parsing from 80.1 in Smatch F1 score to 81.4. This suggests that assigning different datasets to different pre-training tasks improves the performance of AMR parsing.

Effect of Different Bilingual Datasets
For the pre-training task of machine translation, we have chosen English-to-German (EN-DE) with 3.9M sentence pairs. However, it is still unclear whether it is critical to choose the right language pair. To this end, we move to WMT14 English-to-French (EN-FR) translation and randomly select 3.9M sentence pairs from its training dataset, the same size as for EN-DE translation. Table 7 compares the Smatch F1 scores when the pre-trained models are trained on different bilingual datasets. From it, we observe that pre-training on the EN-FR dataset achieves even slightly higher performance than pre-training on the EN-DE dataset. This further confirms our finding that AMR parsing can greatly benefit from machine translation.

Related Work
We describe related work from two perspectives: pre-training and AMR parsing.
Pre-training. Pre-training a universal model and then fine-tuning the model on a downstream task has recently become a popular strategy in the field of natural language processing. Previous work on pre-training can be roughly grouped into two categories. One category of approaches learns static word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), while the other builds dynamic pre-trained models that are also used in downstream tasks. Representative examples in the latter group include Dai and Le (2015), CoVe (McCann et al., 2017), ELMo, Edunov et al. (2019), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2019). Besides the aforementioned encoder-only (e.g., BERT) or decoder-only (e.g., GPT) pre-training approaches, recent studies also propose approaches to pre-training seq2seq models, such as MASS (Song et al., 2019), PoDA (Wang et al., 2019), PEGASUS (Zhang et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020).
AMR Parsing. As a semantic parsing task that translates texts into AMR graphs, AMR parsing has received much attention in recent years. Diverse approaches have been applied to the task. Flanigan et al. (2014) pioneer the research work on AMR parsing by using a two-stage approach: node identification followed by relation recognition. Werling et al. (2015) improve the first stage in the parser of Flanigan et al. (2014) by generating subgraphs aligned to lexical items. To avoid conducting AMR parsing from scratch, Wang et al. (2015b) propose to obtain AMR graphs from dependency trees by using a transition-based method. Wang et al. (2015a) extend their previous work by introducing a new transition action to get better performance. Damonte et al. (2017) propose a complete transition-based approach that parses sentences left-to-right in linear time. Recent neural AMR parsing can be roughly grouped into two categories. On the one hand, generic seq2seq-based approaches have been widely used for AMR parsing and show competitive performance (Peng et al., 2017; van Noord and Bos, 2017; Konstas et al., 2017; Ge et al., 2019). On the other hand, to better model the graph structure on the target side, graph-based models are well studied for AMR parsing and achieve state-of-the-art performance (Lyu and Titov, 2018; Guo and Lu, 2018; Groschwitz et al., 2018; Zhang et al., 2019a,b; Cai and Lam, 2020).

Conclusion
In this paper we proposed a seq2seq-based pre-training approach to improving the performance of seq2seq-based AMR parsing. To this end, we designed three relevant seq2seq learning tasks: machine translation, syntactic parsing, and AMR parsing itself. Then we built seq2seq pre-trained models through either single or joint pre-training tasks. Detailed experimentation shows that both the single and joint pre-trained models substantially improve our baseline and that the performance reaches the state of the art. The accomplishment is encouraging since we achieve this simply by using the generic seq2seq framework rather than complex models.