Rewarding Smatch: Transition-Based AMR Parsing with Reinforcement Learning

We enrich the Stack-LSTM transition-based AMR parser (Ballesteros and Al-Onaizan, 2017) by augmenting training with Policy Learning and rewarding the Smatch score of sampled graphs. In addition, we combine several AMR-to-text alignments with an attention mechanism, and we supplement the parser with pre-processed concept identification, named entities and contextualized embeddings. We achieve highly competitive performance that is comparable to the best published results. We present an in-depth study ablating each of the new components of the parser.


Introduction
Abstract meaning representations (AMRs) (Banarescu et al., 2013) are rooted, labeled, directed acyclic graphs that provide a broad-coverage, sentence-level (non-intersentential) semantic abstraction of natural language. AMR parsing thus requires solving several natural language processing tasks: named entity recognition, word sense disambiguation, and joint syntactic and semantic role labeling. AMR parsing has attracted a lot of attention in recent years (Wang et al., 2015a; Zhou et al., 2016; Wang et al., 2015b; Goodman et al., 2016; Guo and Lu, 2018; Lyu and Titov, 2018; Vilares and Gómez-Rodríguez, 2018; Zhang et al., 2019).
We build upon a transition-based parser (Ballesteros and Al-Onaizan, 2017) that uses Stack-LSTMs (Dyer et al., 2015).
We augment training with self-critical policy learning (Rennie et al., 2017) using sentence-level Smatch scores as reward. This objective is particularly well suited for AMR parsing, since it overcomes the issues arising from the lack of token-level AMR-to-text alignments. In addition, we make several modifications inspired by neural machine translation (Bahdanau et al., 2014) and by recent trends in contextualized representations (Peters et al., 2018; Devlin et al., 2018).
Our contributions are: (1) Combining different alignment methods. There has been significant research in this direction (Flanigan et al., 2014; Pourdamghani et al., 2014; Chen, 2015; Chu and Kurohashi, 2016; Chen and Palmer, 2017; Szubert et al., 2018; Liu et al., 2018). In this paper, we show that combining different methods has a positive impact. We also combine hard alignments with an attention mechanism (Bahdanau et al., 2014). (2) Preprocessing named entities and concepts. (3) Incorporating contextualized vectors (with BERT) and comparing their effectiveness in detailed ablation experiments. (4) Employing a policy gradient training algorithm that uses Smatch as reward.

Stack-LSTM AMR Parser
We use the Stack-LSTM transition-based AMR parser of Ballesteros and Al-Onaizan (2017) (henceforth BO). BO follows the Stack-LSTM dependency parser of Dyer et al. (2015). This approach allows unbounded lookahead and makes use of greedy inference. BO also learns character-level word representations to capitalize on morphosyntactic regularities. BO uses recurrent neural networks to represent the stack data structures that underlie many linear-time parsing algorithms. It follows transition-based parsing algorithms (Yamada and Matsumoto, 2003; Nivre, 2003, 2008): words are read from a buffer and incrementally combined, in a stack, through a set of actions that produce the final parse. The input is a sentence and the output is a complete AMR graph, with no preprocessing required (footnote 1). We use DyNet (Neubig et al., 2017) to implement the parser. In what follows, we present several additions to the original BO model that improved the results.

Label Separation
BO's actions are enriched with labels that may correspond to AMR nodes or to labels that decorate the arcs of the graph. BO reported a total of 478 actions in the 2014 dataset. We split the prediction into two separate steps: first the action, then the label or concept. This reduces the number of actions to 10 and helps the model drive the search better.
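The two-step prediction can be illustrated with a minimal sketch. The action names, label inventories and raw scores below are hypothetical placeholders, not the parser's actual transition set; the sketch only shows the idea of choosing among a small set of actions first and then choosing a label or concept from a second softmax when the action requires one.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_two_step(action_scores, actions, label_scores_by_action, labels_by_action):
    """Pick the best action first, then a label only if that action takes one.

    `actions` and `labels_by_action` are illustrative (hypothetical) inventories.
    """
    action_probs = softmax(action_scores)
    best = max(range(len(actions)), key=lambda i: action_probs[i])
    action = actions[best]
    labels = labels_by_action.get(action)
    if not labels:
        return action, None  # unlabeled action, no second step needed
    label_probs = softmax(label_scores_by_action[action])
    best_label = max(range(len(labels)), key=lambda i: label_probs[i])
    return action, labels[best_label]

# Hypothetical inventories and scores, for illustration only.
ACTIONS = ["SHIFT", "CONFIRM", "LA"]
LABELS = {"CONFIRM": ["want-01", "boy"], "LA": [":ARG0", ":ARG1"]}
result = predict_two_step([0.1, 2.0, 0.3], ACTIONS,
                          {"CONFIRM": [1.5, 0.2], "LA": [0.0, 1.0]}, LABELS)
```

With a joint inventory, every action-label pair competes in one large softmax; in the factored version, the first softmax stays small regardless of how many concepts or arc labels exist.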

Hard Alignments and Soft Alignments
AMR annotations do not provide alignments between the nodes of an AMR graph and the tokens in the corresponding sentence. We need such alignments to generate action sequences with an oracle for training. The parser is then trained to generate these action sequences. The quality of word-to-graph alignments has a direct impact on the accuracy of the parser.
In previous work, both rule-based (Flanigan et al., 2014) and machine learning (Pourdamghani et al., 2014) methods have been used to produce word-to-graph alignments. Once generated, the alignments are often not updated during training (Flanigan et al., 2016; Damonte et al., 2016; Wang and Xue, 2017; Foland and Martin, 2017).
More recently, Lyu and Titov (2018) learn these alignments as latent variables.
In this work, we combine pre-learned (hard) alignments with an attention mechanism. As shown in section 4, the combination has a synergistic effect. In the following, we first explain our method for producing hard alignments and then we elaborate on the attention mechanism.
Hard Alignments Generation: To produce word-to-graph alignments, we combine the outputs of the symmetrized Expectation Maximization approach (SEM) of Pourdamghani et al. (2014) with those of the rule-based algorithm (JAMR) of Flanigan et al. (2014). Pourdamghani et al. (2014) do not produce alignments for all concepts; for example, named-entity nodes, date-entity nodes and numerical-quantity nodes are left unaligned. We post-process the output to deterministically align these nodes based on the alignments of their children (if any). We then merge the output with JAMR alignments. Overall, the alignment process involves the following steps:
1. Produce initial alignments using SEM.
2. Fill in the unaligned nodes by upwards percolation of child node alignments.
3. Use JAMR alignments for any nodes still unaligned and fill in intermediate nodes again.
Footnote 1: We refer interested readers to Ballesteros and Al-Onaizan (2017) for details.
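The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a hypothetical dictionary mapping each node to its children, each alignment maps a node to a single token index, and an unaligned parent inherits the smallest token index among its aligned children (the text does not specify this tie-breaking rule).

```python
def percolate_up(children, alignment):
    """Align each unaligned node using its aligned children.

    Assumption: the parent takes the smallest child token index.
    Repeats until no further node can be aligned.
    """
    aligned = dict(alignment)
    changed = True
    while changed:
        changed = False
        for node, kids in children.items():
            if node in aligned:
                continue
            kid_tokens = [aligned[k] for k in kids if k in aligned]
            if kid_tokens:
                aligned[node] = min(kid_tokens)
                changed = True
    return aligned

def merge_alignments(children, sem, jamr):
    """Steps 1-3: SEM first, percolate, JAMR fallback, percolate again."""
    merged = percolate_up(children, sem)      # steps 1 and 2
    for node, token in jamr.items():          # step 3: only fill the gaps
        merged.setdefault(node, token)
    return percolate_up(children, merged)     # fill intermediate nodes again

# Toy graph: node -> children (all names are hypothetical).
children = {"p": ["n", "c"], "n": ["op1"], "op1": [], "c": [], "w": ["p"]}
sem = {"op1": 3}
jamr = {"c": 5, "op1": 2}
merged = merge_alignments(children, sem, jamr)
```

In this toy case the SEM alignment of `op1` is kept even though JAMR proposes a different token, since JAMR is consulted only for nodes that remain unaligned after percolation.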

Soft Alignments via Attention:
The parser state is represented by the STACK, the BUFFER and a list with the history of actions, each encoded with an LSTM (the first two being Stack-LSTMs). Together these form the vector $s_t$ that represents the state (1). Given the state, $s_t$ is used to predict, with a softmax, the best action to take (and the concept to add, if applicable). We complement the state with attention over the input sentence (Bahdanau et al., 2014). In particular, we use general attention (Luong et al., 2015). To do so, we add a bidirectional LSTM encoder to the BO parsing model and run attention over it at each time step. More formally, the attention weights $\alpha_i$ (for position $i$) are calculated from the actions predicted so far (represented as $a_j$), the encoder representation of the sentence ($h_i$) and a projection weight matrix $W_a$. A vector representation ($c_j$) is then computed as the sum of the encoded word representations weighted by the $\alpha$ values.
Given the sentence representation produced by the attention mechanism ($c_j$), we complement the parser state by combining $c_j$ with $d_j$, where $d_j$ is the concatenation of the output vector of the action-history LSTM and the output vector of the LSTM that represents the stack. The resulting vector $s_t$ replaces the one described in (1). Those familiar with neural machine translation will recognize that we use the concatenation of the outputs of the LSTMs that represent the stack and the action history in the same way the decoder state is used in the standard sequence-to-sequence model with attention (Bahdanau et al., 2014).
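A minimal sketch of the attention step, assuming the bilinear ("general") score of Luong et al. (2015), in which the score for position i is the query vector multiplied by W_a and then by h_i, followed by a softmax. The function and variable names are illustrative, and the query stands in for the representation of the actions predicted so far; how the resulting context vector enters the parser state is not shown.

```python
import math

def general_attention(query, W_a, encoder_states):
    """Return (alphas, context) for one decoding step.

    query:          vector for the actions predicted so far (length q)
    W_a:            projection matrix, q rows by h columns
    encoder_states: list of encoder vectors h_i, each of length h
    """
    h_dim = len(W_a[0])
    # query^T W_a, computed once and reused for every position
    projected = [sum(query[r] * W_a[r][c] for r in range(len(query)))
                 for c in range(h_dim)]
    scores = [sum(p * x for p, x in zip(projected, h_i)) for h_i in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    # context vector: attention-weighted sum of the encoder states
    context = [sum(a * h_i[d] for a, h_i in zip(alphas, encoder_states))
               for d in range(h_dim)]
    return alphas, context

identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, context = general_attention([1.0, 0.0], identity, [[1.0, 0.0], [0.0, 1.0]])
```

Here the query matches the first encoder state, so the first position receives the larger attention weight.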

Preprocessed Nodes
We produce two types of pre-processed nodes: 1) Named Entity labels (NER) and 2) Concept labels (such as want-01, boy, etc.). We use NER labels and preprocessed concepts in the same way that BO and Dyer et al. (2015) used part-of-speech tags: as another vector concatenated to the word representation and learned during training.
Concepts: The AMR representation abstracts away from exact lexical forms. In the case of objects, the concepts are usually represented using the uninflected base forms; for events, the OntoNotes sense number is attached to the base form (such as want-01). We train a linear classifier that uses the contextualized BERT embedding (Devlin et al., 2018) of each word to predict the corresponding concept (which can be none). Each label is predicted in isolation, with no regard to the surrounding labels. The tagger is trained using a combination of OntoNotes 5.0 (LDC2013T19) and LDC2017T10 AMR training data.

Named entities:
We extracted named entities from the AMR dataset (there are more than 100 entity types in the AMR language) and trained a neural network NER model (Ni et al., 2017) to predict NER labels for the AMR parser. In the NER model, the target word and its surrounding words and tags are used as features. To train the AMR parser, we jackknifed the training data (90/10). The ten jackknifed models obtained an average NER F1 score of 79.48 on the NER dev set.
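The 90/10 jackknifing can be sketched as below. This is a generic illustration, assuming folds are assigned by example index modulo the number of folds; `train_fn` and `predict_fn` are hypothetical stand-ins for training and applying the NER model. Each training example is tagged by a model that never saw it, so the NER labels the parser sees in training are about as noisy as those it will see at test time.

```python
def jackknife(examples, train_fn, predict_fn, folds=10):
    """Tag every example with a model trained on the other folds."""
    predictions = [None] * len(examples)
    for k in range(folds):
        held_out = [i for i in range(len(examples)) if i % folds == k]
        held = set(held_out)
        train = [examples[i] for i in range(len(examples)) if i not in held]
        model = train_fn(train)               # trained on ~90% of the data
        for i in held_out:                    # applied to the remaining ~10%
            predictions[i] = predict_fn(model, examples[i])
    return predictions

# Toy check: the "model" is the set of examples it was trained on, and the
# "prediction" reports whether the held-out example was seen in training.
seen = jackknife(list(range(20)), train_fn=set, predict_fn=lambda m, x: x in m)
```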

Contextualized Vectors
Recent work has shown that the use of pretrained networks improves the performance of downstream tasks. BO uses pre-trained word embeddings along with learned character embeddings. In this work, we explore the effect of using contextualized word vectors as pre-trained word embeddings. We experiment with recent context-based embeddings obtained with BERT (Devlin et al., 2018).
We use the average of the last 4 layers of the BERT Large model, with hidden representations of size 1024. We produce each word representation by mean pooling the representations of its word piece tokens obtained using BERT. We only use the contextualized word vectors as input to our model; we do not back-propagate through the BERT layers.
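The pooling described above can be sketched with plain lists. The input layout is an assumption made for illustration: `layers` holds one list of word-piece vectors per BERT layer, and `piece_to_word` gives, for each word piece, the index of the word it belongs to. No BERT library call is shown.

```python
def word_vectors(layers, piece_to_word, num_layers=4):
    """Average the last `num_layers` layers, then mean-pool pieces per word."""
    last = layers[-num_layers:]
    dim = len(last[0][0])
    n_pieces = len(piece_to_word)
    # Layer average for each word piece.
    piece_vecs = [[sum(layer[p][d] for layer in last) / len(last)
                   for d in range(dim)]
                  for p in range(n_pieces)]
    # Mean pooling over the pieces of each word
    # (assumes every word has at least one piece).
    n_words = max(piece_to_word) + 1
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for p, w in enumerate(piece_to_word):
        counts[w] += 1
        for d in range(dim):
            sums[w][d] += piece_vecs[p][d]
    return [[s / counts[w] for s in sums[w]] for w in range(n_words)]

# Toy input: 5 layers, 3 word pieces, 2 dimensions; the vector of piece p in
# layer k is [k, p]. Pieces 1 and 2 belong to the same word.
toy_layers = [[[float(k), float(p)] for p in range(3)] for k in range(5)]
vectors = word_vectors(toy_layers, piece_to_word=[0, 1, 1])
```

Since the parser does not back-propagate through BERT, such vectors can be computed once per sentence and cached.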

Wikification
Given that BO does not produce Wikipedia nodes during prediction, we pre-process the AMR data by removing all Wikipedia nodes. To produce Wikipedia entries in our AMR graphs, we run a wikification approach as post-processing. We combine the approach of Lyu and Titov (2018) with the entity linking technique of Sil et al. (2018).
First, we produce a dictionary of Wikipedia links for all the named entity nodes that appear with a :wiki label in the training data. If a node appears with multiple Wikipedia links, the most frequent one is added to the dictionary. Separately, we also process the target sentence using the entity linking system of Sil et al. (2018). This system both identifies the entities and links them.
During post-processing, every node with a :name label is looked up in the dictionary and, if found, is assigned the corresponding Wikipedia link. This is very similar to the approach of Lyu and Titov (2018). If the node is not found in the dictionary and the system of Sil et al. (2018) produces a Wikipedia link, we use that link.
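The lookup order can be sketched as follows. This is an illustrative simplification: named entities are keyed by a plain name string, and the output of the entity linking system is represented by a hypothetical dictionary `linker_output`; the real system's interface is not described in the text.

```python
from collections import Counter, defaultdict

def build_wiki_dictionary(training_pairs):
    """Map each entity name to its most frequent :wiki link in training."""
    counts = defaultdict(Counter)
    for name, link in training_pairs:
        counts[name][link] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}

def wikify(name, wiki_dict, linker_output):
    """Dictionary first; fall back to the entity linker; else no link."""
    if name in wiki_dict:
        return wiki_dict[name]
    return linker_output.get(name)  # None when the linker found nothing

pairs = [("Obama", "Barack_Obama"), ("Obama", "Barack_Obama"),
         ("Obama", "Michelle_Obama"), ("Paris", "Paris")]
wiki_dict = build_wiki_dictionary(pairs)
```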

Smatch Weighting
The upper bound for BO's oracle is only 93.3 F1 on the entire development set. We observed that the oracle produces a near-perfect score for most sentences, yet it loses some points on others. During training, we have the gold AMR graph available for every sentence. We compare it to the oracle graph and use the resulting Smatch score as a weight for the training example. This down-weights the examples whose oracle action sequence is incomplete or erroneous. This modification resulted in moderate gains (see row 14 in Table 1) and also led to the training-with-exploration experiments described below.
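One way to read Smatch weighting is as a per-example scaling of the maximum-likelihood loss. The sketch below assumes the weight multiplies the negative log-likelihood of the oracle action sequence; the text states only that the Smatch score is used as the example weight, so the exact form is an assumption.

```python
import math

def weighted_nll(oracle_action_probs, oracle_smatch):
    """Negative log-likelihood of the oracle actions, scaled by the Smatch
    score (between 0 and 1) of the oracle graph against the gold graph."""
    nll = -sum(math.log(p) for p in oracle_action_probs)
    return oracle_smatch * nll

# Same model probabilities, two different oracle qualities.
full = weighted_nll([0.5, 0.5], oracle_smatch=1.0)
half = weighted_nll([0.5, 0.5], oracle_smatch=0.5)
```

A sentence whose oracle graph matches the gold graph exactly contributes its full loss, while a sentence with a poor oracle contributes proportionally less.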

Reinforcement Learning
BO relies on the oracle action sequences. The training objective is to maximize the likelihood of oracle actions. This strategy has two drawbacks. First, the alignments between the tokens and the graph nodes are inaccurate or incomplete. (As mentioned above, the oracle upper bound is only 93.3 F1. With the enhanced alignments, BO reported 89.5 F1 in the LDC2014 development set.) Second, even for perfectly aligned sentences, the oracle action sequence is not the only or the best action sequence that can lead to the gold graph; there could be shorter sequences that are easier to learn. Therefore, strictly binding the training objective to the oracle action sequences can lead to sub-optimal performance, as evidenced in (Daumé III and Marcu, 2005; Daumé III et al., 2009; Nivre, 2012, 2013; Ballesteros et al., 2016) among others.
To circumvent these issues, we resort to a Reinforcement Learning (RL) objective where the Smatch score of the predicted graph for a given sentence is used as reward. This alleviates the strong dependency on hard alignments and leaves room for training with exploration of the action space. This line of work is also motivated by Goodman et al. (2016), who used imitation learning to build AMR parses from dependency trees.
We use the self-critical policy gradient training algorithm of Rennie et al. (2017), which is a special case of the REINFORCE algorithm of Williams (1992) with a baseline. This method allows the use of an external evaluation measure as reward (Paulus et al., 2017). In particular, we want to maximize the expected Smatch reward,
$$L_{RL} = \mathbb{E}_{g^s \sim p_\theta}[r(g^s)] \qquad (8)$$
where $p_\theta$ is the policy specified by the parser parameters $\theta$ and $g^s$ is a graph sampled from $p_\theta$. The gradient of this objective can be approximated using a single sample from $p_\theta$. For each sentence, we produce two graphs using the current model parameters: a greedy best graph $\hat{g}$ and a graph $g^s$ produced by sampling from the action space. The gradient of (8) is approximated as in Rennie et al. (2017),
$$\nabla_\theta L_{RL} = (r(g^s) - r(\hat{g})) \nabla_\theta \log p_\theta(g^s) \qquad (9)$$
where $r(g)$ is the Smatch score of graph $g$ with respect to the ground truth. The Smatch of the greedy graph, $r(\hat{g})$, serves as a baseline that can reduce the variance in the gradient estimate (Williams, 1992).
With probability ε, we flatten the sampling distribution by taking the square root of the probabilities. In our experiments, ε is set to 0.05. We first train our full model with the maximum-likelihood objective of BO, which achieves an F-score of 72.8 without beam search when evaluated on the development set. The RL training is then initialized with the parameters of this trained model. For RL training, we use a batch size of 40.
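The sampling and the self-critical update can be sketched as below. Several details are assumptions made for illustration: the flattening decision is taken independently at each action step, the square-rooted probabilities are renormalized before sampling, and the update is expressed as a surrogate loss whose minimization follows the gradient direction in (9). Computing Smatch and running the parser are outside the scope of the sketch.

```python
import math
import random

def flatten(probs):
    """Square-root the probabilities and renormalize (assumed) to sum to 1."""
    roots = [math.sqrt(p) for p in probs]
    z = sum(roots)
    return [r / z for r in roots]

def sample_action(probs, epsilon=0.05, rng=random):
    """Sample an action index, flattening the distribution with prob. epsilon."""
    dist = flatten(probs) if rng.random() < epsilon else probs
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def self_critical_loss(sampled_log_probs, reward_sampled, reward_greedy):
    """Surrogate loss: minimizing it moves parameters along (9).

    sampled_log_probs: log-probabilities of the sampled actions
    reward_sampled:    Smatch of the sampled graph
    reward_greedy:     Smatch of the greedy graph (the baseline)
    """
    advantage = reward_sampled - reward_greedy
    return -advantage * sum(sampled_log_probs)

flat = flatten([0.8, 0.2])
loss = self_critical_loss([math.log(0.5), math.log(0.5)], 0.7, 0.6)
choice = sample_action([0.8, 0.2], rng=random.Random(0))
```

When the sampled graph scores higher than the greedy graph, the advantage is positive and the update raises the probability of the sampled actions; when it scores lower, the update lowers it.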

Experiments and Results
We start by reimplementing BO and we train models with the most recent dataset (LDC2017T10). We include label separation in our reimplementation (Experiments 1 to 16), which separates the prediction of actions and labels into two different softmax layers. All our experiments use beam 10 for decoding, and each reported model is the best (when evaluated on the development set) of 5 different random seeds. Word, input and hidden representations have 100 dimensions (with BERT, input dimensions are 1024); action and label embeddings are of size 20. Our results are presented in Table 1.
We achieve the best results ever reported on some of the metrics: our best model (16) improves Unlabeled Smatch by 1 point and SRL by 2 points. These two metrics represent the structure and semantic parsing task. For all the remaining metrics, our parser consistently achieves the second best results. Also, our best single model (16) improves on BO (0) by more than 9 Smatch points. The parser of Guo and Lu (2018) is a reimplementation of BO with a refined search space (which we did not attempt), and we beat its performance by 5 points.
The hard alignments proposed in this paper present a clear advantage over the JAMR alignments. BO ignores nodes that are not aligned to tokens in the sentence, so it benefits from a more recall-oriented alignment method. Adding attention on top of that adds a point, while preprocessing named entities improves the NER metric. Adding concepts preprocessed with our BERT-based tagger adds more than a point. Smatch weighting (14) led to a gain of half a point on top of (13).
BERT contextualized vectors provide more than a point on top of the best model with traditional word embeddings (without attention, the difference is 2 points). Combining BERT with a model that only sees words (15), we achieve the best results, surpassed only by models that also use contextualized vectors and a reinforcement learning objective. However, we added Smatch weighting (14) and Reinforcement Learning (16) on top of (13). This was decided based on development data results, where (13) performed better than the BERT-only model (15) by about a point.
Finally, training with exploration via reinforcement learning gives further gains of about 2 points and achieves one of the best results ever reported on the task and state of the art in some of the metrics.

Conclusions
We report modifications to a competitive AMR parser that achieve one of the best results on the task. Our main contribution augments training with Policy Learning by priming samples that are more suitable for the evaluation metric. We perform an in-depth ablation experiment that shows the impact of each of our contributions. Our unlabeled Smatch score (the best graph structure reported) suggests that a new strategy to predict labels may reach even higher numbers.