M2L at SemEval-2016 Task 8: AMR Parsing with Neural Networks

This paper describes our contribution to the SemEval 2016 Workshop. We participated in the Shared Task 8 on Meaning Representation parsing using a transition-based approach, which builds upon the system of Wang et al. (2015a) and Wang et al. (2015b), with additions that utilize a Feedforward Neural Network classifier and an enriched feature set. We observed that exploiting Neural Networks in Abstract Meaning Representation parsing is challenging and we could not benefit from it, while the feature enhancements yielded an improved performance over the baseline model.


Introduction
Abstract Meaning Representation (AMR) (Banarescu et al., 2013; Dorr et al., 1998) is a semantic formalism which represents sentence meaning in the form of a rooted directed acyclic graph. AMR graph nodes represent concepts, and labelled directed edges between the nodes show the relationships between concepts. The AMR formalism was created in order to explore the semantics behind natural language units for further analysis and application in various tasks.
At the time of writing, two AMR parsers are publicly available: the graph-based JAMR (Flanigan et al., 2014) and the transition-based CAMR (Wang et al., 2015a; Wang et al., 2015b). The latter served as our baseline model, which we tried to improve by incorporating a neural network (NN) classifier and additional features defined for a wider conditioning context.
Inspired by the results of Chen and Manning (2014) and Weiss et al. (2015), who obtained state-of-the-art results in transition-based dependency parsing using Feedforward Neural Networks (FFNN), and taking into account the transition-based nature of the CAMR model, we performed a series of experiments in the same direction. Neural networks have been successfully applied to many NLP fields and we were curious to examine their potential in the task of AMR parsing. Specifically, we investigated the possibility of constraining the predictions of the averaged perceptron algorithm (Collins, 2002) with those of an FFNN at the initial step of the inference process.

Preprocessing
We used the Stanford CoreNLP v3.6.0 toolkit to obtain named entity (NE) and dependency information. The latter, in the form of Stanford dependencies, was obtained from the NN dependency parser (Chen and Manning, 2014).
We used a publicly available semantic role labelling (SRL) system with a predicate disambiguation module (Björkelund et al., 2009). The system is a part of MATE tools, which also include a lemmatizer, a part-of-speech (POS) tagger, and a dependency parser. We used them to obtain lemmas and POS tags. All the tools were used with the pretrained models.
Using MALT dependencies instead of or in tandem with Stanford dependencies did not change the overall performance of the AMR parser. Due to time limitations, we did not perform a full analysis of the accuracy of both tools, which would be an interesting and important point to investigate further. All information extracted during the preprocessing step was combined into one CoNLL-format file (Hajič et al., 2009). Finally, we used the AMR graph-to-sentence aligner of Flanigan et al. (2014) to map word spans to concept fragments in the AMR graph.

Parsing Algorithm
We use the same set of transitions and the same parsing algorithm as Wang et al. (2015b). We skip the full description due to space limitations and refer the reader to the original paper. The parser forms a quadruple $Q = (S, T, s_0, S_t)$, where $S$ is a set of parsing states (or configurations), $T$ is a set of parsing transitions (or actions), $s_0$ is the initial state and $S_t$ is a set of terminal states of the parser. Each state is a triple $(\sigma, \beta, G)$, where $\sigma$ denotes a stack storing the indices of the nodes which have not been processed yet; its top element is $\sigma_0$. $\beta$ is a buffer storing the children of $\sigma_0$. Finally, $G$ is a partially built AMR graph aligned with the sentence tokens.
At the beginning of the parsing procedure, $\sigma$ is initialized with a post-order traversal of the input dependency tree with topmost element $\sigma_0$; $\beta$ is initialized with the children of $\sigma_0$ or set to null if $\sigma_0$ is a leaf node. $G$ is initialized with the nodes and edges of the dependency tree, but all node and edge labels are set to null. The parser processes all nodes and their outgoing edges in the tree in a bottom-up, left-to-right manner, applying a transition to the current node or edge. Parsing terminates when both $\sigma$ and $\beta$ are empty.
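The initialization described above can be sketched as follows. This is a minimal illustration and not the CAMR implementation: the names `post_order`, `ParserState` and `initial_state` are hypothetical, and the dependency tree is assumed to be given as a mapping from each head index to an ordered list of its children.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


def post_order(children: Dict[int, List[int]], root: int) -> List[int]:
    """Return node indices in post-order (children before their parent)."""
    order: List[int] = []
    for child in children.get(root, []):
        order.extend(post_order(children, child))
    order.append(root)
    return order


@dataclass
class ParserState:
    sigma: List[int]                                   # unprocessed nodes; sigma[0] is sigma_0
    beta: Optional[List[int]]                          # children of sigma_0, or None for a leaf
    node_labels: Dict[int, Optional[str]]              # concept labels of G, initially null
    edge_labels: Dict[Tuple[int, int], Optional[str]]  # edge labels of G, initially null

    def is_terminal(self) -> bool:
        """Parsing terminates when both sigma and beta are empty."""
        return not self.sigma and not self.beta


def initial_state(children: Dict[int, List[int]], root: int) -> ParserState:
    """Build the initial state from a dependency tree (hypothetical helper)."""
    sigma = post_order(children, root)
    first_children = children.get(sigma[0], [])
    beta = list(first_children) if first_children else None
    node_labels = {node: None for node in sigma}
    edge_labels = {
        (head, child): None
        for head, dependents in children.items()
        for child in dependents
    }
    return ParserState(sigma, beta, node_labels, edge_labels)
```

For a tree whose root 2 has children 1 and 4, with 0 under 1 and 3 under 4, the stack starts at the leftmost leaf, so the buffer is initially null.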
There are nine basic transitions (NEXT-EDGE, SWAP, REATTACH, REPLACE-HEAD, REENTRANCE, MERGE, NEXT-NODE, DELETE-NODE, INFER), some of which result in assigning either a concept label or an edge label. The sets of concept labels for aligned nodes, concept labels for unaligned nodes, and edge labels ($S_{tag}^{aligned}$, $S_{tag}^{unaligned}$, $S_{edge}$, respectively) are constructed during the preprocessing stage and are later used to provide candidate concept tags or edge labels for the respective transitions. Let us also introduce the set $S_{total} = S_{tag}^{aligned} \cup S_{tag}^{unaligned} \cup S_{edge}$. $|S_{total}|$ determines the number of classes for our classification algorithm to choose from.
We propose a parsing strategy which first uses an NN classifier to constrain the space of candidate transitions by choosing an unlabelled version of a transition, and then forces the perceptron algorithm to make a prediction only on the label of the chosen transition. We also tried to replace the averaged perceptron algorithm with an NN entirely, experimenting with various NN types and architectures, but the experimental results were unsatisfactory. This question is yet to be analysed in full, but presumably we failed to provide our NN classifiers with a sufficiently rich input representation: we used a very simple technique of concatenating embedding vectors (Section 4), which apparently did not capture enough information for an NN to make accurate predictions.
Table 1 (caption fragment): [...] is the head of $\sigma_0$ in the dependency tree; $\beta_0^{head+}$ is the node to which $\beta_0$ could be attached as a result of either a reentrance or a reattachment transition; t is a transition under consideration: during feature extraction we consider the tag which might be assigned as a result of the transition (t.tag) or whether this candidate is a verb sense tag (t.tag is verb). The isnom feature checks whether the lemma of a corresponding element is in the NomBank dictionary. nech is the number of an element's children which have an NE label from the set {"PERSON", "LOCATION", "ORGANIZATION", "MISC"}. Other features are self-explanatory.
Each of them is mapped to the same $D$-dimensional vector space ($e_i^w, e_i^l, e_i^p, e_i^n, e_i^d \in \mathbb{R}^D$). We create one embedding matrix $E \in \mathbb{R}^{|V| \times D}$, where $|V|$ is the total size of the vocabulary and $D$ is the dimensionality of dense embedding vectors.
In each parsing configuration we consider a set of elements which might be useful in the prediction task. These elements are $\sigma_0$, $\beta_0$ and their neighbouring nodes, which for $\sigma_0$ include $\sigma_0^{p}$ (the parent of $\sigma_0$ in the dependency tree), $\sigma_0^{lsb}$ (the left sibling), $\sigma_0^{rsb}$, $\sigma_0^{rsb2}$ (the first and the second right siblings), $\sigma_0^{prs1}$, $\sigma_0^{prs2}$ (the first and the second previous tokens in the sentence); neighbouring nodes for $\beta_0$ are defined in a similar manner. Thus, the total number of relevant elements is $n_{elem} = 14$.
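The element count can be checked with a short sketch: each of the two anchor nodes contributes itself plus six neighbours, giving 2 × 7 = 14. The function names and the tree encoding below are hypothetical, and the code assumes that node indices coincide with token positions, so that "previous token" is simply the index minus one or two.

```python
from typing import Dict, List, Optional

# Neighbour relations named in the text: parent, left sibling, first and
# second right siblings, first and second previous tokens.
RELATIONS = ("p", "lsb", "rsb", "rsb2", "prs1", "prs2")


def neighbours(node: Optional[int],
               heads: Dict[int, Optional[int]],
               children: Dict[int, List[int]]) -> Dict[str, Optional[int]]:
    """Look up the six neighbours of a node; missing neighbours are None."""
    if node is None:
        return {rel: None for rel in RELATIONS}
    parent = heads.get(node)
    siblings = children.get(parent, []) if parent is not None else []
    pos = siblings.index(node) if node in siblings else None

    def sibling(offset: int) -> Optional[int]:
        if pos is None:
            return None
        j = pos + offset
        return siblings[j] if 0 <= j < len(siblings) else None

    def previous(k: int) -> Optional[int]:
        # Assumption: node indices are token positions in the sentence.
        return node - k if node - k >= 0 else None

    return {
        "p": parent,
        "lsb": sibling(-1),
        "rsb": sibling(1),
        "rsb2": sibling(2),
        "prs1": previous(1),
        "prs2": previous(2),
    }


def context_elements(sigma0: Optional[int],
                     beta0: Optional[int],
                     heads: Dict[int, Optional[int]],
                     children: Dict[int, List[int]]) -> Dict[str, Optional[int]]:
    """Collect sigma_0, beta_0 and their neighbours (14 slots in total)."""
    elements: Dict[str, Optional[int]] = {}
    for name, node in (("sigma0", sigma0), ("beta0", beta0)):
        elements[name] = node
        for rel, value in neighbours(node, heads, children).items():
            elements[f"{name}.{rel}"] = value
    return elements
```

Slots whose neighbour does not exist are kept as `None` so that the input always has the same number of positions; how the original system pads such slots is not stated in the text.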
The overall architecture of the network is depicted in Figure 1. The input layer of the network consists of two components. The first one is formed by concatenating all the corresponding embedding vectors for each element's feature (Figure 1 (1)). Each of the $x$ components is, in turn, a concatenation of the embeddings of the configuration elements for a particular type of annotation. For example, $x^w$ is a vector in $\mathbb{R}^N$, where $N = D \times n_{elem}$. We form $x^l, x^p, x^n, x^d$ in a similar manner and concatenate $x^w, x^l, x^p, x^n, x^d$ to form the input vector $h_0^E$. The second component of the input vector is $h_0^I$ (Figure 1 (2)), a concatenation of vectors representing non-embedding numerical features (see Table 1).
We separately map the two parts of the input layer to hidden layers using the tanh activation function: $h_0^E$ to $h_1^E$, $h_0^I$ to $h_1^I$ (Figure 1 (3) and (4)). We then concatenate layers $h_1^E$ and $h_1^I$ (Figure 1 (5)) and pass the resultant vector to the last hidden layer $h_3$, applying the tanh function again (Figure 1 (6)). Finally, a softmax layer is added on top of $h_3$ in order to calculate probabilities of the output classes (Figure 1 (7)).
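The forward pass described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the hidden layer sizes (16, 8, 16) and the number of numerical features (10) are hypothetical, since the text does not give them. The embedding size of 32, the bias value of 0.02, the Glorot initialization and the nine output classes follow the paper; five annotation types are assumed from the five $x$ components.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]
Layer = Tuple[Matrix, Vector]


def glorot_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Normalized initialization (Glorot and Bengio, 2010), uniform variant."""
    limit = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-limit, limit) for _ in range(cols)] for _ in range(rows)]


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Weights with Glorot initialization; biases set to 0.02 as in the paper."""
    return glorot_matrix(n_out, n_in, rng), [0.02] * n_out


def affine(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Compute W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def tanh_layer(W: Matrix, b: Vector, x: Vector) -> Vector:
    return [math.tanh(z) for z in affine(W, b, x)]


def softmax(z: Vector) -> Vector:
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(h_e0: Vector, h_i0: Vector, layers: Dict[str, Layer]) -> Vector:
    h_e1 = tanh_layer(*layers["E"], h_e0)       # Figure 1 (3)
    h_i1 = tanh_layer(*layers["I"], h_i0)       # Figure 1 (4)
    joined = h_e1 + h_i1                        # Figure 1 (5): concatenation
    h_3 = tanh_layer(*layers["H3"], joined)     # Figure 1 (6)
    return softmax(affine(*layers["out"], h_3))  # Figure 1 (7)


rng = random.Random(0)
D, N_ELEM, N_TYPES = 32, 14, 5          # embedding size, elements, annotation types
EMB_IN = D * N_ELEM * N_TYPES           # length of h_0^E: 5 blocks of N = D * n_elem
NUM_IN, H_E, H_I, H_3, N_CLASSES = 10, 16, 8, 16, 9  # first four are hypothetical

layers = {
    "E": make_layer(EMB_IN, H_E, rng),
    "I": make_layer(NUM_IN, H_I, rng),
    "H3": make_layer(H_E + H_I, H_3, rng),
    "out": make_layer(H_3, N_CLASSES, rng),
}
# Random stand-ins for the concatenated embeddings and numerical features.
h_e0 = [rng.uniform(-0.01, 0.01) for _ in range(EMB_IN)]
h_i0 = [float(rng.randint(0, 3)) for _ in range(NUM_IN)]
probs = forward(h_e0, h_i0, layers)
```

With the paper's settings, each $x$ component has 32 × 14 = 448 entries, so the embedding part of the input has 5 × 448 = 2240 entries under the five-type assumption.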

Feature Sets
We designed two separate feature sets for the NN and perceptron classifiers. The feature set for the latter is roughly the same as in Wang et al. (2015b) (Table 1). Following the authors of CAMR, we also make use of the NomBank 1.0 dictionary (Meyers et al., 2004). Unfortunately, we could not obtain a copy of the SRL system used by the authors. Therefore, we also measure the accuracy improvement from incorporating the semantic features defined in the original paper but extracted after processing the data with a different SRL system (marked with a ).
We also measure the improvement from incorporating the features extracted from a wider configuration context; they were not included in the baseline model and are marked with a •.
In the case of the NN classifier we follow a standard feature extraction procedure and discard transition-specific features. Apart from the embedding features, we also include a number of numerical features, which proved to be useful in our experiments.

Training Procedure and Parsing Policy
We trained the perceptron with the weight-averaging procedure described by Collins (2002).
For the NN classifier we first prepared a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ denotes a feature representation of configuration $i$ and $y_i$ is the unlabelled version of the correct transition. The training objective is to minimize the L2-regularized negative log-likelihood of the model. We randomly initialized the embeddings within (−0.01, 0.01) and fixed their dimensionality at 32. All weight matrices were initialized using the normalized initialization technique of Glorot and Bengio (2010). Hidden layer sizes were fixed as follows: [...] All the biases were set to 0.02.
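The objective itself is not written out in this text. One common form of an L2-regularized negative log-likelihood, given here as an assumption rather than as the exact equation used, is shown below, where $\theta$ collects all trainable parameters and $\lambda$ is the regularization strength. The factor 1/2 and the absence of averaging over the training set are conventions chosen for this sketch.

```latex
L(\theta) = -\sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) + \frac{\lambda}{2} \lVert \theta \rVert_2^2
```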
We trained our network using mini-batches of size 200 under early stopping, with RMSprop (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) as the optimization algorithm. A learning rate of $5 \times 10^{-4}$ and $\lambda = 1.5 \times 10^{-5}$ were found to perform best on the validation set.
Given a feature representation of the current configuration $s$, the parsing algorithm first provides a pool of legal transitions $T \subseteq S_{total}$, $|T| \ll |S_{total}|$. We compute the probabilities of the nine unlabelled transitions using the NN. If the network is confident about its prediction (we set an empirically chosen probability threshold of 0.9), we choose the highest-scoring candidate and, if it is a transition which assigns a concept tag or an edge label, force the perceptron to predict its label. Each candidate $t \in T$ is scored by a linear scoring function $score(t, s) = \omega \cdot \phi(t, s)$, where $\omega$ denotes a weight vector for a particular candidate transition and $\phi(t, s)$ is a feature function mapping a $(t, s)$ pair to a real-valued feature vector. The best-scoring transition is chosen and applied to the configuration. If the NN chooses a transition which does not assign labels, we apply this transition without asking for the perceptron prediction. Finally, if the network is not confident about the prediction, that is, if the probability of the highest-scoring candidate is lower than 0.9, we disregard the NN prediction and choose the prediction given by the perceptron algorithm.
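The parsing policy above can be sketched as a small decision function. The names are hypothetical, the perceptron is abstracted as a scoring callable, and the fallback used when the NN's preferred transition is not among the legal ones is an assumption, as the text does not cover that case.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A candidate transition: (unlabelled action, concept tag or edge label, if any).
Transition = Tuple[str, Optional[str]]


def choose_transition(legal: List[Transition],
                      nn_probs: Dict[str, float],
                      perceptron_score: Callable[[Transition], float],
                      threshold: float = 0.9) -> Transition:
    """Pick a transition using the NN as a gate in front of the perceptron."""
    best_action = max(nn_probs, key=lambda action: nn_probs[action])
    if nn_probs[best_action] >= threshold:
        # Confident NN: keep only candidates of the predicted unlabelled action.
        candidates = [t for t in legal if t[0] == best_action]
        if len(candidates) == 1:
            # Nothing to label (or a single option): apply it directly.
            return candidates[0]
        if candidates:
            # The perceptron decides only among the labels of this action.
            return max(candidates, key=perceptron_score)
        # Assumption: if the predicted action is not legal, fall through.
    # NN not confident: the perceptron chooses among all legal transitions.
    return max(legal, key=perceptron_score)


# Toy example with made-up scores standing in for w . phi(t, s).
legal = [("NEXT-NODE", "want-01"), ("NEXT-NODE", "wish-01"), ("DELETE-NODE", None)]
toy_scores = {legal[0]: 1.5, legal[1]: 0.3, legal[2]: 2.0}
confident = choose_transition(
    legal, {"NEXT-NODE": 0.95, "DELETE-NODE": 0.05}, toy_scores.__getitem__)
unsure = choose_transition(
    legal, {"NEXT-NODE": 0.60, "DELETE-NODE": 0.40}, toy_scores.__getitem__)
```

In the toy example the perceptron alone would prefer DELETE-NODE, but a confident NN prediction of NEXT-NODE restricts it to choosing between the two concept labels.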

Experiments
All the experiments were performed on the LDC2015E86 dataset, provided by the organizers. In our experiments we followed the standard train/dev/test split (16,833, 1,368 and 1,371 sentences, respectively). Parser performance was evaluated with the Smatch scoring script v2.0.2 (Table 2).
As expected, using SRL features resulted in better performance compared to the baseline model (a gain of roughly 2 F1 points). Conditioning on a wider context was also beneficial: widening the context to include more configuration elements is often a good feature expansion technique (Toutanova et al., 2003). Contrary to our expectations, the NN classifier did not improve the parser performance. This might be due to the higher complexity of the AMR parsing task or the peculiarities of the underlying parsing algorithm (as mentioned in Section 3, we discarded some action-specific features due to the difficulty of their integration into the NN model). Further investigation of this matter is required to draw well-grounded conclusions.

Conclusion
We have performed a range of experiments which improved the performance of the baseline AMR parsing system. The results show that a richer feature set is very likely to lead to more accurate predictions. Unfortunately, our attempts to further improve the system using NNs were not as successful. This goes against the hypothesis that a small number of dense vector embedding features is sufficient to capture the information necessary for accurate inference, which in traditional approaches is achieved by using a large number of sparse hand-crafted features (Chen and Manning, 2014). The obtained results will be used in our further investigation of this matter.