A Transition-Based Algorithm for Unrestricted AMR Parsing

Non-projective parsing can be useful to handle cycles and reentrancy in AMR graphs. We explore this idea and introduce a greedy left-to-right non-projective transition-based parser. At each parsing configuration, an oracle decides whether to create a concept or whether to connect a pair of existing concepts. The algorithm handles reentrancy and arbitrary cycles natively, i.e. within the transition system itself. The model is evaluated on the LDC2015E86 corpus, obtaining results close to the state of the art, including a Smatch of 64%, and showing good behavior on reentrant edges.


Introduction
Abstract Meaning Representation (AMR) is a semantic representation language to map the meaning of English sentences into directed, cycled, labeled graphs (Banarescu et al., 2013). Graph vertices are concepts inferred from words. The concepts can be represented by the words themselves (e.g. dog), PropBank framesets (Palmer et al., 2005) (e.g. eat-01), or keywords (like named entities or quantities). The edges denote relations between pairs of concepts (e.g. eat-01 :ARG0 dog). AMR parsing integrates tasks that have usually been addressed separately in natural language processing (NLP), such as named entity recognition (Nadeau and Sekine, 2007), semantic role labeling (Palmer et al., 2010) or co-reference resolution (Ng and Cardie, 2002; Lee et al., 2017). Figure 1 shows an example of an AMR graph.
Several transition-based dependency parsing algorithms have been extended to generate AMR. Wang et al. (2015) describe a two-stage model, where they first obtain the dependency parse of a sentence and then transform it into a graph. Damonte et al. (2017) propose a variant of the ARC-EAGER algorithm to identify labeled edges between concepts. These concepts are identified using a lookup table and a set of rules. A restricted subset of reentrant edges is supported by an additional classifier. A similar configuration is used in (Gildea et al., 2018; Peng et al., 2018), but relying on a cache data structure to handle reentrancy, cycles and restricted non-projectivity.
A feed-forward network and additional hooks are used to build the concepts. Ballesteros and Al-Onaizan (2017) use a modified ARC-STANDARD algorithm, where the oracle is trained using stack-LSTMs (Dyer et al., 2015). Reentrancy is handled through SWAP (Nivre, 2009) and they define additional transitions intended to detect concepts, entities and polarity nodes.
This paper explores unrestricted non-projective AMR parsing and introduces AMR-COVINGTON, inspired by Covington (2001). It handles arbitrary non-projectivity, cycles and reentrancy in a natural way, as there is no need for specific transitions, but just the removal of restrictions from the original algorithm. The algorithm has full coverage and keeps transitions simple, which is a matter of concern in recent studies (Peng et al., 2018).
Notation We use typewriter font for concepts and their indexes (e.g. dog or 1), regular font for raw words (e.g. dog or 1), and a bold style font for vectors and matrices (e.g. v, W).
Covington (2001) describes a fundamental algorithm for unrestricted non-projective dependency parsing. The algorithm can be implemented as a left-to-right transition system (Nivre, 2008). The key idea is intuitive. Given a word to be processed at a particular state, the word is compared against the words that have previously been processed, deciding whether or not to establish a syntactic dependency arc from/to each of them. The process continues until all previous words are checked or until the algorithm decides that no more connections with previous words need to be built; then the next word is processed. The runtime is $O(n^2)$ in the worst case. To guarantee the single-head and acyclicity conditions that are required in dependency parsing, explicit tests are added to the algorithm to check for transitions that would break the constraints. These are then disallowed, making the implementation less straightforward.
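The pairwise comparison loop described above can be sketched in a few lines of Python. This is only an illustration of the unconstrained scheme (no single-head or acyclicity tests); the function name `covington_parse`, the `decide` oracle and its return values ("left", "right", "shift", None) are hypothetical names chosen for this sketch and do not come from the paper.

```python
from typing import Callable, Optional, Set, Tuple

Arc = Tuple[int, int]  # (head, dependent), both are word indexes


def covington_parse(n_words: int,
                    decide: Callable[[int, int], Optional[str]]) -> Set[Arc]:
    """Compare each word j with every earlier word i, nearest first.

    `decide(i, j)` is a hypothetical oracle that returns:
      "left"  -> create the arc j -> i (j is the head)
      "right" -> create the arc i -> j (i is the head)
      "shift" -> stop comparing j with earlier words
      None    -> no arc between i and j
    No single-head or acyclicity test is applied.
    """
    arcs: Set[Arc] = set()
    for j in range(1, n_words):
        for i in range(j - 1, -1, -1):
            action = decide(i, j)
            if action == "shift":
                break
            if action == "left":
                arcs.add((j, i))
            elif action == "right":
                arcs.add((i, j))
    return arcs


# Toy oracle: every word is headed by the word immediately before it.
print(covington_parse(4, lambda i, j: "right" if j == i + 1 else None))
```

With an oracle that never answers "shift", a sentence of n words triggers n(n-1)/2 comparisons, which is where the quadratic worst case comes from.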

The AMR-Covington algorithm
The acyclicity and single-head constraints are not needed in AMR, as arbitrary graphs are allowed. Cycles and reentrancy are used to model semantic relations between concepts (as shown in Figure 1) and to identify co-references. By removing the constraints from the Covington transition system, we achieve a natural way to deal with them (see footnote 1).
Also, AMR parsing requires words to be transformed into concepts. Dependency parsing operates on a constant-length sequence, but in AMR, words can be removed, generate a single concept, or generate several concepts. In this paper, additional lookup tables and transitions are defined to create concepts when needed, following the current trend (Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017; Gildea et al., 2018).

Formalization
Let $G = (V, E)$ be an edge-labeled directed graph where $V = \{0, 1, 2, \ldots, M\}$ is the set of concepts and $E \subseteq V \times L \times V$ is the set of labeled edges, with $L$ the set of semantic labels. We will denote a connection between a head concept $i \in V$ and a dependent concept $j \in V$ as $i \xrightarrow{l} j$, where $l$ is the semantic label connecting them.
Footnote 1: This is roughly equivalent to going back to the naive parser called ESH in (Covington, 2001), which has not seen practical use in parsing due to the lack of these constraints.
The parser processes sentences from left to right. Each decision leads to a new parsing configuration, which can be abstracted as a 4-tuple $(\lambda_1, \lambda_2, \beta, E)$ where:
• $\beta$ is a buffer that contains unprocessed words. They wait to be transformed into a concept or a part of a larger concept, or to be removed. In $b|\beta$, $b$ represents the head of $\beta$, and it can optionally be a concept. In that case, it is denoted as b in typewriter font.
• $\lambda_1$ is a list of previously created concepts that are waiting to determine their semantic relation with respect to $b$. Elements in $\lambda_1$ are concepts. In $\lambda_1|i$, $i$ denotes its last element.
• $\lambda_2$ is a list that contains previously created concepts for which the relation with $b$ has already been determined. Elements in $\lambda_2$ are concepts. In $j|\lambda_2$, $j$ denotes the head of $\lambda_2$.
• $E$ is the set of the created edges.
Given an input sentence, the parser starts at an initial configuration $c_s = ([0], [], 1|\beta, \{\})$ and applies valid transitions until a final configuration $c_f$ is reached, such that $c_f = (\lambda_1, \lambda_2, [], E)$. The set of transitions is formally defined in Table 1:
• SHIFT: Pops $b$ from $\beta$. $\lambda_1$, $\lambda_2$ and $b$ are appended.
• NO-ARC: It is applied when the algorithm determines that there is no semantic relationship between $i$ and $b$, but there is a relationship between some other node in $\lambda_1$ and $b$.
• CONFIRM: Pops the word $b$ from $\beta$ and puts the concept b in its place. This transition is called to handle words that only need to generate one (more) concept.
• BREAKDOWN$_\alpha$: Creates a concept b from the word $b$ and places it on top of $\beta$, but the word $b$ is not popped, so the new buffer state is b$|b|\beta$ (concept first). It is used to handle a word that is going to be mapped to multiple concepts. To guarantee termination, BREAKDOWN is parametrized with a constant $\alpha$, banning generation of more than $\alpha$ consecutive concepts by using this operation. Otherwise, concepts could be generated indefinitely without emptying $\beta$.
• REDUCE: Pops $b$ from $\beta$. It is used to remove words that do not add any meaning to the sentence and are not part of the AMR graph.
LEFT-ARC and RIGHT-ARC handle cycles and reentrancy, with the exception of cycles of length 2 (which only involve $i$ and $b$). To ensure full coverage, we include an additional transition, MULTIPLE-ARC, which creates edges in both directions between $i$ and $b$. MULTIPLE-ARCs are marginal and will not be learned in practice. AMR-COVINGTON can also be implemented without MULTIPLE-ARC, by keeping $i$ in $\lambda_1$ after creating an arc and using NO-ARC when the parser has finished creating connections between $i$ and $b$, at a cost to efficiency, as transition sequences would be longer. Multiple edges in the same direction between $i$ and $b$ are handled by representing them as a single edge that merges the labels.
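As an illustration, the transition system can be sketched as operations on a small configuration object. This is a minimal sketch under several assumptions that are not confirmed by the text above: the arc directions (LEFT-ARC creates b → i, RIGHT-ARC creates i → b) and the move of i to λ2 after an arc follow the usual Covington transition system (Nivre, 2008); SHIFT is assumed to make λ1·λ2·b the new λ1 and to empty λ2; concepts are identified by plain strings rather than indexes; the `ROOT` node and the `:top` label are placeholders; and the α limit on consecutive BREAKDOWNs is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]  # (head concept, label, dependent concept)


@dataclass
class Config:
    lam1: List[str]               # concepts still to be compared with b
    lam2: List[str]               # concepts already compared with b
    beta: List[Tuple[str, bool]]  # buffer items: (token, is_concept)
    edges: Set[Edge] = field(default_factory=set)


def initial(words: List[str]) -> Config:
    # "ROOT" plays the role of concept 0 in the initial configuration.
    return Config(lam1=["ROOT"], lam2=[], beta=[(w, False) for w in words])


def no_arc(c: Config) -> None:
    c.lam2.insert(0, c.lam1.pop())          # i moves from lam1 to lam2


def left_arc(c: Config, label: str) -> None:
    c.edges.add((c.beta[0][0], label, c.lam1[-1]))   # assumed: b -> i
    no_arc(c)


def right_arc(c: Config, label: str) -> None:
    c.edges.add((c.lam1[-1], label, c.beta[0][0]))   # assumed: i -> b
    no_arc(c)


def multiple_arc(c: Config, l1: str, l2: str) -> None:
    i, b = c.lam1[-1], c.beta[0][0]
    c.edges.update({(b, l1, i), (i, l2, b)})         # cycle of length 2
    no_arc(c)


def shift(c: Config) -> None:
    b, is_concept = c.beta.pop(0)
    assert is_concept, "SHIFT expects a concept at the head of the buffer"
    c.lam1, c.lam2 = c.lam1 + c.lam2 + [b], []


def confirm(c: Config, concept: str) -> None:
    c.beta[0] = (concept, True)             # the word is replaced by its concept


def breakdown(c: Config, concept: str) -> None:
    c.beta.insert(0, (concept, True))       # the word stays right behind the concept


def reduce_word(c: Config) -> None:
    c.beta.pop(0)                           # the word adds no concept


# Toy derivation of eat-01 :ARG0 dog from "the dog eats".
c = initial(["the", "dog", "eats"])
reduce_word(c)
confirm(c, "dog")
shift(c)
confirm(c, "eat-01")
left_arc(c, ":ARG0")
right_arc(c, ":top")
shift(c)
print(sorted(c.edges))
```

Because no transition checks whether a concept already has a head or whether an arc closes a cycle, reentrant edges need no special treatment in this sketch: a concept simply receives a second incoming edge when a later b is connected to it.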
Example Table 2 illustrates a valid transition sequence to obtain the AMR graph of Figure 1.

Training the classifiers
The algorithm relies on three classifiers: (1) a transition classifier, $T_c$, that learns the set of transitions introduced in §3.1, (2) a relation classifier, $R_c$, that predicts the label(s) of an edge when the selected action is a LEFT-ARC, RIGHT-ARC or MULTIPLE-ARC, and (3) a hybrid process (a concept classifier, $C_c$, and a rule-based system) that determines which concept to create when the selected action is a CONFIRM or BREAKDOWN.
Architecture We use feed-forward neural networks to train the three classifiers. The transition classifier uses 2 hidden layers (400 and 200 neurons) and the relation and concept classifiers use 1 hidden layer (200 neurons). The activation function in the hidden layers is $\mathrm{relu}(x) = \max(0, x)$ and their output is computed as $\mathrm{relu}(W_i \cdot x_i + b_i)$, where $W_i$ and $b_i$ are the weight and bias tensors to be learned and $x_i$ is the input at the $i$th hidden layer. The output layer uses a softmax function, computed as $P(y = s \mid x) = \frac{e^{x^T \theta_s}}{\sum_{s'=1}^{S} e^{x^T \theta_{s'}}}$. All classifiers are trained in mini-batches (size = 32), using Adam (Kingma and Ba, 2015) (learning rate set to 3e-4), early stopping (no patience) and dropout (Srivastava et al., 2014) (40%). The classifiers are fed with features extracted from the preprocessed texts. Different features are used depending on the classifier. These are summarized in Appendix A (Table 5); Appendix B describes other design decisions that are not shown here due to space reasons.
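The forward pass of the transition classifier described above can be sketched in plain Python. This is an illustrative sketch, not the released implementation: the input size (50), the number of output classes (8), the random initialization and all function names are assumptions, and training (Adam, dropout, early stopping) is left out.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(z):
    m = max(z)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(W, b, x):
    # One affine layer: W . x + b, with W stored as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(layers, x):
    hidden, (W_out, b_out) = layers[:-1], layers[-1]
    for W, b in hidden:
        x = relu(dense(W, b, x))            # relu(W_i . x_i + b_i)
    return softmax(dense(W_out, b_out, x))  # distribution over output classes


rng = random.Random(0)
n_features, n_classes = 50, 8               # hypothetical sizes
sizes = [n_features, 400, 200, n_classes]   # two hidden layers: 400 and 200 neurons
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward(layers, [rng.random() for _ in range(n_features)])
print(len(probs), round(sum(probs), 6))
```

The relation and concept classifiers would follow the same pattern with a single 200-neuron hidden layer, i.e. `sizes = [n_features, 200, n_classes]` in this sketch.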

Running the system
At each parsing configuration, we first try to find a multiword concept or entity that matches the head elements of $\beta$, to reduce the number of BREAKDOWNs, which turned out to be a difficult transition to learn (see §4.1). This is done by looking at a lookup table of multiword concepts seen in the training set and a set of rules, as introduced in (Damonte et al., 2017; Gildea et al., 2018).
We then invoke $T_c$ and call the corresponding subprocess when an additional concept or edge-label identification task is needed.

Concept identification
If the word at the top of $\beta$ occurred more than 4 times in the training set, we call a supervised classifier to predict the concept. Otherwise, we first look for a word-to-concept mapping in a lookup table. If none is found, we generate the concept lemma-01 if the word is a verb, and lemma otherwise.
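The fallback order described above can be written as a short function. The sketch below is only one way to express it; the function name, its arguments and the toy data are hypothetical, and `classifier` stands in for the supervised concept classifier.

```python
def identify_concept(word, lemma, is_verb, train_count, lookup, classifier,
                     min_count=4):
    """Pick a concept for `word` following the three-step fallback."""
    if train_count.get(word, 0) > min_count:
        return classifier(word)             # frequent word: supervised prediction
    if word in lookup:
        return lookup[word]                 # word-to-concept lookup table
    return f"{lemma}-01" if is_verb else lemma   # last resort: build it from the lemma


# Toy data, for illustration only.
train_count = {"eats": 12, "prince": 2}
lookup = {"prince": "prince"}
classifier = lambda w: "eat-01"

print(identify_concept("eats", "eat", True, train_count, lookup, classifier))
print(identify_concept("prince", "prince", False, train_count, lookup, classifier))
print(identify_concept("glimpsed", "glimpse", True, train_count, lookup, classifier))
```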
Edge label identification The classifier is invoked every time an edge is created. We use the list of valid ARGs allowed in PropBank framesets by Damonte et al. (2017). Also, if p and o are a PropBank and a non-PropBank concept, we restore inverted edges of the form o l-of

Methods and Experiments
Corpus We use the LDC2015E86 corpus and its official splits: 16 833 graphs for training, 1 368 for development and 1 371 for testing.The final model is only trained on the training split.
Metrics We use Smatch (Cai and Knight, 2013) and the metrics from Damonte et al. (2017).
Sources The code and the pretrained model used in this paper can be found at https://github.com/aghie/tb-amr.
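For readers unfamiliar with the Smatch metric mentioned above, it reports an F-score over matching graph triples. The sketch below computes precision, recall and F1 for two sets of triples under the simplifying assumption that node names are already aligned; the actual metric additionally searches for the variable mapping that maximizes the number of matches, which is not shown here. The function name and the toy triples are hypothetical.

```python
def triple_f1(predicted, gold):
    """Precision, recall and F1 over exactly matching (head, label, dependent) triples."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {("TOP", ":top", "eat-01"),
        ("eat-01", ":ARG0", "dog"),
        ("eat-01", ":ARG1", "bone")}
predicted = {("TOP", ":top", "eat-01"),
             ("eat-01", ":ARG0", "dog"),
             ("eat-01", ":ARG1", "cat"),
             ("cat", ":mod", "big")}
print(triple_f1(predicted, gold))
```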

Results and discussion
Table 3 shows the accuracy of $T_c$ on the development set. CONFIRM and REDUCE are the easiest transitions, as local information such as POS tags and words is discriminative enough to distinguish between content and function words. BREAKDOWN is the hardest action. In early stages of this work, we observed that this transition could learn to correctly generate multiple-term concepts for named entities that are not sparse (e.g. countries or people), but failed with sparse entities (e.g. dates or percent quantities). Low performance on identifying them negatively affects the edge metrics, which require both concepts of an edge to be correct. For this reason, we use the complementary rules mentioned above to handle named entities. RIGHT-ARCs are harder than LEFT-ARCs, although the reason remains an open question for us. The performance for NO-ARCs is high, but it would be interesting to achieve a higher recall at the cost of a lower precision, as predicting NO-ARCs makes the transition sequence longer, but could help identify more distant reentrancy. The accuracy of $T_c$ is ∼86%.
The accuracy of $R_c$ is ∼79%. We do not show the detailed results since the number of classes is too high. $C_c$ was trained on concepts occurring more than once in the training set, obtaining an accuracy of ∼83%. The accuracy on the development set with all concepts was ∼77%.
Table 4 compares the performance of our system with state-of-the-art models on the test set. AMR-COVINGTON obtains state-of-the-art results for all the standard metrics. It outperforms the rest of the models when handling reentrant edges. It is worth noting that D requires an additional classifier to handle a restricted set of reentrancy and P uses up to five classifiers to build the graph.
Table 4 (caption, partially recovered): [...] (Flanigan et al., 2016), D (Damonte et al., 2017), P (Peng et al., 2018). D, P and our system are left-to-right transition-based.
Discussion In contrast to related work that relies on ad-hoc procedures, the proposed algorithm handles cycles and reentrant edges natively. This is done by just removing the constraints on the arc transitions in the original Covington (2001) algorithm. The main drawback of the algorithm is its computational complexity. The transition system is expected to run in $O(n^2)$, like the original Covington parser. There are also collateral issues that impact the real speed of the system, such as predicting the concepts in a supervised way, given the large number of output classes (even after discarding the less frequent concepts, the classifier needs to discriminate among more than 7 000 concepts). In line with previous discussions (Damonte et al., 2017), it seems that using a supervised feed-forward network to predict the concepts does not lead to better overall concept identification than using simple lookup tables that pick up the most common node/subgraph. Currently, every node is kept in $\lambda$, and it is available to be part of new edges. We wonder if only storing in $\lambda$ the head node for words that generate multiple-node subgraphs (e.g. for the word father that maps to have-rel-role-91 :ARG2 father, keeping in $\lambda$ only the concept have-rel-role-91) could be beneficial for AMR-COVINGTON.
As a side note, current AMR evaluation involves elements such as neural network initialization, hooks and the (sub-optimal) alignments of evaluation metrics (e.g. Smatch) that introduce random effects that were difficult for us to quantify.

Conclusion
We introduce AMR-COVINGTON, a non-projective transition-based parser for unrestricted AMR. The set of transitions handles reentrancy natively. Experiments on the LDC2015E86 corpus show that our approach obtains results close to the state of the art and good behavior on reentrant edges.
Sequential models have shown that fewer hooks and lookup tables are needed to deal with the high sparsity of AMR (Ballesteros and Al-Onaizan, 2017).

Table 5: Set of proposed features for each classifier. [Table body not recovered; column headers: Features, $T_c$, $R_c$, $C_c$; first row group: From β0, β1, λ1.] POS, W, C, ENTITY are part-of-speech tag, word, concept and entity embeddings. EW are pre-trained external word embeddings, fine-tuned during the training phase (http://nlp.stanford.edu/data/glove.6B.zip, 100 dimensions). LM and RM are the leftmost and the rightmost function; and h, c, cc represent head, child and grand-child concepts of a concept; so, for example, LM c stands for the leftmost child of the concept. NH and NC are the number of heads and children of a concept. NPUNKT indicates the number of '.', ';', ':', '?', '!' that have already been processed. HL denotes the labels of the last assigned head. CT indicates the type of concept (constant, PropBank frameset, other). G indicates if a concept was generated using a CONFIRM or BREAKDOWN. D denotes the dependency label existing in the dependency tree between the word at the jth position in $\beta$ and the kth last in $\lambda_1$ and vice versa. The word that generated a concept is still accessible after creating the concept.
• [...] configurations are not fed as samples to the classifier. In early experiments it was observed that the BREAKDOWN transition could acceptably learn non-sparse named entities (e.g. countries and nationalities), but failed on the sparse ones (e.g. dates or money amounts). By processing the named entities with hooks instead, the aim was to make the parser familiar with the parsing configurations that are obtained after applying the hooks.
• Additionally, named-entity subgraphs and subgraphs coming from phrases (involving two or more terms) from the training set are saved into a lookup table.The latter ones had little impact.
• We store in a lookup table some single-word expressions that generated multiple-concept subgraphs in the training set, based on simple heuristics.We store words that denote a negative expression (e.g.undecided that maps to decide-01 :polarity -).We store words that always generated the same subgraph and occurred more than 5 times.We also store capitalized single words that were not previously identified as named entities.
• We use the verbalization list from Wang et al.
• When predicting a CONFIRM or BREAKDOWN for an uncommon word, we check if that word was mapped to a concept in the training set.If not, we generate the concept lemma-01 if it is a verb, otherwise lemma.
• Dates formatted as YYMMDD or YYYYMMDD are identified using a simple criterion (sequence of 6 or 8 digits) and transformed into YYYY-MM-DD in the test phase, as they were consistently misclassified as integer numbers in the development set.
• We apply a set of hooks similar to (Damonte et al., 2017) to determine if the predicted label is valid for that edge.
• We forbid generating the same concept twice consecutively. Also, we set $\alpha = 4$ for BREAKDOWN$_\alpha$.
• If a node is created, but it is not attached to any head node, we post-process it and connect it to the root node.
• We assume multi-sentence graphs should contain sentence punctuation symbols.If we predict a multi-sentence graph, but there is no punctuation symbol that splits sentences, we post-process the graph and transform the root node into an AND node.
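Two of the heuristics listed above, the date rewriting and the attachment of headless nodes to the root, can be sketched as follows. The text does not say how the century of a 6-digit date is chosen or which label is used for the new root edges, so the `century` default and the `label` argument below are assumptions, as are the function names.

```python
import re


def normalize_date(token, century="20"):
    """Rewrite a YYMMDD or YYYYMMDD token as YYYY-MM-DD.

    Tokens that are not a sequence of exactly 6 or 8 digits are returned
    unchanged. `century` is an assumed prefix for 6-digit dates.
    """
    if not re.fullmatch(r"\d{6}|\d{8}", token):
        return token
    if len(token) == 6:
        token = century + token
    return f"{token[:4]}-{token[4:6]}-{token[6:]}"


def attach_orphans(nodes, edges, root, label):
    """Return the edges plus one root edge for every node that has no head."""
    has_head = {dependent for _, _, dependent in edges}
    extra = {(root, label, n) for n in nodes if n != root and n not in has_head}
    return set(edges) | extra


print(normalize_date("20080315"), normalize_date("080315"), normalize_date("12345"))
print(sorted(attach_orphans(["eat-01", "dog", "bone"],
                            {("eat-01", ":ARG0", "dog")},
                            root="eat-01", label=":assumed-rel")))
```

Both functions are pure post-processing steps: they do not change the transition sequence and could be applied to the output of any parser that produces triples of this shape.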

Figure 1: AMR graph for 'When the prince arrived on the Earth, he was surprised not to see any people'. Words can refer to concepts by themselves (green), be mapped to PropBank framesets (red) or be broken down into multiple-term/non-literal concepts (blue). Prince plays different semantic roles.

Table 1: Transitions for AMR-COVINGTON.

Table 2: Sequence of gold transitions to obtain the AMR graph for the sentence 'When the prince arrived on the Earth, he was surprised not to see any people', introduced in Figure 1. For brevity, we represent words (and concepts) by their first character (plus an index if it is duplicated) and we only show the top three words for $\lambda_1$, $\lambda_2$ and $\beta$. Steps from 20 to 23(2) and from 28 to 31 manage the reentrant edges for prince (p) from surprise-01 (s) and see-01 (s2).

Table 3: $T_c$ scores on the development set.