SemEval-2016 Task 8: Meaning Representation Parsing

In this report we summarize the results of the SemEval 2016 Task 8: Meaning Representation Parsing. Participants were asked to generate Abstract Meaning Representation (AMR) (Banarescu et al., 2013) graphs for a set of English sentences in the news and discussion forum domains. Eleven sites submitted valid systems. The availability of state-of-the-art baseline systems was a key factor in lowering the bar to entry; many submissions relied on CAMR (Wang et al., 2015b; Wang et al., 2015a) as a baseline system and added extensions to it to improve scores. The evaluation set was quite difficult to parse, particularly due to creative approaches to word representation in the web forum portion. The top scoring systems scored 0.62 F1 according to the Smatch (Cai and Knight, 2013) evaluation heuristic. We show some sample sentences along with a comparison of system parses and perform quantitative ablative studies.


Introduction
Abstract Meaning Representation (AMR) is a compact, readable, whole-sentence semantic annotation (Banarescu et al., 2013). It includes entity identification and typing, PropBank semantic roles (Kingsbury and Palmer, 2002), individual entities playing multiple roles, as well as treatments of modality, negation, etc. AMR abstracts in numerous ways, e.g., by assigning the same conceptual structure to fear (v), fear (n), and afraid (adj). Figure 1 gives an example.
[Figure 1: an example AMR, shared by the sentences "The soldier was not afraid to die." and "The soldier did not fear death."]
There has been substantial interest in creating parsers to recover this formalism from plain text. Several parsers were released in the past couple of years (Flanigan et al., 2014; Wang et al., 2015b; Werling et al., 2015; Wang et al., 2015a; Artzi et al., 2015; Pust et al., 2015). This body of work constitutes many diverse and interesting scientific contributions, but it is difficult to adequately determine which parser is numerically superior, due to heterogeneous evaluation decisions and the lack of a controlled blind evaluation. The purpose of this task, therefore, was to provide a competitive environment in which to determine one winner and award a trophy to said winner.

Training Data
LDC released a new corpus of AMRs (LDC2015E86), created as part of the DARPA DEFT program, in August of 2015. The new corpus, which was annotated by teams at SDL, LDC, and the University of Colorado, and supervised by Ulf Hermjakob at USC/ISI, is an extension of previous releases (LDC2014E41 and LDC2014T12). It contains 19,572 sentences (subsuming, in turn, the 18,779 AMRs from LDC2014E41 and the 13,051 AMRs from LDC2014T12), partitioned into training, development, and test splits, from a variety of news and discussion forum sources. The AMRs in this corpus have changed somewhat from their counterparts in LDC2014E41, consistent with the evolution of the AMR standard. They now contain wikification via the :wiki attribute, they use new (as of July 2015) PropBank framesets that are unified across parts of speech, they have been deepened in a number of ways, and various corrections have been applied.

Other Resources
We made the following resources available to participants:
• The aforementioned AMR corpus (LDC2015E86), which included automatically generated AMR-English alignments over tokenized sentences.
• The tokenizer (from Ulf Hermjakob) used to produce the tokenized sentences in the training corpus.
• The AMR specification, used by annotators in producing the AMRs.
• A deterministic, input-agnostic trivial baseline 'parser' courtesy of Ulf Hermjakob.
• The JAMR parser (Flanigan et al., 2014) as a strong baseline. We provided setup scripts to process the released training data but otherwise provided the parser as is.
• The same Smatch scoring script used in the evaluation.
• A Python AMR manipulation library, from Nathan Schneider.

Evaluation Data
For the specific purposes of this task, DEFT commissioned and LDC released an additional set of English sentences along with AMR annotations that had not been previously seen. This blind evaluation set consists of 1,053 sentences in a roughly 50/50 discussion forum/newswire split. The distribution of sentences by source is shown in Table 1.

Task Definition
We deliberately chose a single, simple task. Participants were given English sentences and had to return an AMR graph (henceforth, 'an AMR') for each sentence. AMRs were scored against a gold AMR with Smatch, a heuristic, F1-based tool and metric. Smatch is calculated by matching instance, attribute, and relation tuples to a reference AMR (see Section 7.2). Since variable naming need not be globally consistent, heuristic hill-climbing is done to search for the best match in sub-exponential time. A trophy was given to the team with the highest Smatch score under consistent heuristic conditions.
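The scoring idea can be illustrated with a minimal sketch. It is not the Smatch implementation: for clarity it replaces the hill-climbing search with an exhaustive search over variable mappings, which is only feasible for tiny graphs. The gold triples follow the Figure 1 example as described in this report; the hypothesis graph, the dictionary layout, and the function names are assumptions made for illustration.

```python
from itertools import permutations

def count_matches(hyp, ref, mapping):
    """Count hypothesis triples found in the reference under a variable mapping."""
    inst = sum(1 for v, c in hyp["instances"]
               if (mapping[v], c) in ref["instances"])
    attr = sum(1 for v, label, value in hyp["attributes"]
               if (mapping[v], label, value) in ref["attributes"])
    rel = sum(1 for s, label, t in hyp["relations"]
              if (mapping[s], label, mapping[t]) in ref["relations"])
    return inst + attr + rel

def size(graph):
    """Total number of triples in a graph."""
    return sum(len(graph[k]) for k in ("instances", "attributes", "relations"))

def smatch_like_f1(hyp, ref):
    """F1 over triples, maximized over all one-to-one variable mappings.

    Exhaustive search stands in for Smatch's hill-climbing (an assumption
    made to keep the sketch short and deterministic).
    """
    hyp_vars = [v for v, _ in sorted(hyp["instances"])]
    ref_vars = [v for v, _ in sorted(ref["instances"])]
    # Pad with None so surplus hypothesis variables can stay unaligned.
    padded = ref_vars + [None] * max(0, len(hyp_vars) - len(ref_vars))
    best = 0
    for targets in permutations(padded, len(hyp_vars)):
        mapping = dict(zip(hyp_vars, targets))
        best = max(best, count_matches(hyp, ref, mapping))
    if best == 0:
        return 0.0
    precision, recall = best / size(hyp), best / size(ref)
    return 2 * precision * recall / (precision + recall)

# Gold graph: the Figure 1 example, with the implicit TOP attribute.
GOLD = {
    "instances": {("f", "fear-01"), ("s", "soldier"), ("d", "die-01")},
    "attributes": {("f", "polarity", "-"), ("f", "TOP", "fear-01")},
    "relations": {("f", "ARG0", "s"), ("f", "ARG1", "d"), ("d", "ARG1", "s")},
}
# Hypothetical system output: different variable names, no polarity,
# and one relation missing.
HYP = {
    "instances": {("a", "fear-01"), ("b", "soldier"), ("c", "die-01")},
    "attributes": {("a", "TOP", "fear-01")},
    "relations": {("a", "ARG0", "b"), ("a", "ARG1", "c")},
}

score = smatch_like_f1(HYP, GOLD)
print(round(score, 3))  # 0.857: precision 6/6, recall 6/8
```

Because the variable names differ between the two graphs, no triple matches until a mapping is chosen, which is why the search over mappings is part of the metric itself.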

Participants and Results
Eleven teams participated in the task. Their systems and scores are shown in Table 2. Below are brief descriptions of each of the various systems, based on summaries provided by the system authors. Readers are encouraged to consult individual system description papers for more details.

CAMR-based systems
A number of teams made use of the CAMR system from Wang et al. (2015a). These systems were among the highest-scoring and differed little from one another in system score.

6.1.1 Brandeis / cemantix.org / RPI (Wang et al., 2016)
This team, the originators of CAMR, started with their existing AMR parser and experimented with three sets of new features: 1) rich named entities, 2) a verbalization list, and 3) semantic role labels. They also used the RPI Wikifier to wikify the concepts in the AMR graph.

6.1.2 ICL-HD (Brandt et al., 2016)
This team attempted to improve AMR parsing by exploiting preposition semantic role labeling information retrieved from a multi-layer feed-forward neural network. Prepositional semantics were included as features in CAMR. The inclusion of the features modified the behavior of CAMR when creating meaning representations triggered by prepositional semantics.

RIGA (Barzdins and Gosko, 2016)
Besides developing a novel character-level neural translation based AMR parser, this team also extended the Smatch scoring tool with the C6.0 rule-based classifier to produce a human-readable report on the frequency of error patterns observed in the scored AMR graphs. They improved CAMR by adding a manually crafted wrapper that fixes the identified CAMR parser errors. A small further gain was achieved by combining the neural and CAMR+wrapper parsers in an ensemble.

6.1.4 M2L (Puzikov et al., 2016)
This team attempted to improve upon CAMR by using a feed-forward neural network classification algorithm. They also experimented with various ways of enriching CAMR's feature set. Unlike ICL-HD and RIGA, they were not able to benefit from feed-forward neural networks, but were able to benefit from feature enhancements.

Other Approaches
The other teams either improved upon their existing AMR parsers, converted existing semantic parsing tools and pipelines into AMR, or constructed AMR parsers from scratch with novel techniques.

6.2.1 CLIP@UMD (Rao et al., 2016)
This team developed a novel technique for AMR parsing that uses the Learning to Search (L2S) algorithm. They decomposed the AMR prediction problem into three problems: predicting the concepts, predicting the root, and predicting the relations between the predicted concepts. Using L2S allowed them to model the learning of concepts and relations in a unified framework which aims to minimize the loss over the entire predicted structure, as opposed to minimizing the loss over concepts and relations in two separate stages.

(Flanigan et al., 2016)
This team's entry is a set of improvements to JAMR (Flanigan et al., 2014). The improvements are: a novel training loss function for structured prediction, new sources for concepts, improved features, and improvements to the rule-based aligner in Flanigan et al. (2014). The overall architecture of the system and the decoding algorithms for con-

(Butler, 2016)
No use was made of the training data provided by the task. Instead, existing components were combined to form a pipeline able to take raw sentences as input and output meaning representations. The components are a part-of-speech tagger and parser trained on the Penn Parsed Corpus of Modern British English to produce syntactic parse trees, a semantic role labeler, and a named entity recognizer to supplement obtained parse trees with word sense, functional and named entity information. This information is passed into an adapted Tarskian satisfaction relation for a Dynamic Semantics that is used to transform a syntactic parse into a predicate logic based meaning representation, followed by conversion to the required Penman notation.

(Bjerva et al., 2016)
This team employed an existing open-domain semantic parser, Boxer (Curran et al., 2007), which produces semantic representations based on Discourse Representation Theory. As the meaning representations produced by Boxer are considerably different from AMRs, the team used a hybrid conversion method to map Boxer's output to AMRs. This process involves lexical adaptation, a conversion from DRT representations to AMR, as well as post-processing of the output.

(Goodman et al., 2016)
This team developed a novel transition-based parsing algorithm using exact imitation learning, in which the parser learns a statistical model by imitating the actions of an expert on the training data. They used the imitation learning algorithm DAGGER to improve the performance, and applied an alpha-bound as a simple noise reduction technique.

They applied Markov Chain Monte Carlo (MCMC) algorithms to learn Synchronous Hyperedge Replacement Grammar (SHRG) rules from a forest that represents likely derivations consistent with a fixed string-to-graph alignment (extracted using an automatic aligner). They drew an analogy between string-to-AMR parsing and phrase-based machine translation and devised an efficient algorithm to learn graph grammars from string-graph pairs. They proposed an effective approximation strategy to resolve the complexity issue of graph compositions. They then used the Earley algorithm with cube-pruning to parse new sentences into AMRs with the learned SHRG.

6.2.7 CU-NLP (Foland and Martin, 2016)
This parser does not rely on a syntactic pre-parse, or heavily engineered features, and uses five recurrent neural networks as the key architectural components for estimating AMR graph structure.

Result Ablations
We conduct several ablations to attempt to empirically determine what aspects of the AMR parsing task were more or less difficult for the various systems.

Impact of Wikification
The AMR standard has recently been expanded to include wikification and the data used in this task reflected that expansion. Since this is a rather recent change to the standard and requires some kind of global external knowledge of, at a minimum, Wikipedia's ontology, we suspected performance on :wiki attributes would suffer. To measure the effect of wikification, we performed two ablation experiments, the results of which are in Figure 2. In the first ("no wiki"), we removed :wiki attributes and their values from reference and system sets before scoring. In the second ("bad wiki"), we replaced the value of all :wiki attributes with a dummy entry to artificially create systems that did not get any wikification correct.
The "no wiki" ablations show that the inclusion of wikification in the AMR standard had a very small impact on overall system scores. No system's score changed by more than 0.01 when wikification was removed, indicating that systems appear to wikify about as well as they handle the rest of AMR's attributes. The "bad wiki" ablations show a performance drop of around 0.02 to 0.03 for six of the systems when wikification is corrupted, and a negligible drop for the remaining systems. This result indicates that the systems with a performance drop are doing a fairly good job at wikification.
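The two ablations can be sketched as string rewrites over Penman-notation AMRs. This is an illustration only: the report does not say how the ablations were implemented, so the regular expression, the function names, the dummy value, and the sample AMR fragment below are all assumptions.

```python
import re

# Matches a :wiki attribute whose value is a quoted string or "-".
# Assumes wiki values never contain an embedded double quote.
WIKI_VALUE = r'(?:"[^"]*"|-)'

def no_wiki(amr: str) -> str:
    """'no wiki' ablation: drop every :wiki attribute together with its value."""
    return re.sub(r'\s*:wiki\s+' + WIKI_VALUE, '', amr)

def bad_wiki(amr: str, dummy: str = "DUMMY") -> str:
    """'bad wiki' ablation: overwrite every :wiki value with a dummy entry."""
    return re.sub(r'(:wiki\s+)' + WIKI_VALUE,
                  lambda m: f'{m.group(1)}"{dummy}"', amr)

# Hypothetical AMR fragment, written for this example.
AMR = '(c / country :wiki "United_Kingdom" :name (n / name :op1 "United" :op2 "Kingdom"))'

print(no_wiki(AMR))
print(bad_wiki(AMR))
```

For "no wiki" the rewrite would be applied to both the reference and the system output before scoring; for "bad wiki" only the system output is corrupted, so every :wiki attribute it contains counts as an error.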

Performance on different parts of the AMR
In this set of ablations we examine systems' relative performance on correctly identifying instances, attributes, and relations of the AMRs. Instances are the labeled nodes of the AMR. In the example AMR of Figure 1, the instances are fear-01, soldier, and die-01. To match an instance one must simply match the instance's label. Attributes are labeled string properties of nodes. In the example AMR, there is a polarity attribute attached to the fear-01 instance with a value of "-". There is also an implicit attribute of "TOP" attached to the root node of the graph, with the node's instance as the attribute value. To match an attribute one must match the attribute's label and value, and the attribute's instance must be aligned with the corresponding instance in the reference graph.
Relations are labeled edges between two instances. In the example AMR, the relations (f, s, ARG0), (f, d, ARG1), and (d, s, ARG1) exist. To match a relation, the labeled edge between two nodes of the hypothesis must match the label of the edge between the correspondingly aligned nodes of the reference graph.
It should not be surprising that systems tend to perform best at instance matching and worst at relation matching. Note, however, that the best performing systems on instances and relations were not the overall best performing systems. Ablation results can be seen in Table 3.
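The per-category breakdown can be sketched as follows, assuming the node alignment between hypothesis and reference has already been found. The reference triples are those of the Figure 1 example as described above; the hypothesis graph, its alignment, and all function names are hypothetical and chosen only to show one attribute omission and one relation-label error.

```python
def prf(matched, n_hyp, n_ref):
    """Precision, recall, and balanced F1 from raw counts."""
    p = matched / n_hyp if n_hyp else 0.0
    r = matched / n_ref if n_ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def apply_alignment(graph, alignment):
    """Rename hypothesis variables to the reference variables they align with."""
    return {
        "instances": {(alignment[v], c) for v, c in graph["instances"]},
        "attributes": {(alignment[v], l, x) for v, l, x in graph["attributes"]},
        "relations": {(alignment[s], l, alignment[t])
                      for s, l, t in graph["relations"]},
    }

REFERENCE = {
    "instances": {("f", "fear-01"), ("s", "soldier"), ("d", "die-01")},
    "attributes": {("f", "polarity", "-"), ("f", "TOP", "fear-01")},
    "relations": {("f", "ARG0", "s"), ("f", "ARG1", "d"), ("d", "ARG1", "s")},
}
# Hypothetical parse: the polarity attribute is missing and the last
# relation carries the wrong label (ARG0 instead of ARG1).
HYPOTHESIS = {
    "instances": {("x", "fear-01"), ("y", "soldier"), ("z", "die-01")},
    "attributes": {("x", "TOP", "fear-01")},
    "relations": {("x", "ARG0", "y"), ("x", "ARG1", "z"), ("z", "ARG0", "y")},
}
ALIGNMENT = {"x": "f", "y": "s", "z": "d"}  # assumed to be given

aligned = apply_alignment(HYPOTHESIS, ALIGNMENT)
scores = {
    kind: prf(len(aligned[kind] & REFERENCE[kind]),
              len(aligned[kind]), len(REFERENCE[kind]))
    for kind in REFERENCE
}
for kind, (p, r, f) in scores.items():
    print(f"{kind}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this toy case every instance matches, while the attribute and relation scores both fall to an F1 of 2/3, which mirrors the general pattern that instance matching only requires the right label whereas relations also require both endpoints to be aligned correctly.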

Performance on different data sources
As discussed in Section 8, less formal sentences, sentences with misspellings, and sentences with non-standard representations of meaning were the hardest to parse. We ablate the results by domain of origin in Table 4. While the strongest-performing systems tended to perform best across ablations, we note that the machine-translated and informal corpora were overall the hardest sections to parse.

Qualitative Comparison
In this section we examine some of the sentences that the systems found particularly easy or difficult to parse.

Easiest Sentences
The easiest sentence to parse in the eval corpus was the sentence "I was tempted." It has a gold AMR of:
(t / tempt-01 :ARG1 (i / i))
The mean score for this sentence was 0.977. All submitted systems except one parsed it perfectly.
Another sentence that was quite easy to parse was the sentence "David Cameron is the prime minister of the United Kingdom." Two systems parsed it perfectly and a third omitted wikification but was otherwise perfect. Figure 3 shows a detailed comparison of each system's performance on the sentence. In general we see that shorter sentences from the familiar and formal news domain are parsed best by the submitted systems.

(y / yes)

M E D I A A D V I S O R Y
(a / advise-01
   :ARG1 (m / media))

Data noise was another confounding factor. In the next example, which had an average score of 0.17, parsers were confused both by the misspelling ("lie" for "like") and by the quoted title, which all systems except UCL+Sheffield tried to parse for meaning.

Why not a title lie "School Officials Screw over Rape Victim?"
(t / title-01
   :polarity -
   :ARG1-of (r / resemble-01
      :ARG2 (t2 / title-01
         :wiki "A_Rape_on_Campus"
         :name (n2 / name :op1 "School" :op2 "Officials" :op3 "Screw" :op4 "Over" :op5 "Rape" :op6 "Victim")))
   :ARG1-of (c / cause-01
      :ARG0 (a / amr-unknown)))

We note that none of these difficult sentences is conceptually hard for humans to parse. Humans have far less difficulty in resolving errors or processing non-standard tokenization than do computers.

There Can Be Only One?
We intended to award a single trophy to the single best system, according to the narrow evaluation conditions (balanced F1 via Smatch 2.0.2 with 5 restarts, to two decimal places). However, the top two systems, Brandeis/cemantix.org/RPI and RIGA, scored identically according to that metric. Hoping to elicit some consistent difference between the systems, we ran Smatch with 20 restarts, looked at four decimal places, and re-ran five times. Each system scored a mean of 0.6214 with standard deviation of 0.00013. We thus capitulate in the face of overwhelming statistics and award the inaugural trophy to both teams, equally.

Conclusion
The results of this competition and the interest in participation in it demonstrate that AMR parsing is a difficult, competitive task. The large number of systems using released code lowered the bar to entry significantly but may have led to a narrowing of diversity in approaches. Low-level irregularities such as creative tokenization and misspellings befuddled the systems. We hope to conduct another AMR parsing competition in the future, in the biomedical domain, and also conduct a generation competition.