SemEval-2017 Task 9: Abstract Meaning Representation Parsing and Generation

In this report we summarize the results of the 2017 AMR SemEval shared task. The task consisted of two separate yet related subtasks. In the parsing subtask, participants were asked to produce Abstract Meaning Representation (AMR) (Banarescu et al., 2013) graphs for a set of English sentences in the biomedical domain. In the generation subtask, participants were asked to generate English sentences given AMR graphs in the news/forum domain. A total of five sites participated in the parsing subtask, and four participated in the generation subtask. Along with a description of the task and the participants’ systems, we show various score ablations and some sample outputs.


Introduction
Abstract Meaning Representation (AMR) is a compact, readable, whole-sentence semantic annotation (Banarescu et al., 2013). It includes entity identification and typing, PropBank semantic roles (Kingsbury and Palmer, 2002), individual entities playing multiple roles, as well as treatments of modality, negation, etc. AMR abstracts in numerous ways, e.g., by assigning the same conceptual structure to fear (v), fear (n), and afraid (adj). Figure 1 gives an example.
Figure 1: The soldier was not afraid of dying. The soldier was not afraid to die. The soldier did not fear death.
In 2016 an AMR parsing shared task was held at SemEval (May, 2016). Task participants demonstrated several new directions in AMR parsing technology and also validated the strong performance of existing parsers. We sought, in 2017, to focus AMR parsing performance on the biomedical domain, for which a small but not insignificant training corpus had been produced. While sentences from this domain are quite formal compared to some of those evaluated in last year's task, they are also very complex, and have many terms unique to the domain. An example is shown in Figure 2. We continue to use Smatch (Cai and Knight, 2013) as a metric for AMR parsing, but we perform additional ablative analysis using the approach proposed by Damonte et al. (2016).
Along with parsing into AMR, it is important to encourage improvements in automatic generation of natural language (NL) text from AMR. Humans favor communication in NL. An AI that is able to parse text into AMR at a quality level indistinguishable from humans may be said to understand NL, but without the ability to render its own semantic representations into NL no human will ever be able to appreciate this.
Figure 2: Interestingly, serpinE2 mRNA and protein were also markedly enhanced in human CRC cells exhibiting mutation in KRAS and BRAF.
The advent of several systems that generate English text from AMR input (Flanigan et al., 2016b; Pourdamghani et al., 2016) inspired us to conduct a generation-based shared task from AMRs in the news/discussion forum domain. For the generation subtask, we solicited human judgments of sentence quality. We followed the precedent established by the Workshop in Machine Translation (Bojar et al., 2016) and used the Appraise solicitation system (Federmann, 2012), lightly modified, to gather human rankings, then TrueSkill (Sakaguchi et al., 2014) to elicit an overall system ranking.
Since the same training data and tools are available to both subtasks (though, in the case of the generation subtask, the utility of the Bio-AMR corpus is unclear), we will describe all the resources for both subtasks in Sections 2 and 3 but then will handle descriptions and ablations for the parsing and generation subtasks separately, in, respectively, Sections 4 and 5. Readers interested in only one of these subtasks should not feel compelled to read the other section. We will reconvene in Section 6 to conclude and discuss hardware, as we continue the tradition established last year in the awarding of trophies to the declared winners of each subtask.

Data
LDC released a new corpus of AMRs (LDC2016E25), created as part of the DARPA DEFT program, in March of 2016. The new corpus, which was annotated by teams at SDL, LDC, and the University of Colorado, and supervised by Ulf Hermjakob at USC/ISI, is an extension of previous releases (LDC2015E86, LDC2014E41 and LDC2014T12). It contains 39,260 sentences (subsuming, in turn, the 19,572 AMRs from LDC2015E86, the 18,779 AMRs from LDC2014E41, and the 13,051 AMRs from LDC2014T12), partitioned into training, development, and test splits, from a variety of news and discussion forum sources. Only participants in the generation subtask were provided with AMRs for an additional 1,293 sentences for evaluation; the original sentences were also provided, as needed, to human evaluators during the human evaluation phase of the generation subtask (see Section 5.2). These sentences and their corresponding AMRs were sequestered and never released as data before the evaluation phase.
We also made available the Bio-AMR corpus version 0.8, which consists of 6,452 AMR annotations of sentences from cancer-related PubMed articles, covering 3 full papers as well as the result sections of 46 additional PubMed papers. The corpus also includes about 1,000 sentences each from the BEL BioCreative training corpus and the Chicago Corpus. The Bio-AMR corpus was partitioned into training, development, and test splits. An additional 500 sentences and their AMRs were sequestered until the evaluation phase, at which point the sentences were provided to parsing task participants only. Table 1 summarizes the available data, including the split sizes.

Other Resources
We made the following resources available to participants:
• The tokenizer (from Ulf Hermjakob) used to produce the tokenized sentences in the training corpus.
• The AMR specification, used by annotators in producing the AMRs.
• The JAMR (Flanigan et al., 2014) parser.
• The JAMR (Flanigan et al., 2016b) generation system, as a strong generation baseline.

Systems
Five teams participated in the task, a noticeable decline from last year's task, which saw eleven full participants. One team submitted two systems for a total of six distinct systems. Two teams were repeats from last year: CMU and RIGOTRIO (previously RIGA). Below are brief descriptions of each of the various systems, based on summaries provided by the system authors. Readers are encouraged to consult individual system description papers or relevant conference paper descriptions for more details.

The Meaning Factory (van Noord and Bos, 2017)
This team submitted two parsers. TMF-1 is a character-level sequence-to-sequence deep learning model (https://www.tensorflow.org/tutorials/seq2seq/) similar to that of Barzdins and Gosko (2016), but with a number of pre- and postprocessing changes to improve results. TMF-2 is an ensemble of CAMR (Wang et al., 2015b) models trained on different data sets and the seq-to-seq model to find the best CAMR parse.

UIT-DANGNT-CLNLP (Nguyen and Nguyen, 2017)
This team implemented two wrapper layers for CAMR (Wang et al., 2015a). The first layer standardizes and adds additional information to input sentences to eliminate the weakness of the dependency parser observed when parsing scientific quotations, figures, formulas, etc. The second layer wraps the output data of CAMR. It is based on a prebuilt list of (biology term, AMR structure) pairs used to fix the output data of CAMR. This lets CAMR deal better with unknown scientific concepts.

Oxford (Buys and Blunsom, 2017)
This is a neural encoder-decoder AMR parser modeling the alignment between graph nodes and sentence tokens explicitly with a pointer mechanism. Candidate lemmas are predicted as a preprocessing step so that the lemmas of lexical node labels are factored out of the graph linearization.

Table 2: Main parsing results. For Smatch, a mean of ten runs with ten restarts per run is shown; standard deviation was about 0.0003 per system. For the remaining ablations, a single run was used.

CMU
This was the same JAMR parsing system used in last year's evaluation (Flanigan et al., 2016a). The participants declined to submit a new system description paper.

RIGOTRIO (Gruzitis et al., 2017)
This team extended their CAMR-based AMR parser from last year's shared task (Barzdins and Gosko, 2016) with a gazetteer for recognizing as named entities the biomedical compounds frequently mentioned in the biomedical texts. The gazetteer was populated from the provided biomedical AMR training data.

Quantitative Ablation
We made use of the analysis scripts produced by Damonte et al. (2016) to conduct a more fine-grained ablation of scores. As noted in that work, Smatch provides full-sentence analysis but some aspects of an AMR are more difficult to parse correctly than others. The ablation study considers only (or excludes) an aspect of the AMR and then calculates Smatch (or F1, when no heuristic matching is needed) with that limitation in place. Ablation scores are shown in Table 2. The ablations are:
• Unlabeled: All argument labels (e.g. ARG0, location) are replaced with a single common label.
• No WSD: PropBank frames indicating different senses (such as die-01 vs die-02) are conflated.
• NER: Only named entities are scored; that is, in both reference and hypothesis AMR, only nodes with an incoming arc labeled name are considered.
• Wiki: Only wikifications are scored; this is achieved in a manner similar to NER but with the incoming arc labeled wiki.
• Negation: Only concepts with an outgoing polarity arc are considered. In practice this arc is only used to indicate negation.
• Concepts: Only concepts, not relations, are scored.
• Reentrancies: Only concepts with two or more incoming relations are scored. Reentrancies occur when a concept has several mentions in a sentence, or where an 'inverted' relation (one that ends in -of) occurs, implying inverse dependency. In practice the latter is much more often the cause of a re-entrancy.
• Semantic Role Labeling (SRL): only relations corresponding to roles in PropBank, i.e. those named ARG0 and the like, are scored.
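To make the ablations above concrete, the following is a minimal sketch, not the Damonte et al. (2016) scripts. It assumes each AMR is already flattened into (source, relation, target) triples and that node variables are already aligned between reference and hypothesis, so plain F1 over triples can stand in for Smatch's heuristic matching. The toy graphs and the function names (`f1`, `unlabeled`, `no_wsd`, `negation`) are hypothetical and chosen only for illustration.

```python
import re
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (source node, relation, target node or value)

# Toy reference and hypothesis graphs (illustrative only, not task data).
# Node variables are assumed to be aligned already.
REF: Set[Triple] = {
    ("f", "instance", "fear-01"),
    ("s", "instance", "soldier"),
    ("d", "instance", "die-01"),
    ("f", "ARG0", "s"),
    ("f", "ARG1", "d"),
    ("f", "polarity", "-"),
    ("d", "ARG1", "s"),
}
HYP: Set[Triple] = {
    ("f", "instance", "fear-01"),
    ("s", "instance", "soldier"),
    ("d", "instance", "die-02"),   # wrong sense
    ("f", "ARG0", "s"),
    ("f", "ARG2", "d"),            # wrong role label
    ("f", "polarity", "-"),
}


def f1(ref: Set[Triple], hyp: Set[Triple]) -> float:
    """Harmonic mean of triple precision and recall."""
    matched = len(ref & hyp)
    if matched == 0:
        return 0.0
    precision = matched / len(hyp)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


def unlabeled(graph: Set[Triple]) -> Set[Triple]:
    """Unlabeled: replace every non-instance relation with one common label."""
    return {(s, r if r == "instance" else "REL", t) for s, r, t in graph}


def no_wsd(graph: Set[Triple]) -> Set[Triple]:
    """No WSD: strip the sense suffix (e.g. -01) from concept labels."""
    return {
        (s, r, re.sub(r"-\d+$", "", t) if r == "instance" else t)
        for s, r, t in graph
    }


def negation(graph: Set[Triple]) -> Set[Triple]:
    """Negation: keep only concepts that have an outgoing polarity arc."""
    negated = {s for s, r, _ in graph if r == "polarity"}
    return {(s, r, t) for s, r, t in graph if r == "instance" and s in negated}


if __name__ == "__main__":
    views = [("Full", lambda g: g), ("Unlabeled", unlabeled),
             ("No WSD", no_wsd), ("Negation", negation)]
    for name, view in views:
        print(f"{name}: {f1(view(REF), view(HYP)):.3f}")
```

In this toy case the Unlabeled and No WSD views each forgive one of the two hypothesis errors, which is why their scores rise relative to the full comparison.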
The ablation results show that superior Smatch performance correlates with superior performance in the Unlabeled, No WSD, NER, and Concepts ablations. Additionally, Figure 4, which plots each ablation score against Smatch and fits a linear regression, shows that six of the eight ablation sub-metrics are well correlated with Smatch; only wikification and negation are not. Wikification is generally handled as a separate process on top of overall AMR parsing; this may explain that discrepancy. We have no ready explanation for negation's weak correlation but note that it is generally considered a difficult task in semantics.

Discussion
It is interesting to note that the top-scoring system was, as in last year's shared task, based on CAMR (Wang et al., 2015b). It is also interesting to note that, in the Oxford team's submission, once again, a pure neural system is nearly as good as the CAMR system, despite having rather little data to train on. The Oxford system appears to be quite different from last year's neural submission (Foland and Martin, 2016) but nevertheless is a strong competitor. Finally, the top-scoring system, that of UIT-DANGNT-CLNLP, got a 0.61 Smatch, while last year's top scoring systems (Barzdins and Gosko, 2016;Wang et al., 2016) scored a 0.62, practically the same score. This, despite the fact that the evaluation corpora were quite different. One might expect the biomedical corpus to be easier to parse than the news/forum corpus, since its sentences are rather formal, and do not use slang or incorrect syntax. On the other hand, the sentences in the biomedical corpus are on average longer than those in the news/forum corpus (on average 25 words in bio vs. 14.5 in news/forum) and the biomedical corpus contains many unknown words, corresponding to domain terminology not in general use (1-count words are 9% of tokens in bio training, vs. 7.2% in news/forum). The news/forum corpus has, in its forum content, colloquialisms and writing variants that are very difficult to automatically analyze. Perhaps the relatively 'easy' and 'hard' parts of each corpus canceled each other out, yielding corpora that were about the same level of difficulty to parse. Nevertheless, it is somewhat concerning that AMR parsing quality appears to have stalled, as parsing performance remains in the low 0.60 range.

Generation Sub-Task
As AMR provides full-sentence semantics, it may be a suitable formalism for semantics-to-text generation. This subtask tested that hypothesis. Given that AMRs do not capture non-semantic surface phenomena nor some essential properties of realized text such as tense, we incorporated human judgments into our evaluation, since automatic metrics against a single reference were practically guaranteed to be inadequate.

Systems
Four teams participated in the task. We also included a submission from Pourdamghani et al. (2016) run by the organizer, though a priori declared that submission to be non-competitive due to a conflict of interest. Below we provide short summaries of each team's approach.

CMU
This was the JAMR generation system described in (Flanigan et al., 2016b). The participants declined to submit a system description paper.

Sheffield (Lampouras and Vlachos, 2017)
This team's method is based on inverting previous work on transition-based parsers, and casts NLG from AMR as a sequence of actions (e.g., insert/remove/rename edges and nodes) that progressively transform the AMR graph into a syntactic parse tree. It achieves this by employing a sequence of four classifiers, each focusing on a subset of the transition actions, and finally realizing the syntactic parse tree into the final sentence.

RIGOTRIO (Gruzitis et al., 2017)
For generation, this team's approach was to write transformation rules for converting AMR into Grammatical Framework (Ranta, 2004) abstract syntax from which semantically correct English text can be rendered automatically. In reality the approach worked for 10% of AMRs. For the submission, the remaining 90% of AMRs were converted to text using the JAMR (Flanigan et al., 2014) tool.

FORGe (Simon Mille and Wanner, 2017)
UPF-TALN's generation pipeline comprises a series of rule-based graph-transducers, for the syntacticization of the input graphs (converted to CoNLL format) and the resolution of morphological agreements, and an off-the-shelf statistical linearization component.

ISI
This was an internal, non-trophy-eligible submission based on the work of Pourdamghani et al. (2016). It views generation as phrase-based machine translation and learns a linearization of AMR such that the result can be used in an off-the-shelf Moses (Koehn et al., 2007) PBMT implementation.

Manual Evaluation
We used Appraise (Federmann, 2012), an open-source system for manual evaluation of machine translation, to conduct a human evaluation of generation quality. The system asks human judges to rank randomly selected systems' translations of sentences from the test corpus. This in turn yields pairwise preference information that can be used to effect an overall system ranking.
For the purposes of this task we needed to adapt the Appraise system to admit nested representations of AMRs, and to be compatible with our IT infrastructure. A screenshot is shown in Figure 3.
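As an illustration of how a single ranking judgment yields pairwise preference information, here is a minimal sketch. It is not Appraise code: the function name `ranking_to_pairs`, the system names, and the data layout are hypothetical, and it assumes that rank 1 is best and that a judge may assign equal ranks to express a tie.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# (system_a, system_b, outcome from system_a's point of view)
Comparison = Tuple[str, str, str]


def ranking_to_pairs(ranks: Dict[str, int]) -> List[Comparison]:
    """Expand one judge's ranking of several systems into pairwise outcomes.

    Assumes a lower rank number is better and equal ranks mean a tie.
    """
    pairs: List[Comparison] = []
    # Sort the system names so the output order is deterministic.
    for a, b in combinations(sorted(ranks), 2):
        if ranks[a] < ranks[b]:
            outcome = "win"
        elif ranks[a] > ranks[b]:
            outcome = "loss"
        else:
            outcome = "tie"
        pairs.append((a, b, outcome))
    return pairs


if __name__ == "__main__":
    # Hypothetical judgment: sysB ranked first, sysA and sysC tied for second.
    judgment = {"sysA": 2, "sysB": 1, "sysC": 2}
    for pair in ranking_to_pairs(judgment):
        print(pair)
```

A ranking of n systems produces n(n-1)/2 comparisons, so a judgment over three systems contributes three pairwise outcomes.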

Scoring
We provided BLEU as a potentially helpful automatic metric, but for the purposes of trophy-awarding we consider several metrics derived from the pairwise comparisons produced by manual evaluation to be the "true" evaluation metrics:
• Win+tie percentage: This is simply a system's "wins" (better pairwise comparisons) plus "ties" (equal comparisons) as a percentage of the total number of its pairwise comparisons. This metric was largely used to induce rankings from human judgments through WMT 2011.
• Win percentage: This is a "harsher" version of Win+tie; the percentage is wins / (wins + ties + losses). Essentially, ties are judged as losses. This was used in WMT 2011 and 2012.
• TrueSkill (Sakaguchi et al., 2014): This is an adaptation of a metric developed for player rankings in ongoing competitions such as on Microsoft Xbox Live. The metric maintains estimates of player (i.e., generation system) ability as Gaussian distributions and rewards events (i.e., pairwise rankings of outputs) that are unexpected, such as a poorly ranked player outperforming a highly-ranked player, more than expected events.
Table 4: Human judgments of generation results after self-judgments are removed. The results are fundamentally the same.
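The two percentage metrics in the list above can be sketched as follows. This is an illustrative calculation under stated assumptions, not the scoring code used for the task: the comparison tuples, system names, and function names (`tally`, `win_percentage`, `win_tie_percentage`) are hypothetical. TrueSkill is not covered here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (system_a, system_b, outcome from system_a's point of view)
Comparison = Tuple[str, str, str]

FLIP = {"win": "loss", "loss": "win", "tie": "tie"}


def tally(comparisons: Iterable[Comparison]) -> Dict[str, Counter]:
    """Count wins, ties, and losses per system over all pairwise comparisons."""
    counts: Dict[str, Counter] = {}
    for a, b, outcome in comparisons:
        counts.setdefault(a, Counter())[outcome] += 1
        # The same comparison seen from the other system's side.
        counts.setdefault(b, Counter())[FLIP[outcome]] += 1
    return counts


def win_percentage(c: Counter) -> float:
    """wins / (wins + ties + losses): ties count against the system."""
    total = c["win"] + c["tie"] + c["loss"]
    return 100.0 * c["win"] / total


def win_tie_percentage(c: Counter) -> float:
    """(wins + ties) / (wins + ties + losses)."""
    total = c["win"] + c["tie"] + c["loss"]
    return 100.0 * (c["win"] + c["tie"]) / total


if __name__ == "__main__":
    # Hypothetical pairwise outcomes, not actual task judgments.
    comparisons = [
        ("sysA", "sysB", "win"),
        ("sysA", "sysC", "win"),
        ("sysB", "sysC", "tie"),
        ("sysB", "sysA", "win"),
    ]
    for system, c in sorted(tally(comparisons).items()):
        print(system, round(win_percentage(c), 1), round(win_tie_percentage(c), 1))
```

The two metrics differ only in how ties are treated, so a system with many ties can rank noticeably higher under Win+tie than under Win.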
We note that the three metrics derived from human pairwise rankings agree with the relative ordering of the submitted systems' abilities on the evaluation data, while the BLEU metric does not. It is not terribly surprising that BLEU does not correlate with human judgment; it was designed for a very different task.
Since the participants in this task were also judges in the human evaluation, we were somewhat concerned that implicit bias might lead to a skewing of the results, even though system identification was not available during evaluation. We thus removed all judgments that involved self-scoring and recalculated results. The results, shown in Table 4, show little difference from the main results.

Qualitative Analysis
The generation task was quite challenging, as generation from AMR is still a nascent field. Table 5 shows an example of a single AMR and the content generated by each system for it, along with the number of wins, ties, and losses per system by the human evaluations (note: not all segments were scored for all systems, and not all systems received the same number of comparisons). Correcting some systematic errors, such as the incorporation of label text into the generated output, could lead to improvements, as could a stronger language model; generated output is often disfluent.

Conclusion
Both biomedical AMR parsing and generation from AMRs appear to be challenging tasks; perhaps too challenging, as the number of participants in either subtask was significantly lower than the participation rate from a year ago. However, we observed that AMR parsing quality on the seemingly more difficult biomedical domain was no worse than that observed on the news/forum domain. In fact, the same fundamental technology that dominated in last year's evaluation once again reigned supreme. A concern that Smatch was too coarse a metric to evaluate AMRs was not borne out, as scores in an ablation study tracked well with the overall Smatch score. We are pleased to award the parsing trophy to the UIT-DANGNT-CLNLP team, which added domain-specific modification to the strong CAMR (Wang et al., 2015b) parsing platform.
On the generation side, it seems that there is still a long way to go to reach fluency. We note that BLEU, which is often used as a generation metric, is woefully inadequate compared to human evaluation. We hope the analysis presented here will lead to better generation systems in the future. It was clear from the human evaluations, however, that the RIGOTRIO team prevailed and will receive the generation trophy.