A Human Evaluation of AMR-to-English Generation Systems

Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to those of automatic metrics, finding that while the metrics are mostly successful in ranking systems overall, collecting human judgments allows for more nuanced comparisons. We also analyze common errors made by these systems.


Introduction
Abstract Meaning Representation, or AMR (Banarescu et al., 2013), is a representation of the meaning of a sentence as a rooted, labeled, directed acyclic graph. For example, (l / label-01 :ARG0 (c / country :wiki "Georgia_(country)" :name (n / name :op1 "Georgia")) :ARG1 (s / support-01 :ARG0 (c2 / country :wiki "Russia" :name (n2 / name :op1 "Russia"))) :ARG2 (a / act-02 :mod (a2 / annex-01))) represents the sentence "Georgia labeled Russia's support an act of annexation." AMR does not represent some morphological and syntactic details such as tense, number, definiteness and word order; thus, this same AMR could also represent alternate phrasings such as "Russia's support is being labeled an act of annexation by Georgia." AMR generation is the task of generating a sentence in natural language (in this case, English) from an AMR graph. This has applications to a range of NLP tasks, including summarization (Liao et al., 2018) and machine translation (Song et al., 2019). Like other Natural Language Generation (NLG) tasks, this is difficult to evaluate due to the range of possible valid sentences corresponding to any single AMR.
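To make the graph structure concrete, the example AMR above can be written as a list of (source, relation, target) triples. The following is a minimal Python sketch; the triple encoding and the helper names `concept_of` and `reachable_variables` are illustrative assumptions, not part of any AMR toolkit.

```python
from collections import deque

# Illustrative encoding of the example AMR as (source, relation, target) triples.
# ":instance" links a variable to its concept; other relations label graph edges.
TRIPLES = [
    ("l", ":instance", "label-01"),
    ("l", ":ARG0", "c"),
    ("c", ":instance", "country"),
    ("c", ":wiki", '"Georgia_(country)"'),
    ("c", ":name", "n"),
    ("n", ":instance", "name"),
    ("n", ":op1", '"Georgia"'),
    ("l", ":ARG1", "s"),
    ("s", ":instance", "support-01"),
    ("s", ":ARG0", "c2"),
    ("c2", ":instance", "country"),
    ("c2", ":wiki", '"Russia"'),
    ("c2", ":name", "n2"),
    ("n2", ":instance", "name"),
    ("n2", ":op1", '"Russia"'),
    ("l", ":ARG2", "a"),
    ("a", ":instance", "act-02"),
    ("a", ":mod", "a2"),
    ("a2", ":instance", "annex-01"),
]
ROOT = "l"  # the graph is rooted at the variable for label-01

# Variables are exactly the nodes that carry an :instance triple.
VARIABLES = {src for src, rel, _ in TRIPLES if rel == ":instance"}


def concept_of(var):
    """Return the concept attached to a variable, e.g. 'l' -> 'label-01'."""
    for src, rel, tgt in TRIPLES:
        if src == var and rel == ":instance":
            return tgt
    raise KeyError(var)


def reachable_variables(root):
    """Breadth-first search over variable-to-variable edges from the root."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for src, rel, tgt in TRIPLES:
            if src == node and tgt in VARIABLES and tgt not in seen:
                seen.add(tgt)
                queue.append(tgt)
    return seen
```

Every variable is reachable from the root, which is what "rooted" means here. Note that nothing in the triples records tense, number, definiteness, or word order, which is why several English sentences can realize the same graph.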
Currently, AMR generation systems are typically evaluated only with automatic metrics that compare a generated sentence to a single human-authored reference; for AMR, this is the sentence from which the AMR graph was originally created. However, there is evidence that these metrics may not be a good representation of human judgments for AMR generation (May and Priyadarshi, 2017) and NLG in general (see §2.1).
Thus, in this work, we present a new human evaluation of several recent AMR generation systems, most of which have not previously been manually evaluated. Our methodology (§3) differs in several ways from previous evaluations of AMR generation, including separate direct assessment of fluency and adequacy, and asking annotators to evaluate sentences without comparison to a reference, in order to avoid biasing them toward a particular wording. We analyze (§4) what our results show about the relative quality of the systems and how this compares to their scores from automatic metrics, finding that these metrics are mostly accurate in ranking systems, but that collecting separate judgments for fluency, adequacy, and error types allows us to characterize the relative strengths and weaknesses of each system in more detail. Finally, we discuss common errors among sentences which received low scores from annotators, identifying issues for future researchers to address, including hallucination, anonymization, and repetition.
In §2.1 we discuss previous work on evaluation of AMR generation and related NLG tasks, both with automatic metrics and human evaluation. In §2.2 we survey recent work in AMR generation and describe the systems which we evaluate.

Evaluation of AMR Generation
Automatic Metrics: The vast majority of AMR generation papers measure their performance only with automatic metrics. The most common of these metrics is BLEU (Papineni et al., 2002), which is typically used to determine the state of the art. However, it is unclear whether BLEU is a reliable metric to compare AMR generation systems: May and Priyadarshi (2017) found that BLEU disagreed with human judgments on the ranking of five AMR generation systems, including disagreeing on which system was the best. Concerns have also been raised about the suitability of BLEU for NLG in general; for example, Reiter (2018) found that BLEU has generally poor correlations with human judgments for NLG. Novikova et al. (2017) compared many metrics to human judgments on NLG from meaning representations and concluded that use of reference-based metrics relies on an invalid assumption that references are correct and complete enough to be used as a gold standard. Some recent AMR generation papers have reported other automatic metrics alongside BLEU. Many have reported METEOR (Banerjee and Lavie, 2005), and a few have included TER (Snover et al., 2006) and, most recently, CHRF++ (Popović, 2017). However, it is unclear how accurately any of these metrics capture the relative performance of AMR generation systems.
Human Evaluation: Prior to this work, the only human evaluation comparing several AMR generation systems was the SemEval-2017 AMR shared task, which used a ranking-based evaluation of five systems (May and Priyadarshi, 2017). All of these systems perform far below the current state of the art, making a new evaluation necessary.
While most AMR generation papers have reported no human evaluation of their systems, a few have conducted smaller-scale evaluations. In particular, Ribeiro et al. (2019) conducted a Mechanical Turk evaluation to compare their best graph encoder model with a sequence-to-sequence baseline, finding that their model performs better on both meaning similarity between the generated sentence and the gold reference, and readability of the generated sentence.
Lapalme (2019) also conducted a small human evaluation in which annotators chose the best output out of three options: their own system, ISI (Pourdamghani et al., 2016), and JAMR (Flanigan et al., 2016). Lapalme found that the rule-based system is on par with ISI and much better than JAMR, despite having a much lower BLEU score.
Beyond AMR generation, other NLG tasks are also often evaluated only with automatic metrics; for example, Gkatzia and Mahamood (2015) found that 38.2% of NLG papers overall, and 68% of those published in ACL venues, used automatic metrics. However, as discussed above, many studies have found that these metrics are not a reliable proxy for human judgments. One example of the use of human evaluation is the Conference on Machine Translation (WMT), which runs an annual evaluation of machine translation systems (e.g., Barrault et al., 2019).

Recent Advances in AMR Generation
Shortly after the 2017 shared task, Konstas et al. (2017) made significant advances to the field with a neural sequence-to-sequence approach, mitigating the limitations of the small amount of AMR-annotated data by augmenting training data with a jointly-trained parser.
Later work by Song et al. (2018) builds off this approach but uses a graph-to-sequence model to preserve more information from the structure of the AMR. Several recent papers have explored variations on a graph-to-sequence approach: improvements in encoding reentrancies and long-range dependencies (Damonte and Cohen, 2019), a dual graph encoder that captures both top-down and bottom-up representations of graph structure (Ribeiro et al., 2019), and a densely-connected graph convolutional network (Guo et al., 2019).
Recent sequence-to-sequence approaches include using structure-aware self-attention to capture relations between concepts within a sequence-to-sequence transformer model (Zhu et al., 2019), and generating syntactic constituency trees as an intermediate step before generating surface structure (Cao and Clark, 2019).
While neural approaches have achieved state-of-the-art BLEU scores, a few recent works have instead approached AMR generation through more rule-based methods. Manning (2019) constrains their system with rules, supplemented by simple statistical models, to avoid certain types of errors, such as hallucinations, that are possible in neural systems. Lapalme (2019) creates a fully rule-based generation system to help humans check their AMR annotations.

Methodology
We conduct a human evaluation of several AMR generation systems. §3.1 discusses the general survey design, §3.2 discusses details of the pilot survey, which validates the methodology by applying it to data from the SemEval evaluation, and §3.3 discusses the evaluation of more recent systems.

Survey Design
Figures 1 and 2 show examples of the survey interface for one sentence.
Scalar Scores: The SemEval-2017 evaluation of AMR generation elicited judgments in the form of relative rankings of output from three systems at a time (May and Priyadarshi, 2017). However, recent work in evaluation of machine translation (Bojar et al., 2016) has found that direct assessment is a preferable method to collect judgments, partly because it evaluates absolute quality of translations. We use a similar direct assessment method, providing annotators with a slider which represents scores from 0 to 100, although annotators are not shown numbers. Unlike recent WMT evaluations, we collect separate scalar scores for fluency and adequacy. This has been common practice in many evaluations of NLG and MT; for example, Gatt and Belz (2010) also use separate direct assessment sliders for these two dimensions for NLG.
Referenceless Design: Many human evaluations of NLG and MT, including the SemEval evaluation for AMR, provide a reference for the annotator to compare to the system output. However, since AMR is underspecified with respect to many aspects of phrasing including tense, number, word order, and definiteness, comparison to a single reference risks biasing annotators toward the specific phrasing used in the reference. Thus, each survey given to annotators consists of two sections: in the first half, annotators judged fluency, and saw only the output sentences; in the second, they judged the same sentences on adequacy, and were shown the AMR from which the sentence was generated, allowing them to compare the meanings. This design required that our annotators be familiar with the AMR scheme to identify mismatches in the concepts and relations expressed in the sentences.
Adequacy Error Types: In addition to numeric scores, under each adequacy slider are three checkboxes where annotators can indicate whether certain types of adequacy errors apply:
• That they cannot understand the meaning of the utterance (i.e., it is disfluent enough to be incomprehensible, making it difficult to meaningfully judge adequacy)
• That information in the AMR is missing from the utterance
• That information not present in the AMR is added in the utterance
These options allow for a more nuanced analysis of the types of mistakes made by different systems than numerical scores alone would provide.
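One adequacy judgment therefore combines a slider value with three independent checkboxes. The record below is a minimal Python sketch of such a judgment; the class name, field names, and identifiers are hypothetical and chosen only for illustration, since the survey's actual storage format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdequacyJudgment:
    """One annotator's adequacy judgment of one output (hypothetical schema)."""
    item_id: str                        # hypothetical id of the judged utterance
    annotator: str                      # hypothetical annotator id
    score: int                          # slider position on the 0-100 scale
    incomprehensible: bool = False      # checkbox 1: meaning cannot be understood
    missing_information: bool = False   # checkbox 2: AMR content is missing
    added_information: bool = False     # checkbox 3: content not in the AMR is added

    def __post_init__(self):
        # The slider only represents scores from 0 to 100.
        if not 0 <= self.score <= 100:
            raise ValueError(f"score must be in [0, 100], got {self.score}")

    def error_types(self):
        """Names of the checked error types, in checkbox order."""
        flags = [
            ("incomprehensible", self.incomprehensible),
            ("missing_information", self.missing_information),
            ("added_information", self.added_information),
        ]
        return [name for name, checked in flags if checked]
```

Keeping the three checkboxes as separate booleans reflects that they are not mutually exclusive: a single utterance can both omit and add information.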
Survey Structure: Instructions for judging fluency are provided at the beginning of the survey, and instructions for adequacy are shown before the start of the adequacy portion. For fluency, annotators are asked to "indicate how well each one represents fluent English, like you might expect a person who is a native speaker of English to use," and told that "some of these may be sentence fragments rather than complete sentences, but can still be considered fluent utterances." For adequacy, they are instructed to "determine how accurately the sentence expresses the meaning in the AMR." The full text of these instructions, which also includes examples, is provided in the supplementary material.
Each page of the survey includes each system's output for a given sentence, presented in a random order. The reference is also included as a sentence to judge, but is not distinguished from the system outputs.
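The page layout described above can be sketched as follows: pool the system outputs with the reference, shuffle, and show annotators only the text. This is an illustrative Python sketch under assumptions; the function name `build_page` and the example strings are hypothetical, and the actual survey software is not specified in this paper.

```python
import random


def build_page(system_outputs, reference, rng):
    """Return (source, text) pairs for one page in random order.

    `system_outputs` maps a system name to its output for one AMR. The
    reference is added as just another item; only the text would be shown
    to annotators, so the reference is not distinguishable by position
    or label.
    """
    items = list(system_outputs.items())
    items.append(("reference", reference))
    rng.shuffle(items)  # in-place shuffle using the supplied generator
    return items


# Hypothetical usage with placeholder strings (not real system output).
page = build_page(
    {"konstas": "output A", "zhu": "output B", "ribeiro": "output C",
     "guo": "output D", "manning": "output E"},
    reference="reference sentence",
    rng=random.Random(0),  # seeded only to make the sketch reproducible
)
shown_to_annotator = [text for _, text in page]
```

With five systems plus the reference, each page carries six utterances to judge.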

Pilot Evaluation
Before collecting the full dataset of human judgments for AMR generation, we completed a smaller pilot experiment to test the validity and practicality of the methodology. This pilot used the data and systems included in the SemEval-2017 shared task (May and Priyadarshi, 2017). A random subset of 25 out of the 1293 sentences in the dataset was used. All were annotated by three annotators, each of whom was a linguist with experience with AMR.
We adjusted the design of the later survey based on feedback from the pilot annotators. In particular, the surveys were shortened (annotators completed two batches of 10 sentences each, instead of one with 25); more thorough instructions were given, with examples; and wording was changed from "sentence" to "utterance" to reflect that some are not full sentences in a grammatical sense.

Main Evaluation
The main evaluation was larger in scope than the pilot, and evaluated more recent systems, most of which are of a markedly higher quality than those in the pilot. We contacted the authors of several recent papers on AMR-to-English generation to obtain their system's output for use in the evaluation, and included all five systems for which we were able to obtain usable data in time to begin our evaluation: Konstas et al. (2017), Guo et al. (2019), Manning (2019), Ribeiro et al. (2019), and Zhu et al. (2019). These systems are described in §2.2.

Data: The LDC2015E86 and LDC2017T10 AMR test sets contain the same sentences, with some updates to the AMRs. Because some of the system output we obtained was generated from the 2015 AMRs and some from the 2017, we decided to only include AMRs at the intersection of these datasets in our evaluation. Additionally, we chose to exclude AMRs whose root relation was multisentence, which indicates that the portion of text officially segmented as one sentence includes what AMR annotators analyzed as two or more sentences. These were excluded because they are often very long and pilot annotators found they could be very difficult to read and evaluate, and because unlike other AMR relations, multisentence does not represent a semantic relationship between elements of meaning.
A total of 335 sentences were excluded from consideration due to differences in their AMRs between the different versions of the data, and 71 for being multi-sentence. Accounting for overlap between the excluded sets, 998 out of 1371 total sentences in the test set were considered eligible for our evaluation. A random sample of 100 of these were used in the survey.
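The counts above also determine how many sentences fall into both excluded sets. The following inclusion-exclusion check is a small Python sketch; the variable names are illustrative, and the overlap is a value derived here from the reported counts rather than one stated in the text.

```python
total_sentences = 1371   # test set size reported above
eligible = 998           # sentences considered eligible
changed_amr = 335        # excluded: AMR differs between data versions
multi_sentence = 71      # excluded: root relation is multisentence

# Sentences excluded for at least one reason.
excluded = total_sentences - eligible

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|, solved for the overlap.
overlap = changed_amr + multi_sentence - excluded

print(excluded, overlap)
```

Under these figures, 373 sentences were excluded in total, so 33 of them must have been both multi-sentence and changed between versions.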
Annotation: A total of nine annotators participated in this evaluation, including the three who participated in the pilot. All had prior training in AMR annotation, mostly from taking a semester-long course focused on AMR and other meaning representations. Each annotated two different batches of 10 sentences each, except for one annotator who did four batches. The result was that each set of sentences was double-annotated, allowing us to quantify inter-annotator agreement. Additionally, batches were assigned such that each annotator overlapped with at least two other annotators.

Survey Reliability
Pilot: The only previous human evaluation of several AMR-to-English generation systems was in the SemEval-2017 task discussed above. Since our survey had several differences from this previous evaluation, it was possible that the methodological differences could lead to substantial differences in judgments on the same data. Thus, before conducting the main survey, we validated our methodology by comparing the results of the pilot survey to those of the SemEval-2017 evaluation.
Ours is the first evaluation of AMR generation to collect separate judgments for fluency and adequacy. We hypothesized that this would provide a finer-grained characterization of system behavior, and that annotators would be able to distinguish these two scales, though they are related (incomprehensible sentences necessarily have low fluency as well as adequacy, while references and high-quality output have near-perfect fluency and adequacy).
Indeed, we find a Spearman's rank correlation of 0.68 between fluency and adequacy ratings in the pilot, indicating that while they are related, annotators were largely able to evaluate these two dimensions separately.
The average fluency scores from our evaluation match the ranking of systems found in May and Priyadarshi (2017). Average adequacy scores are the same except that ISI performs slightly higher than FORGe. This suggests that our methodology is reliable for ranking systems, and that separating judgments for fluency and adequacy allows for a more nuanced view of relative system performance than overall quality judgments.
Finally, we calculate inter-annotator agreement (IAA) to measure how consistently annotators could make these judgments. We measure IAA for the numeric fluency and adequacy scores with Spearman's correlation, and for each adequacy error type with Cohen's Kappa.
We find an average pairwise IAA of 0.78 for fluency and 0.67 for adequacy. For error types, we get lower agreement: average pairwise Kappa scores are 0.44 for incomprehensibility, 0.53 for missing information, and 0.28 for added information. This indicates that guidelines on when to annotate these error types were not made clear enough for annotators to apply them consistently; future studies using this methodology should clarify these guidelines for more reliable results.
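For readers unfamiliar with the two agreement statistics, the sketch below computes Spearman's correlation (as Pearson's correlation over tie-averaged ranks) and Cohen's Kappa for one pair of annotators in plain Python. It is illustrative only: the function names and the toy inputs are assumptions, the paper does not state which implementation was used, and library routines such as `scipy.stats.spearmanr` and `sklearn.metrics.cohen_kappa_score` would normally be preferred.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def cohen_kappa(a, b):
    """Cohen's Kappa for two annotators' categorical labels on the same items.

    Undefined (division by zero) when expected chance agreement is 1.
    """
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    expected = sum((a.count(lab) / n) * (b.count(lab) / n)
                   for lab in set(a) | set(b))
    return (observed - expected) / (1 - expected)


# Toy inputs (not data from this study).
rho = spearman([10, 20, 30, 40], [5, 15, 90, 60])
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Spearman's correlation depends only on rank order, so it is unaffected by annotators using different parts of the 0-100 slider, while Kappa corrects raw checkbox agreement for the agreement expected by chance.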
Main Survey: On this survey we find an overall Spearman's correlation of 0.58 between fluency and adequacy, indicating that annotators were able to evaluate these two dimensions separately. This correlation is lower than in the pilot, which may be due to clearer instructions given to annotators on what is meant by "fluency" and "adequacy", or because the two dimensions are easier to separate when fewer sentences are of very low quality.
Since each set of 10 AMRs (or 60 judgments of each type per annotator) was double-annotated by a different pair of annotators, we evaluated IAA separately for each pair. Agreement scores vary considerably, but indicate moderate agreement overall.
Results are shown in table 1. We find that IAA for fluency is moderate to high for most annotator pairs, with two exceptions where agreement is low. IAA is higher for adequacy than for fluency in 8 out of 10 cases, and reflects at least moderate agreement in all cases.
For adequacy error types, IAA scores vary greatly and many are low. This indicates that guidelines given to annotators may not have been clear enough. For example, it was expected that annotators would infer, based on their knowledge that AMR does not specify tense, that sentences should not be considered wrong for having any particular tense; however, we learned after the evaluation that at least one annotator marked some cases of non-present tense in sentences as added information.
Figure 3 gives each annotator's distribution of ratings, showing that different individuals chose to distribute their judgments over the available 0-100 scale in different ways. Since each annotator judged each system the same number of times, this is not a problem for our comparison of systems. However, when identifying low-scoring sentences (§4.4 and §4.5), we normalize by annotator to account for these differences.
Table 3: Of 100 sentences, number with low fluency or adequacy (bottom 1/3 of both annotators' scores).

Quality of Systems
Table 2 shows the average score given for each system for fluency and adequacy, as well as how often each was marked as having each adequacy error type. We find that on both fluency and adequacy scores, Konstas performs best, followed by Zhu, and Manning performs the worst. Guo and Ribeiro are in between and within 5 points of each other on each measure, with Ribeiro performing better on fluency and Guo on adequacy. Unsurprisingly, the lower a system's average fluency score, the more often sentences were marked as incomprehensible.
The Missing Information and Added Information labels support the suggestion of Manning (2019) that although their system performs worse than others by most measures, its constraints make it less likely than machine-learning-based systems to omit or hallucinate information. Konstas's system performs the next-best by both of these measures; in particular, it rarely adds information not present in the AMR. Ribeiro's system is most prone to errors of these types, omitting information in nearly half of sentences and hallucinating it in nearly a third. Overall, the results from these questions indicate that neural AMR generation systems are prone to omit or hallucinate concepts from the AMR with concerning frequency.
Figures 4a and 4b show the distributions of scores each system received for fluency and adequacy, respectively. These show that Konstas's scores are skewed toward very high values, and that Manning's skew toward low scores, especially for fluency.

Comparison to Automatic Metrics
To investigate how well automatic metrics align with human judgments of the relative quality of these systems, we compute BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), CHRF++ (Popović, 2017), and BERTScore (Zhang et al., 2020) for each system. Results are shown in table 4; the relationship of each system's average fluency and adequacy scores to its BLEU score and BERTScore is also visualized in figure 5.
Table 4: Each system's scores on automatic metrics for the full dataset of 1371 sentences (All) and the subset of 100 sentences used in the human evaluation (Sub.).
All these metrics at least agree with humans that the Konstas and Zhu systems are the best, followed by Ribeiro and Guo, and that Manning is the worst.
Within the top two, humans found Konstas substantially better than Zhu. When using the full data, all automatic metrics agree that Konstas is best, although for all but CHRF++ this is by a small margin. When evaluated only on sentences used in the human evaluation, only METEOR, CHRF++, and BERTScore preserve this ranking; BLEU finds the two essentially tied, while TER finds Zhu slightly better.
For the middle two, humans preferred Ribeiro on fluency but preferred Guo on adequacy. On the full dataset, all the metrics capture that these systems are of very similar overall quality, varying only by a fraction of a point. On the subset of sentences, all metrics except BERTScore prefer Ribeiro, suggesting that these metrics may align more with human judgments of fluency than of adequacy.
Overall, these results show that these metrics essentially capture human rankings of these systems on this dataset, although further research would be needed to more robustly confirm the validity of these metrics for the task.
The results also highlight the limitations of metrics that produce only single scores. While these metrics can only capture that the Ribeiro and Guo systems are similar, our human evaluation found more nuance by identifying criteria on which each one outperforms the other.
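The difference between the two human rankings can be made concrete by counting system pairs whose relative order differs. The Python sketch below uses the fluency and adequacy orderings stated in the text (Konstas, then Zhu, then Ribeiro and Guo in swapped order, then Manning); using Kendall's tau to summarize the agreement is a choice made here for illustration and is not a statistic reported in this paper.

```python
from itertools import combinations

# System orderings from best to worst, as described in the text.
FLUENCY_ORDER = ["konstas", "zhu", "ribeiro", "guo", "manning"]
ADEQUACY_ORDER = ["konstas", "zhu", "guo", "ribeiro", "manning"]


def discordant_pairs(order_a, order_b):
    """System pairs that the two rankings place in opposite relative order."""
    pos_a = {name: i for i, name in enumerate(order_a)}
    pos_b = {name: i for i, name in enumerate(order_b)}
    return [
        (x, y)
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    ]


def kendall_tau(order_a, order_b):
    """Kendall's tau for two tie-free rankings of the same items."""
    n_pairs = len(order_a) * (len(order_a) - 1) // 2
    n_discordant = len(discordant_pairs(order_a, order_b))
    return (n_pairs - 2 * n_discordant) / n_pairs


swapped = discordant_pairs(FLUENCY_ORDER, ADEQUACY_ORDER)
tau = kendall_tau(FLUENCY_ORDER, ADEQUACY_ORDER)
```

Only the Ribeiro/Guo pair is discordant, giving a tau of 0.8 between the two human rankings. A single-score metric yields one ordering and so cannot agree with both.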

Analysis of Adequacy Errors
To examine what factors contributed to particularly low adequacy scores, we identify sentences for which both annotators gave low ratings. Because, as shown in figure 3, individual annotators differed in the distribution of ratings they used, we normalized by annotator: a sentence is counted as low-adequacy if each annotator gave it a rating in the lower 1/3 of their total adequacy ratings. The number of low-scoring sentences by system is given in table 3.
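The per-annotator normalization described above can be sketched as follows. This is a minimal Python illustration under assumptions: the function names, the nested-dictionary layout, and the toy ratings are hypothetical, and treating a score tied with the cutoff as "low" is a choice made here that the text does not specify.

```python
def low_cutoff(scores):
    """Largest score within the lowest third of one annotator's ratings."""
    ordered = sorted(scores)
    k = len(ordered) // 3  # size of the lowest third
    if k == 0:
        raise ValueError("need at least three ratings per annotator")
    return ordered[k - 1]


def low_items(ratings):
    """Items rated at or below their own cutoff by every annotator who saw them.

    `ratings` maps annotator -> {item_id: score}. Each annotator's cutoff is
    computed from that annotator's ratings only, so annotators who use the
    0-100 scale differently are compared on an equal footing.
    """
    cutoffs = {ann: low_cutoff(list(by_item.values()))
               for ann, by_item in ratings.items()}
    items = set().union(*(by_item.keys() for by_item in ratings.values()))
    low = set()
    for item in items:
        raters = [ann for ann, by_item in ratings.items() if item in by_item]
        if all(ratings[ann][item] <= cutoffs[ann] for ann in raters):
            low.add(item)
    return low


# Toy ratings (not data from this study): annotator B scores higher overall.
toy = {
    "A": {"s1": 10, "s2": 50, "s3": 90, "s4": 20, "s5": 70, "s6": 95},
    "B": {"s1": 40, "s2": 45, "s3": 80, "s4": 85, "s5": 90, "s6": 99},
}
low = low_items(toy)
```

In the toy data, annotator A's lowest third is {s1, s4} and annotator B's is {s1, s2}, so only s1 counts as low for both, even though B's score of 40 would not look low on A's scale.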
All 139 low-adequacy sentences were marked as having at least one adequacy error by at least one annotator. Of these, 46 (33%) were tagged by both annotators as incomprehensible, 51 (37%) as missing information, and 25 (18%) as adding information.
Added information is perhaps the most troubling form of error; AMR generation systems will have severely limited potential for use in practical applications as long as they hallucinate meaning. In one example, a reference to prostitution is inserted: where the reference sentence mentions a laboratory "storing anthrax, plague and other deadly bacteria", the RIBEIRO output reads "anthrax , prostitution , and and other killing bacterium".
As seen above in table 2, Manning omits and adds information substantially less often than the other systems, but produces incomprehensible sentences far more often. Thus, it is unsurprising that most (73%) of its low-adequacy sentences are also low-fluency. For Guo, too, a majority (54%) of low-adequacy sentences are low-fluency, though this is largely due to anonymization and repetition of words, as discussed below.

Analysis of Fluency Errors
Using the same procedure described above for low adequacy, we also identify sentences for which both annotators gave low fluency ratings. Counts for each system are given in table 3. As expected, no reference sentences are low-fluency.
Of the 116 low-fluency sentences, 50 (43%) are also marked as incomprehensible by both annotators. The other error types are, unsurprisingly, less related to low fluency than to low adequacy: 23 (20%) of low-fluency sentences are missing information, and only 6 (5%) have added information.
Over half of all low-fluency sentences are from Manning's rule-based system. This is largely because in many cases the system's rules do not allow for the generation of function words that would be expected in a fluent version of the sentence, while the neural systems are more likely to include such words in similar ways to the training data. For example, for one AMR, the system gave the disfluent output 'Thank you read .' while others produced variants of 'thank you for reading .' or 'thanks for reading .'
For the neural systems, common sources of low fluency scores included anonymization and repetition of words. Anonymization was a problem primarily for Guo; 9 of Guo's 21 low-fluency sentences contain the token <unk> in place of lower-frequency words. For example, for the AMR in §1, 'annexation' is lost:
GUO: georgia labels russia 's support for the <unk> act .
While Konstas uses anonymization less frequently, 2 of the system's 5 low-fluency sentences contain anonymized location names or quantities.
Guo, Ribeiro, and Konstas all have several low-fluency sentences with unhumanlike repetition of words or phrases, for example:
RIBEIRO: and i happen to like a large lot of a lot .

Conclusion and Future Work
Our analysis of these systems, and especially of their common errors, points toward directions for researchers developing NLG systems, especially for AMR, to improve their output. We recommend attempting to find solutions to the common issues that led to low scores even from state-of-the-art systems, such as anonymization of infrequent concepts, unnecessary repetition of words, and hallucination.
While this study found that popular automatic metrics were mostly successful in ranking these systems in the same order human annotators did, we also found that the human evaluation was able to identify strengths and weaknesses of systems with more nuance than a single number can convey. We also acknowledge that, given prior work pointing to the inadequacy of metrics such as BLEU for NLG and AMR generation, more research is needed to determine the reliability of these metrics for comparing systems. We suggest that researchers in AMR generation and other NLG tasks continue to supplement automatic metrics with human evaluation as much as possible.

Figure 1: Screenshot from the fluency section of the survey.

Figure 2: Screenshot from the adequacy section of the survey.

Figure 3: Violin plots of ranges of human judgments for each annotator.

Figure 4: Violin plots of human judgments for each system. (a) Fluency by system. (b) Adequacy by system.

Figure 5: Comparison of BLEU and BERT scores to human judgments. (Panel: Comparison of BERT scores to average adequacy scores.)
REF: A high-security Russian laboratory complex storing anthrax, plague and other deadly bacteria faces loosing electricity for lack of payment to the mosenergo electric utility.
RIBEIRO: the russian laboratory complex as a high -security complex will be faced with anthrax , prostitution , and and other killing bacterium losing electricity as it is lack of paying for mosenergo .
system gave the disfluent output 'Thank you read .' while others produced variants of 'thank you for reading .' or 'thanks for reading .'
l3 / large))))))
RIBEIRO: and i happen to like a large lot of a lot .

Table 1: Inter-annotator agreement scores for each annotator pair, with averages in the final row. For numeric ratings of Fluency (F) and Adequacy (A), we use Spearman's Rho; for binary categorical ratings of Incomprehensibility (INC), Missing Information (MI), and Added Information (AI), we use Cohen's Kappa.

Table 2: For each system, average fluency and adequacy scores and percentage where each adequacy error type was selected. Scores for the reference sentences are included for comparison.