The Second Multilingual Surface Realisation Shared Task (SR’19): Overview and Evaluation Results

We report results from the SR’19 Shared Task, the second edition of a multilingual surface realisation task organised as part of the EMNLP’19 Workshop on Multilingual Surface Realisation. As in SR’18, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where additionally, functional words and morphological information were removed. The shallow track was offered in eleven, and the deep track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity. This report presents the evaluation results, along with descriptions of the SR’19 tracks, data and evaluation methods. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.


Introduction and Task Overview
Following the success of the First Multilingual Surface Realisation Shared Task in 2018 (SR'18), which had the goal of stimulating the exploration of advanced neural models for multilingual sentence generation from Universal Dependency (UD) structures (http://universaldependencies.org/), the second edition of the task (SR'19) aims to build on last year's results and achieve further progress. While Natural Language Generation (NLG) has been gaining increasing attention from NLP researchers, it continues to be a smaller field than, e.g., parsing, text classification or sentiment analysis. Universal dependencies are also enjoying increasing attention: the number of UD treebanks is continuously growing, as is their size (and thus the volume of available training material).

The SR tasks require participating systems to generate sentences from structures at the level of abstraction of outputs produced by state-of-the-art parsing. In order to promote linkage with parsing and earlier stages of generation, participants are encouraged to explore the extent to which neural network parsing algorithms can be reversed for generation. As was the case with its predecessor tasks SR'11 and SR'18 (Mille et al., 2018), SR'19 comprises two tracks distinguished by the level of specificity of the inputs:

Shallow Track (T1): This track starts from UD structures in which most of the word order information has been removed and tokens have been lemmatised. In other words, it starts from unordered dependency trees with lemmatised nodes that hold PoS tags and morphological information as found in the original treebank annotations. The task in this track therefore amounts to determining the word order and inflecting words.
Deep Track (T2): This track starts from UD structures from which functional words (in particular, auxiliaries, functional prepositions and conjunctions) and surface-oriented morphological and syntactic information have additionally been removed. The task in the Deep Track thus also involves reintroduction of functional words and morphological features, in addition to what is required for the Shallow Track.
The training and development data for both tracks and the evaluation scripts were released on April 5th 2019, the test data on August 3rd 2019, and the outputs were collected two weeks later on August 19th; the teams thus had up to four months to develop their systems. 3 Compared to SR'18, SR'19 has a broader variety of languages and hence even more emphasis on multilinguality, with 11 languages from 9 language families 4 : Arabic (Afro-Asiatic), Chinese (Sino-Tibetan), English (Germanic), French, Portuguese and Spanish (Italic), Hindi (Indo-Iranian), Indonesian (Austronesian), Japanese (Japonic), Korean (Koreanic) and Russian (Balto-Slavic). This reflects a trend in NLP towards taking into account increasing numbers of languages for the validation of developed models; see e.g. SIGMORPHON 2019, which addressed crosslingual inflection generation in 100 language pairs. 5

In the remainder of this paper, we describe the Shallow and Deep Track data (Section 2) and the evaluation methods we used to evaluate submitted systems (Sections 3.1 and 3.2). We then introduce the participating systems briefly (Section 4), report and discuss evaluation results (Section 5), and conclude with some discussion and a look to the future (Section 6).

Data 2.1 Overview of datasets and additional resources
In order to create the SR'19 training, development and test sets, we used as data sources 20 UD treebanks 6 for which annotations of reasonable quality were available, providing PoS tags and morphologically relevant markup (number, tense, verbal finiteness, etc.). Unlike in SR'18, several treebanks were available for some languages, enabling us to use out-of-domain as well as silver-standard datasets as additional test data (for details see Section 2.3). Table 1 gives an overview of the variety and sizes of the datasets. Teams were allowed to build models trained on any SR'19 dataset(s) of their choice, but not on external task-specific data. Other resources were, however, permissible. For example, available parsers such as UUParser (Smith et al., 2018) could be run to create silver-standard versions of the provided datasets for use as additional or alternative training material, and publicly available off-the-shelf language models such as GPT-2 (Radford et al., 2019), ELMo (Peters et al., 2018), polyglot (Al-Rfou et al., 2013) or BERT (Devlin et al., 2018) could be fine-tuned with publicly available datasets such as WikiText (Merity et al., 2016) or the DeepMind Q&A Dataset (Hermann et al., 2015).

Datasets were created for 11 languages in the Shallow Track, and for three of those languages, namely English, French and Spanish, in the Deep Track. As in 2018, Shallow Track inputs were generated with the aid of Python scripts from the original UD structures, this time using all available input sentences. Deep Track inputs were then generated by automatically processing the Shallow Track structures using a series of graph-transduction grammars covering steps 5-11 in Section 2.2 below.

3 In the case of one team, we agreed to move the two-week window between test data release and submission to one week earlier.
4 At SR'18, there were ten languages from five families.
5 https://www.aclweb.org/portal/content/sigmorphon-shared-task-2019
6 universaldependencies.org
In the training data, there is a node-to-node correspondence between the deep and shallow input structures, and they are both aligned with the original UD structures. We used only information found in the UD syntactic structures to create the deep inputs, and tried to keep their structure simple. Moreover, words were not disambiguated, full prepositions may be missing, and some argument relations may be underspecified or missing.
Structures for both the Shallow and Deep Tracks are trees, and are released in a slightly modified CoNLL-U format comprising ten columns. Figure 1 shows a sample original UD annotation for English; the corresponding shallow and deep input structures derived from it are shown in Figures 2 and 3, respectively (the last two columns are empty for the task).

Task data creation
To create the data for the Shallow Track, the original UD data was processed as follows:

1. Word order information was removed by randomised scrambling, but in the training data, the alignment with the original position of each word in the sentence was maintained via a feature in the FEATS column;

Table 1 (fragment): dataset, track(s), and train/dev/test sentence counts:

  ... (ar)             T1      6,075     909    680
  chinese gsd (zh)     T1      3,997     500    500
  english ewt (en)     T1, T2  12,543  2,002  2,077
  english gum (en)     T1, T2   2,914    707    778
  english lines (en)   T1, T2   2,738    912    914
  english partut (en)  T1, T2   1,781    156    153
  french gsd (fr)      T1, T2   ...

For the Deep Track, the following steps were additionally carried out:

5. Edge labels were generalised into predicate/argument labels, in the PropBank/NomBank (Palmer et al., 2005; Meyers et al., 2004) fashion. That is, the syntactic relations were mapped to core (A1, A2, etc.) and non-core (AM) labels, applying the following rules: (i) the first argument is always labeled A1 (i.e. there is no external argument A0); (ii) in order to maintain the tree structure and account for some cases of shared arguments, there can be inverted argument relations; (iii) all modifier edges are assigned the same generic label AM; (iv) there is a coordinating relation. See also the inventory of relations in Table 2.

6. Functional prepositions and conjunctions in argument position (i.e. prepositions and conjunctions that can be inferred from other lexical units or from the syntactic structure) were removed (e.g. 'about' and 'that' in Figure 2); prepositions and conjunctions retained in the deep representation can be found under an A2INV dependency; a dependency path Gov -AM→ Dep -A2INV→ Prep is equivalent to a predicate (the conjunction/preposition) with two arguments: Gov ←A1- Prep -A2→ Dep.
7. Definite and indefinite determiners, auxiliaries and modals were converted into attribute/value pairs, as were definiteness features and the universal aspect and mood features; see examples in Figure 3.
8. Subject and object relative pronouns directly linked to the main relative verb were removed (instead, the verb was linked to the antecedent of the pronoun); a dummy pronoun node for the subject was added if an originally finite verb had no first argument and no available argument to build a passive; for a pro-drop language such as Spanish, a dummy pronoun was added if the first argument was missing.
9. Surface-level morphologically relevant information as prescribed by syntactic structure or agreement (such as verbal finiteness or verbal number) was removed, whereas semantic-level information such as nominal number and verbal tense was retained.
10. Fine-grained PoS labels found in some treebanks (see e.g. column 5 in Figure 2) were removed, and only coarse-grained ones were retained (column 4 in Figures 2 and 3).
11. In the training data, the alignments with the tokens of the Shallow Track structures were added in the FEATS column.

Additional test data
For additional test data, we used automatically produced UD parses, which we then processed in the same way as the gold-standard structures, using the best parsers from the CoNLL'18 shared task on the dataset in question. 10 We used the UD2.3 version of the dataset, whereas CoNLL'18 used UD2.2; we selected treebanks that had not undergone major updates from one version to the next according to their README files on the UD site, and for which the best available parse reached a Labeled Attachment Score of 85 and over. 11 There were datasets meeting these criteria for English (2), Hindi, Korean, Portuguese and Spanish; the Harbin HIT-SCIR parser (Che et al., 2017) had best scores on four of these datasets; LATTICE (Lim et al., 2018) and Stanford (Qi et al., 2019) had the best scores for the remaining two; 12 see Table 3 for an overview.
As is the case for all test data, alignments with surface tokens and with Shallow Track tokens are not provided in the additional automatically parsed test data; however, in the cases described in step 4 above, the relative order is provided.

Data formats for evaluations
11 The best score on the English-EWT dataset is slightly below this threshold (84.57), but the dataset was selected anyway because English was expected to be the language most addressed by the participants.
12 The CoNLL'18 shared task submissions were downloaded from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2885.

Unlike in SR'18, where only detokenised outputs were used, the SR'19 teams were asked to provide tokenised (for automatic evaluations) as well as detokenised (for human evaluations) outputs; if no detokenised outputs were provided, the tokenised files were also used for the human evaluation. The reason for using tokenised outputs for automatic evaluation is the inclusion of languages like Chinese and Japanese, where sentences are sequences of characters with no white-space separators. Two of the metrics used in the automatic evaluations, BLEU and NIST, compute scores based on matching token n-grams; if there is no whitespace, the whole sentence counts as a single token. As a result, a single different character in a sentence would prevent a match with the reference sentence, and a null score would be assigned to the whole sentence. The following example shows a Spanish sentence in its tokenised and detokenised forms:

• Tokenised sample (Spanish): All tokens are preceded by a white space. Elías Jaua , miembro del Congresillo , considera que los nuevos miembros del CNE deben tener experiencia para " dirigir procesos complejos " .
• Detokenised sample (Spanish): White spaces before or after some punctuation signs are removed.
In the original UD files, the reference sentences are by default detokenised. In order to evaluate the tokenised outputs, we built a tokenised version of the reference sentences by joining the words in the second column of the UD structures (see Figure 1) with a single whitespace.
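The construction of the tokenised references is straightforward; a minimal sketch (with a hand-rolled CoNLL-U reader rather than the organisers' actual script, and a hypothetical function name) might look as follows:

```python
# Sketch: build a tokenised reference by joining the FORM column
# (column 2) of each token line with a single whitespace, as described
# above. The function name and the minimal parsing are illustrative.
def tokenised_reference(conllu_sentence: str) -> str:
    forms = []
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):      # skip sentence-level comments
            continue
        cols = line.split("\t")
        if cols[0].isdigit():         # skip multiword-token ranges like "3-4"
            forms.append(cols[1])     # column 2: FORM
    return " ".join(forms)
```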

Automatic methods
We used BLEU, NIST, and inverse normalised character-based string-edit distance (referred to as DIST, for short, below) to assess submitted systems. BLEU (Papineni et al., 2002) is a precision metric that computes the geometric mean of the n-gram precisions between generated text and reference texts, and adds a brevity penalty for shorter sentences. We use the smoothed version and report results for n = 4. NIST (http://www.itl.nist.gov/iad/mig/tests/mt/doc/ngram-study.pdf; http://www.itl.nist.gov/iad/mig/tests/mt/2009/) is a related n-gram similarity metric weighted in favor of less frequent n-grams, which are taken to be more informative.
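To make the metric descriptions concrete, here is an illustrative sentence-level smoothed BLEU; the exact smoothing variant used in the official evaluation is not specified here, so the add-one smoothing below is an assumption for illustration only:

```python
# Illustrative smoothed sentence-level BLEU (geometric mean of n-gram
# precisions up to max_n, with a brevity penalty). Add-one smoothing is
# an assumption, not necessarily the variant used by the organisers.
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())   # clipped n-gram matches
        total = max(sum(c_ngrams.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))   # add-one smoothing
    # brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec / max_n)
```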
DIST starts by computing the minimum number of character inserts, deletes and substitutions (all at cost 1) required to turn the system output into the (single) reference text. The resulting number is then divided by the number of characters in the reference text, and finally subtracted from 1, in order to align with the other metrics. Spaces and punctuation marks count as characters; output texts were otherwise normalised as for all metrics (see below).
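The DIST computation described above can be sketched directly from its definition (character-level Levenshtein distance normalised by reference length and subtracted from 1); this is an illustrative reimplementation, not the organisers' script:

```python
# DIST as described above: minimum number of character inserts, deletes
# and substitutions (all cost 1) to turn the output into the reference,
# divided by the reference length, subtracted from 1.
def dist_score(output: str, reference: str) -> float:
    m, n = len(output), len(reference)
    if n == 0:
        return 0.0
    d = list(range(n + 1))            # one DP row over the reference
    for i in range(1, m + 1):
        prev, d[0] = d[0], i          # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == reference[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + cost)     # substitution/match
    return 1 - d[n] / n
```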
The figures in the tables below are the system-level scores for BLEU and NIST, and the mean sentence-level scores for DIST.
Text normalisation: Output texts were normalised prior to computing metrics by lowercasing all tokens, removing any extraneous whitespace characters.
Missing outputs: Missing outputs were scored 0. We only report results for all sentences (incorporating the missing-output penalty), rather than also separately reporting scores for just the in-coverage items.
Important note: The SR'19 scores are not directly comparable to the SR'18 ones, since the SR'18 scores were calculated on detokenised outputs, whereas the scores presented in this report were calculated on tokenised outputs (see Section 2.4). In addition, the method for calculating the DIST score in SR'18 was different in that it did not take into account the whole sentence. 14

Human-assessed methods
For the human evaluation, we selected a subset of language/track combinations based on number of submissions received and availability of evaluators: four Shallow Track in-domain datasets (Chinese-GSD, English-EWT, Russian-SynTagRus, Spanish-AnCora), one Shallow Track dataset coming from parsed data (Spanish-AnCora HIT ) and one (in-domain) Deep Track dataset (English-EWT).
As in SR'11 and SR'18 (Mille et al., 2018), we assessed two quality criteria, Readability and Meaning Similarity, in separate evaluation experiments, and used continuous sliders as rating tools, the evidence being that raters tend to prefer them. Slider positions were mapped to values from 0 to 100 (best). Raters were first given brief instructions, including the direction to ignore formatting errors, superfluous whitespace, capitalisation issues, and poor hyphenation. The statement to be assessed in the Readability evaluation was: The text reads well and is free from grammatical errors and awkward constructions.
The corresponding statement in the Meaning Similarity evaluation, in which system outputs ('the black text') were compared to reference sentences ('the grey text'), was as follows: The meaning of the grey text is adequately expressed by the black text.
Slider design: As in SR'18, and for conformity with what has emerged as an affordable human evaluation standard over the past three years in the main machine translation shared tasks held at WMT (Bojar et al., 2017, 2018; Barrault et al., 2019), we used a slider design as follows, with the pointer starting at 0:

Mechanical Turk evaluations: As in SR'18, we ran the human evaluation on Mechanical Turk using Direct Assessment (DA) (Graham et al., 2016), the human evaluation method used at WMT campaigns to produce the official ranking of machine translation systems (Barrault et al., 2019). We ran both Meaning Similarity and Readability evaluations as separate assessments, but using the same method.
Quality assurance: System outputs are randomly assigned to HITs (following Mechanical Turk terminology) of 100 outputs, of which 20 are used solely for quality assurance (QA) (i.e. do not count towards system scores): (i) some are repeated as-is, (ii) some are repeated in a 'damaged' version and (iii) some are replaced by their corresponding reference texts. In each case, a minimum threshold has to be reached for the HIT to be accepted: for (i), scores must be similar enough, for (ii) the score for the damaged version must be worse, and for (iii) the score for the reference text must be high. For full details of how these additional texts are created and thresholds applied, please refer to Barrault et al. (2019). We report QA figures for the MTurk evaluations below.
Test data sets for human evaluations: Test set sizes out of the box varied across languages. For the human evaluation test sets, we selected for each language either the entire test set or a random subset of approximately 500 items, whichever was smaller, motivated by the power analysis provided by Graham et al. (2019).
Reported scores: In keeping with the WMT approach, we report both average raw scores and average standardised scores per system. To produce standardised scores, we map each individual evaluator's raw scores to standard scores (z-scores), computed using that evaluator's mean and standard deviation over all of their raw scores. For both raw and standard scores, we compute the mean of sentence-level scores.
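The per-evaluator standardisation can be sketched as follows (using the population standard deviation; whether the WMT code uses the population or sample variant is not specified here):

```python
# Sketch of per-evaluator standardisation: each rater's raw 0-100
# scores are mapped to z-scores using that rater's own mean and
# standard deviation. Assumes each rater gave at least two distinct scores.
from statistics import mean, pstdev

def standardise(scores_by_rater: dict) -> dict:
    z = {}
    for rater, scores in scores_by_rater.items():
        mu, sd = mean(scores), pstdev(scores)
        z[rater] = [(s - mu) / sd for s in scores]
    return z
```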
Code: We were able to reuse, with minor adaptations, the code produced for the WMT'17 evaluations. 15

Overview of Submitted Systems
ADAPT is a sequence-to-sequence model with dependency features attached to word embeddings. A BERT sentence classifier was used as a reranker to choose between different hypotheses. The implementation is very similar to ADAPT's SR'18 submission (Elder and Hokamp, 2018).
The BME-UW system (Kovács et al., 2019) learns weighted rules of an Interpreted Regular Tree Grammar (IRTG) to encode the correspondence between word sequences and UD subgraphs. For the inflection step, a standard sequence-to-sequence model with a biLSTM encoder and an LSTM decoder with attention is used.
CLaC (Farahnak et al., 2019) is a pointer network trained to find the best order of the input. A slightly modified version of the transformer model was used as the encoder and decoder for the pointer network.
The CMU (Du and Black, 2019) system uses a graph neural network for end-to-end ordering, and a character RNN for morphology.
DepDist (Dyer, 2019) uses syntactic embeddings and a graph neural network with message passing to learn the tolerances for how far a dependent tends to be from its head. These directed dependency distance tolerances form an edge-weighted directed acyclic graph (DAG) (equivalent to a partially ordered set, or poset) for each sentence, the topological sort of which generates a surface order. Inflection is addressed with regex patterns and substitutions approximating productive inflectional paradigms.
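As an illustration of the final step described for DepDist (not the system's actual code), once pairwise precedence decisions form a DAG over the tokens, Python's standard library can produce a surface order directly:

```python
# Illustrative sketch: a topological sort of a precedence DAG yields a
# surface order. The input format (`precedes`: token -> set of tokens
# it must precede) is a hypothetical simplification.
from graphlib import TopologicalSorter

def linearise(precedes: dict) -> list:
    ts = TopologicalSorter()
    for tok, after in precedes.items():
        for other in after:
            ts.add(other, tok)        # `tok` must come before `other`
    return list(ts.static_order())
```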
The DipInfoUnito realiser (Mazzei and Basile, 2019) is a supervised statistical system for surface realisation, in which two neural network-based models run in parallel on the same input structure, namely a list-wise learning to rank network for linearisation and a seq2seq network for morphology inflection prediction.
IMS (Yu et al., 2019) uses a pipeline approach for both tracks, consisting of linearisation, completion (for T2 only), inflection, and contraction. All models use the same bidirectional Tree-LSTM encoder architecture. The linearisation model orders each subtree separately with beam search and then combines them into a full projective tree; the completion model generates absent function words sequentially, given the linearised tree of content words; the inflection model predicts a sequence of edit operations to convert the lemma to the word form character by character; the contraction model predicts BIO tags to group the words to be contracted, and then generates the contracted word form of each group with a seq2seq model.
The LORIA submission (Shimorina and Gardent, 2019) presents a modular approach to surface realisation with three successive steps: word ordering, morphological inflection, and contraction generation (for some languages). For word ordering, the data is delexicalised, the input tree is linearised, and the mapping between an input tree and an output lemma sequence is learned using a factored sequence-to-sequence model. Morphological inflection makes use of a neural character-based model, which produces word forms based on lemmas coupled with morphological features; finally, a rule-based contraction generation module is applied for some languages.
The OSU-FB pipeline for generation (Upasani et al., 2019) starts by generating inflected word forms in the tree using character seq2seq models. These inflected syntactic trees are then linearised as constituent trees by converting the relations to non-terminals. The linearised constituent trees are fed to seq2seq models (including models with copy and with tree-LSTM encoders) whose outputs also contain tokens marking the tree structure. N-best outputs are obtained for orderings and the highest-confidence output sequence with a valid tree is chosen (i.e., one where the input and output trees are isomorphic up to sibling order, ensuring projectivity).
The RALI system (Lapalme, 2019) uses a symbolic approach to transform the dependency tree into a tree of constituents that is transformed into an English sentence by an existing English realiser, JSrealB (Molins and Lapalme, 2015). This realiser was then slightly modified for the two tracks.
Surfers (Hong et al., 2019) first performs delexicalisation to obtain a dictionary for proper names and numbers. A GCN is then used to encode the tree inputs, and an LSTM encoder-decoder with copy attention to generate delexicalised outputs. No part-of-speech tags, universal features or pretrained embeddings / language models are used.
The Tilburg approach (Ferreira and Krahmer, 2019), based on Ferreira et al. (2018), realises multilingual texts by first preprocessing an input dependency tree into an ordered linearised string, which is then realised using a rule-based and a statistical machine translation (SMT) model.
Baseline: In order to set a lower boundary for the automatic and human evaluations, a simple English baseline consisting of 7 lines of Python code was implemented. 16 It generates output from a UD file by an in-order traversal of the tree (read with pyconll), outputting the form of each node.
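A hypothetical re-creation of such a baseline (using a hand-rolled CoNLL-U reader instead of pyconll, and one possible reading of "in-order traversal": dependents with smaller IDs emitted before the head, larger IDs after) could look like this:

```python
# Hypothetical sketch of the trivial baseline's logic: rebuild the
# dependency tree from a CoNLL-U sentence and emit each node's FORM via
# an in-order traversal. Not the organisers' actual 7-line script.
from collections import defaultdict

def inorder_realise(conllu_sentence: str) -> str:
    children = defaultdict(list)
    form = {}
    root = None
    for line in conllu_sentence.strip().splitlines():
        cols = line.split("\t")
        idx, head = int(cols[0]), int(cols[6])   # ID and HEAD columns
        form[idx] = cols[1]                      # FORM column
        if head == 0:
            root = idx
        else:
            children[head].append(idx)

    def visit(n):
        out = []
        for c in sorted(c for c in children[n] if c < n):
            out.extend(visit(c))                 # left dependents first
        out.append(form[n])
        for c in sorted(c for c in children[n] if c > n):
            out.extend(visit(c))                 # then right dependents
        return out

    return " ".join(visit(root))
```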

Evaluation results
There were 14 submissions to the task, of which two were withdrawn; 9 teams participated in the Shallow Track only, two teams participated in both tracks, and one team in the Deep Track only. For the Shallow Track, four teams (BME, IMS, LORIA and Tilburg) generated outputs for all languages (29 datasets), four teams (ADAPT, CLaC, RALI and OSU-FB) submitted only for the English datasets, and three teams (CMU, DepDist and DipInfo-UniTo) submitted in several but not all languages. For the Deep Track, two of the three teams (IMS, Surfers) addressed all languages (13 datasets), and one team (RALI) addressed English only. IMS is the only team to have submitted results for all 42 datasets.

In the Shallow Track, 8 out of the 11 systems scored 59 BLEU and above on the English-EWT dataset, and three systems achieved a BLEU score of about 80, the highest score being obtained by IMS with 82.98. High scores were also achieved for Spanish, Hindi, Indonesian, French and Chinese (58 BLEU and above on average).
Evaluations of the out-of-domain datasets (PUD) for English and Japanese generally yielded higher scores than those of the in-domain datasets, whereas the opposite is true for Russian. This may be because of the type of language in the different datasets: for instance, the PUD data contains news and Wikipedia texts, i.e. rather cleanly written texts, while the English-EWT corpus contains customer reviews, blog and forum posts, in which a wider variety of language use can be found. Sentences such as "Fun picture websites (:?" or "in n out of the chicago area?" are expected to be generated but are more difficult to predict; for instance, the IMS outputs for these two sentences are "In a out of the chicago area?" and "(: fun picture websites?". In this case, the type of language used seems to have more impact than the fact that the domains are different. On the other hand, the Russian-SynTagRus and Russian-PUD datasets both contain mostly news texts, so the structures to generate are more similar; in this context, the impact of the change of domain becomes visible.
The results on the automatically parsed datasets are in general very similar to the results on datasets that originate from gold-standard annotations. For English-EWT HIT , all scores are slightly lower than the English-EWT scores, with no more than 2 BLEU points, 0.3 NIST points and 2.5 DIST points difference. For the English-PUD LAT , the difference is more pronounced, up to 6 BLEU points lower e.g. for BME-UW. However, for the other four datasets, most scores are higher, with improvements up to 2 BLEU points; the exceptions to this trend are IMS on the Hindi data and BME-UW on the Korean-Kaist data, for which the scores according to the three metrics are slightly below scores for gold-standard data.
For the Deep Track datasets, scores are generally substantially lower than for the Shallow Track datasets. The trends observed for the generation from automatically parsed data are confirmed, but the out-of-domain scores for English (the only language with an out of domain dataset in the Deep Track) are lower than the in-domain ones, which could be due in particular to the difficulty of generating punctuation signs.
Finally, the Lower Bound (LB) baseline system results are, as expected, very low (they are not shown in the tables): on the two datasets that are part of the human evaluation, i.e. the T1 and T2 English-EWT, it obtained 7.62 BLEU, 8.26 NIST and 37.99 DIST, and 1.31 BLEU, 4.8 NIST and 35.13 DIST, respectively.

Tables 8 and 9 show the results of the human evaluation carried out via Mechanical Turk with Direct Assessment (MTurk DA) for English, Chinese, Russian and Spanish. See Section 3.2 for details of the evaluation method. 'DA' refers to the specific way in which scores are collected in the WMT approach; this follows the evaluation approach of SR'18 but differs from what was done for SR'11.

Results of the human evaluation
English: For human evaluation of systems for both the Shallow (T1) and Deep (T2) Tracks, outputs were combined into a single dataset prior to being evaluated, and results for all systems are shown in Tables 8 and 9. Average Meaning Similarity DA scores for the Shallow Track English systems range from 86.6% to 55.3%, with ADAPT and IMS achieving the highest overall scores in terms of both average raw DA scores and corresponding z-scores. In order to investigate how the Readability of system outputs compares to that of human-produced text, we included the original test sentences as a 'system' in the Readability evaluation. Unsurprisingly, human text achieves the highest Readability score (71.1%), but it is closely followed by the best performing systems, ADAPT (68.2%) and IMS (67.9%), both statistically tied with human readability (and with one another).
In the Deep Track for English, IMS achieved the highest results in terms of Meaning Similarity (80.6%), significantly higher than all other systems participating in the Deep Track. In terms of Readability, IMS (61.9%) is tied, in terms of statistical significance, with Surfers (60.9%). 17 Finally, note that for both Meaning Similarity and Readability, as expected, the Lower Bound Baselines are tied at the last rank with significantly lower scores than the other systems.
Russian: Tables 8 and 9 show average DA scores for systems participating in the Russian task. Meaning Similarity scores for Russian systems range from 88.3% to 77.5%, with IMS again achieving the highest overall score. In terms of Readability, IMS again achieves the highest average score, 84.1%. Compared to the human results, there is a larger gap than that observed for English outputs, with the best system, IMS, still significantly lower than human performance in terms of Russian readability.

17 We tested for statistical significance of differences between average DA scores using a Wilcoxon rank sum test.
Spanish UD: Tables 8 and 9 show average DA scores for systems participating in Spanish UD. Meaning Similarity scores range from 81.1% to 63.2%, with IMS achieving the highest score, significantly higher than all other participating teams. In terms of Readability, the text produced by the systems ranges from 86.5% to 60.6%, and again IMS achieves the highest score, again significantly higher than all other systems. No system achieves human performance here either, as the human references achieve a significantly higher score than all systems in terms of readability.
Spanish Automatically Parsed ('Pred. Spanish' in the tables): Tables 8 and 9 show average DA scores for system outputs for the Spanish automatically parsed data. Meaning Similarity scores range from 82.7% to 59.2%, with IMS achieving the highest score, significantly higher than all other participating teams. IMS and CMU achieve better scores than on the regular Spanish UD dataset, while the other systems score lower.
In terms of Readability, the text produced by the systems ranges from 82.8% to 53.8%, and again IMS achieves the highest score, again significantly higher than all other systems. However, on the automatically parsed data all systems score lower than on the Spanish UD dataset: whereas the automatic metrics showed no clear difference between the two datasets, the human evaluation shows that the systems do not manage to generate texts of the same quality.
Chinese: Tables 8 and 9 show average DA scores for all participating systems. Meaning Similarity scores range from 83% to 67%, with IMS achieving the highest score, significantly higher than all other participating teams. In terms of Readability, the produced text ranges from 68.2% to 39.1%, and again IMS achieves the highest score, again significantly higher than all other systems. As for the other non-English languages, no system achieves human performance.
Results from MTurk DA quality control: Similar to SR'18, only 31% of workers passed quality control (being able to replicate scores for identical sentences and scoring damaged sentences lower), again highlighting the danger of crowd-sourcing without good quality control measures. The remaining 69%, who did not meet this criterion, were omitted from the computation of the official DA results above. Such levels of low-quality workers are consistent with what we have seen in DA used for Machine Translation (Graham et al., 2016) and Video Captioning evaluation.

Table 10 shows the Pearson correlation of BLEU, NIST and DIST scores with human assessment for systems in tasks for which we ran human evaluations this year. These were computed on the average z-scores. While BLEU is the metric that correlates best with the human judgements in general, NIST and DIST are more erratic. None of the automatic metrics correlate well with human judgements of Readability on the English Deep Track data ('English T2' in the tables), in particular NIST with only 0.15. This contrasts with the corresponding correlations for Meaning Similarity, which do not appear to be affected. Combined with the fact that human assessment scores the deep systems higher for Readability than the metrics do, this indicates that some deep systems are producing fluent text that is however dissimilar to the reference texts. The correlations for T2 should be interpreted cautiously, since only four T2 systems are being evaluated, which possibly distorts the numbers.
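The system-level correlations reported here are plain Pearson correlations between metric scores and average human z-scores; for reference, a minimal implementation:

```python
# Pearson's r between two equal-length score lists, e.g. per-system
# BLEU scores vs. per-system average human z-scores.
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```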

Conclusion
The 2019 edition of the SR task (SR'19) saw increased language coverage (11 languages from 9 language families, up from 10 languages in 5 families), as well as increased participation (33 team registrations from 17 countries, up from 21 registrations for SR'18), with 14 teams submitting systems to SR'19 (up from 8 in SR'18). Datasets, evaluation scripts, system outputs and more about the task can be found on the GenChal repository. 18 Among the notable trends we can observe in evaluations are the following: (i) the best Shallow Track English systems are closing the gap to human-written texts in terms of human evaluation of Readability; (ii) there is a notable gap between human assessment (higher) and metric assessment (lower) of deep track systems, in particular for the best deep track systems; and (iii) the correlation between BLEU and human evaluations of both Readability and Meaning Similarity is consistently above 0.9 for outputs for the gold-standard shallow track datasets, but substantially lower for deep track systems (NIST and DIST are both more erratic).
The biggest progress has been made in SR'19 for deep track systems: not only did we have multiple Deep Track systems to evaluate (compared to just one in 2018), but the best Deep Track system performed equally well or better than most Shallow Track systems for both Readability and Meaning similarity.
Another notable development has been the introduction of silver-standard data. Even though the quality of the texts obtained when generating from automatically parsed data is lower than when using gold-standard data, the high scores in the human evaluations suggest that the shallow inputs could be used as pivot representations in text-to-text applications such as paraphrasing, simplification or summarisation.
Overall, the SR tasks have clearly demonstrated that generation from structured meaning representations can be done with impressive success by current neural methods. Given the increased interest and progress we have been able to report for SR'19, we plan to continue with a third shared task in 2020, as part of which we plan to investigate ways of linking up to earlier stages of automatic language generation.