Multilingual AMR-to-Text Generation

Generating text from structured data is challenging because it requires bridging the gap between (i) structure and natural language (NL) and (ii) semantically underspecified input and fully specified NL output. Multilingual generation brings in an additional challenge: that of generating into languages with varied word order and morphological properties. In this work, we focus on Abstract Meaning Representations (AMRs) as structured input, where previous research has overwhelmingly focused on generating only into English. We leverage advances in cross-lingual embeddings, pretraining, and multilingual models to create multilingual AMR-to-text models that generate in twenty-one different languages. For eighteen languages, based on automatic metrics, our multilingual models surpass baselines that generate into a single language. We analyse the ability of our multilingual models to accurately capture morphology and word order using human evaluation, and find that native speakers judge our generations to be fluent.


Introduction
Generating text from structured data has a variety of applications in natural language processing. Tasks such as decoding from tables (Lebret et al., 2016; Sha et al., 2018), question answering from knowledge bases (Fan et al., 2019a), and generation from RDF triples (Gardent et al., 2017), knowledge graphs (Marcheggiani and Perez-Beltrachini, 2018) and linguistic meaning representations (Konstas et al., 2017) face similar challenges: interpreting structured input and writing fluent output. We focus on generating from graph structures in the form of Abstract Meaning Representations (AMR) (Banarescu et al., 2013). Previous work has largely focused on generating from AMR into English, but we propose a multilingual approach that can decode into twenty-one different languages.
Compared to multilingual translation, decoding from structured input has distinct challenges. Translation models take natural language input and must faithfully decode into natural language output. However, as shown in Zhao et al. (2020), bridging the gap between structured input and linear output is a difficult task. In addition, in structured input such as graphs, the input is usually semantically under-specified. For example, in AMRs, function words are missing and tense and number are not given. Thus, generation from structured input must bridge the gap between (i) structure and string and (ii) underspecified input and fully specified output. Multilinguality brings a third challenge: that of generating in languages that have varied morphological and word order properties.
Annotating natural language with AMR is a complex task and training datasets only exist for English, so previous work on AMR-to-text generation has overwhelmingly focused on English. We create training data for multilingual AMR-to-Text models by taking the EUROPARL multilingual corpus and automatically annotating the English data with AMRs using the jamr semantic parser. We then use the English AMRs as the input for all generation tasks. To improve quality, we leverage recent advances in natural language processing such as cross-lingual embeddings, pretraining and multilingual learning. Cross-lingual embeddings have shown striking improvements on a range of cross-lingual natural language understanding tasks (Devlin et al., 2019; Conneau et al., 2019; Wu and Dredze, 2019; Pires et al., 2019). Other work has shown that pre-training and fine-tuning approaches also help improve generation performance (Dong et al., 2019; Song et al., 2019; Lawrence et al., 2019; Rothe et al., 2019). Finally, multilingual models, where a single model generates into several languages, have been shown to benefit lower-resource languages in machine translation.

Related Work

AMR-to-Text Generation. Initial work on AMR-to-text generation adapted methods from statistical machine translation (MT) (Pourdamghani et al., 2016), grammar-based generation (Mille et al., 2017), tree-to-string transducers (Flanigan et al., 2016), and inverted semantic parsing (Lampouras and Vlachos, 2017). Neural approaches explored sequence-to-sequence models where the AMR is linearized (Konstas et al., 2017) or modeled with a graph encoder (Marcheggiani and Perez-Beltrachini, 2018; Damonte and Cohen, 2019; Ribeiro et al., 2019; Song et al., 2018; Zhu et al., 2019). As professionally-annotated AMR datasets are in English, all this work focuses on English.
One exception is the work of Sobrevilla Cabezudo et al. (2019), which uses automatic translation to translate the English text of the LDC AMR data into Brazilian Portuguese and aligns English with the Portuguese translation to create Portuguese-centric AMRs. However, this work focuses on only one language. In contrast, we consider generation into twenty-one languages.
We use very different methods and generate from English-centric AMRs, not target-language AMRs.
Multilingual MR-to-Text Generation. While work on AMR-to-Text generation has mostly focused on generation into English, the Multilingual Surface Realization shared tasks (Mille et al., 2018, 2019) have made parallel MR/Text datasets available for 11 languages. Two tracks are proposed: a shallow track, where the input is an unordered, lemmatized dependency tree, and a deep track, where the dependency tree edges are labelled with semantic rather than syntactic relations and where function words have been removed.
The participants' approaches to this multilingual generation task use gold training data and mostly focus on the shallow track, where the input is an unordered lemmatized dependency tree and the generation task reduces to linearization and morphological realization. The proposed models are pipelines that model each of these subtasks, and separate models are trained for each target language (Kovács et al., 2019; Yu et al., 2019; Shimorina and Gardent, 2019a,b; Castro Ferreira and Krahmer, 2019). In this work, we focus instead on more abstract, deeper input (AMRs) and propose end-to-end, multilingual models for all target languages.

Method
To generate from AMRs, we use neural sequence-to-sequence models that model the input AMR with a Transformer Encoder and generate natural language with a Transformer Decoder. For all languages, the input is an English-centric AMR that was derived automatically from English text using the jamr semantic parser. We pre-train both the AMR encoder and the multilingual decoder, and we leverage crosslingual embeddings.

Encoding English AMR
Abstract Meaning Representations are semantic representations that take the form of a rooted, directed acyclic graph. AMR abstracts away syntax such that sentences with similar meanings have similar AMR graphs. Full detail is not kept by the AMR: for example, elements such as verb tense are lost. While we focus on decoding from AMR input, the structured form is reflective of other structured inputs used in tasks such as generating from semantic role labels (Fan et al., 2019c) or RDF triples (Gardent et al., 2017). The AMR graph is first linearized into a sequence of tokens as shown in Figure 1 after preprocessing following Konstas et al. (2017) (see Section 4.1 for a detailed description). Rather than model the graph structure directly, following Fan et al. (2019a), we model the graph using a graph embedding. The graph embedding provides additional information to the Transformer Encoder by encoding the depth of each node in the rooted graph and the subgraph each node belongs to. Concretely, each token has a word and position embedding, and additionally an indicator of depth calculated from the root and an indicator of which subtree the node belongs to (with all subtrees stemming from the root). These additional embeddings are concatenated to the word and position embeddings. Such information allows the Transformer Encoder to capture some graph structure information, while still modeling a sequence. This is depicted in Figure 2.
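The depth and subtree indicators can be read off the parenthesized linearization in a single pass. The sketch below is an illustrative reconstruction under our own assumptions (the function name and bracket-counting scheme are not from the paper):

```python
def graph_positions(tokens):
    """For each token of a linearized AMR, compute (depth, subgraph_id):
    depth is the parenthesis nesting level, and subgraph_id indexes which
    top-level subtree (child of the root) the token falls in.
    Illustrative reconstruction, not the authors' exact implementation."""
    depth, subgraph = 0, -1
    features = []
    for tok in tokens:
        if tok == "(":
            depth += 1
            if depth == 2:  # entering a new child subtree of the root
                subgraph += 1
        features.append((depth, max(subgraph, 0)))
        if tok == ")":
            depth -= 1
    return features


# Each (depth, subgraph) pair would then be looked up in its own small
# embedding table and concatenated to the word and position embeddings.
```

In this sketch, `graph_positions("( like :arg0 ( i ) :arg1 ( eat ) )".split())` assigns depth 2 and distinct subgraph ids to the tokens under the two arguments of the root concept.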
To create a one-to-many multilingual model, we model a language embedding on the encoder side to allow the decoder to distinguish which language to generate into. This technique has been previously used in multilingual translation (Arivazhagan et al., 2019). The English AMR begins with a token that indicates the decoder-side language.
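In practice, the language choice can be signaled by prepending a tag token to the linearized AMR; the `<2xx>` tag format below is an assumption for illustration, not the exact tokens used in our implementation:

```python
def add_language_tag(amr_tokens, lang):
    """Prepend a target-language tag (e.g. '<2de>') so the decoder knows
    which language to generate into. The tag format is hypothetical."""
    return ["<2" + lang + ">"] + list(amr_tokens)
```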
To improve the quality of the encoder, we incorporate large-scale pretraining on millions of sequences of AMR by adopting the generative pretraining approach proposed in Lewis et al. (2019a). This pretraining incorporates various noise operations, such as masking (Devlin et al., 2019), span masking (Fan et al., 2019a), and shuffling. Previous work has shown that pretraining is effective for providing neural models with additional information about the structure of natural language and improving model quality (Dong et al., 2019; Song et al., 2019; Lawrence et al., 2019). As models increase in size, smaller training datasets (such as human-annotated AMR) are often not large enough to fully train these models. The entire encoder is pretrained on silver AMRs, as shown in Figure 2.
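The three noise operations can be sketched as simple token-level corruptions applied to the linearized AMR, which the model is then trained to reconstruct. This is a minimal illustration; the masking probability and span length are arbitrary placeholders, not the paper's hyperparameters:

```python
import random


def mask_tokens(tokens, p=0.15, rng=None):
    """BERT-style masking: replace roughly a fraction p of tokens with <mask>."""
    rng = rng or random.Random(0)
    return ["<mask>" if rng.random() < p else t for t in tokens]


def mask_span(tokens, span_len=3, rng=None):
    """Span masking: replace one contiguous span with a single <mask>."""
    rng = rng or random.Random(0)
    start = rng.randrange(max(1, len(tokens) - span_len))
    return tokens[:start] + ["<mask>"] + tokens[start + span_len:]


def shuffle_tokens(tokens, rng=None):
    """Shuffling: permute the token order; the model must restore it."""
    rng = rng or random.Random(0)
    out = list(tokens)
    rng.shuffle(out)
    return out
```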

Multilingual Decoding from AMR
The Transformer Decoder attends to the encoded English AMR, a graph of concepts and relations, and generates text into many different languages with varied word order and morphology.
As displayed in Figure 2, we use both language model pretraining and crosslingual embeddings to improve decoder quality. Monolingual data from various languages is used to pretrain each language model. Further, we incorporate crosslingual embeddings. These embeddings aim to learn universal representations that encode sentences into shared embedding spaces. Recent work on crosslingual embeddings (Conneau and Lample, 2019) shows strong performance on other multilingual tasks, such as XNLI (Conneau et al., 2018), XTREME (Hu et al., 2020), and MLQA (Lewis et al., 2019b). We use the embeddings from XLM (Conneau and Lample, 2019) to initialize the multilingual embeddings of our decoder.

Model Training
To train our one-to-many multilingual AMR-to-text generation model, we use pairs of English AMR and text in multiple different languages. The English AMR does not need to be aligned to sentences in multiple languages. Instead, we create one AMR-to-text corpus for each language and concatenate all of them for training a multilingual model. During the training process, the pretrained AMR encoder and pretrained crosslingual decoder are finetuned on our multilingual AMR-to-text training corpus.

Experimental Setting
We describe the various sources of data used to create multilingual AMR-to-text generation models, as well as the implementation and evaluation.

Data
Pretraining For encoder pretraining on silver AMR, we take thirty million sentences from the English portion of CCNET (Wenzek et al., 2019), a cleaned version of Common Crawl (an open-source crawl of the web). We use jamr to parse English sentences into AMR. For multilingual decoder pretraining, we take thirty million sentences from each language split of CCNET.
Multilingual Data We use EUROPARL, an aligned corpus of European Union parliamentary debates. Each language in EUROPARL is aligned to English. We study the twenty-one languages available in EUROPARL: Bulgarian, Czech, Danish, Dutch, English, German, Greek, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish. The earliest releases in EUROPARL were prepared with a fixed common testing set across all languages, but later releases in ten new languages do not have a validation or test set. Thus, for the languages where the standard split is applicable, we report results on the common testing set, splitting it in half for validation and testing. For languages where there is no evaluation set, we take a part of the training set and reserve it for validation and another portion for testing. We use jamr to parse the English text of the EUROPARL corpus into AMRs. This creates a corpus of automatically created silver English AMRs aligned with sentences in twenty-one European languages.
Gold AMR We also evaluate our models (trained on silver AMRs) on gold AMR where available. For this, we use the CROSSLINGUAL AMR dataset from Damonte and Cohen (2018). The corpus was constructed by having professional translators translate the English text of the LDC2015E86 test set into Spanish, Italian, German, and Chinese. We only evaluate on languages where we have training data from EUROPARL (i.e., we do not include Chinese as it is not in EUROPARL).
Preprocessing All data remains untokenized and cased. For AMR, we follow Konstas et al. (2017) in processing the jamr output into a simpler form. We remove variable names and the instance-of relation ( / ) before every concept. However, we do not anonymize entities or dates, as improvements in modeling have allowed for better representations of rare words such as entities. We learn a sentencepiece model with 32K operations to split the English AMR into subword units. On the decoder side, we apply the sentencepiece model and vocabulary of XLM (Conneau and Lample, 2019). We choose to use the existing XLM sentencepiece and vocabulary so that the XLM cross-lingual embeddings can be used to initialize our models. For the encoder, we do not use existing vocabularies, as they do not capture the AMR graph structure.
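The variable-removal step can be approximated with a single regular expression that drops each variable name together with its instance-of slash. This is a sketch of the Konstas et al. (2017)-style simplification, not the exact pipeline, and it ignores edge cases such as concepts containing slashes:

```python
import re


def simplify_amr(amr):
    """Drop variable names and the instance-of slash:
    '(w / want-01 :ARG0 (b / boy))' becomes '(want-01 :ARG0 (boy))'.
    Approximate sketch; the real preprocessing handles more edge cases."""
    return re.sub(r"\(\s*\S+\s*/\s*", "(", amr)
```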

Models
We implement our models in fairseq-py (Ott et al., 2019). We use large Transformer (Vaswani et al., 2017) sequence-to-sequence models and train all models for 50 epochs with LayerDrop (Fan et al., 2019b), which takes around 2 days. We initialize all weights with the pretrained models. When combining crosslingual word embeddings and encoder and decoder pretraining, we initialize all weights with pretraining, then use crosslingual word embeddings. We do not perform extensive hyperparameter search, but experimented with various learning rate values to maintain stable training with pretrained initialization. To generate, we decode with beam search with beam size 5. Our pretrained models are available for download.

Monolingual and Translation Baselines
We compare our multilingual models both to monolingual models (one model trained for each language) and to a hybrid NLG/MT baseline. For the latter, we first generate with the AMR-to-English model and then translate the generation output to the target language using MT. Our translation models are Transformer Big models trained with LayerDrop (Fan et al., 2019b) for 100k updates on public benchmark data from WMT where available, supplemented with mined data from the ccMatrix project (Schwenk et al., 2019). We trained translation models for languages where large quantities of aligned bitext data are readily available, covering a variety of languages.

Evaluation
We evaluate with detokenized BLEU using sacrebleu (Post, 2018). We conduct human evaluation by asking native speakers to evaluate word order, morphology, semantic faithfulness (with respect to the reference) and paraphrasing (how much the generation differs from the reference) on a 3-point scale. The evaluation was done online. For each language, evaluators annotated 25 test set sentences with high BLEU scores and 25 sentences with low BLEU scores. We removed sentences that were shorter than 5 words. As it is difficult to ensure high-quality annotations for 21 languages using crowdsourcing, we relied on colleagues by reaching out on NLP and Linguistics mailing lists. As a result, the number of evaluators per language varies (cf. Table 3).

We evaluate multilingual AMR-to-Text generation models in 21 languages. We conduct an ablation study which demonstrates the improvements in modeling performance induced by incorporating graph embeddings, crosslingual embeddings, and pretraining. Finally, we analyze model performance with respect to several linguistic attributes (word order, morphology, paraphrasing, semantic faithfulness) using both automatic metrics and human evaluation.
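For intuition on the automatic metric: BLEU is a brevity-penalized geometric mean of modified n-gram precisions. The minimal smoothed sentence-level version below is for illustration only; it differs from sacrebleu's tokenization, smoothing, and corpus-level aggregation, which is what we actually report:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=4):
    """Single-reference, epsilon-smoothed sentence BLEU (illustrative only)."""
    hyp, ref = hyp.split(), ref.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in h.items())
        total = max(1, sum(h.values()))
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return brevity * math.exp(sum(log_precisions) / max_n)
```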

Multilingual AMR-to-Text Generation
Monolingual vs. Multilingual Models. We compare English-XX baselines trained to generate from AMR into a single language with multilingual models. We note that as the English-XX models specialize for each language, they have less to model with the same parameter capacity. Results are shown in Table 1. Overall, multilingual models perform well: on 18 of the 21 languages, the performance measured by BLEU is stronger than the monolingual baseline.
One advantage of multilingual AMR-to-Text generation is increased quantities of AMR on the encoder side. This is particularly helpful when the size of the training data is low. For instance, Estonian (et) sees a 2.3 BLEU point improvement from multilingual modeling. Conversely, languages such as English, Swedish and French benefit less from multilingual modeling, most likely because there is sufficient data for those languages already. More generally, there is a marked difference between languages for which the training data is large and those for which the training data is smaller. When the training data is large (1.9 to 2M training instances, top part of Table 1), the average improvement is +0.36 BLEU (min: -0.2, max: +0.9), whereas for languages with smaller training data (400 to 620K training instances, bottom part of Table 1), the average improvement is +1.75 (min: +1, max: +2.3). These trends are similar to observations on other tasks, namely that pretraining is most helpful when there is not sufficient training data in the task itself to train strong representations.

Results
Performance on Gold English AMR We evaluate our models trained on silver AMR on the CROSSLINGUAL AMR dataset from Damonte and Cohen (2018), where the input is a gold English-centric AMR and the output is available in three European languages: Spanish, Italian, and German. The results are shown in Table 2. Similar to the trends seen when generating from silver AMR, we find that multilingual models have strong performance. BLEU scores are lower than on EUROPARL because the models are tested out of domain (training on parliamentary debates but testing on newswire and forum data).
On English LDC data, we compare to existing work. Even though it is trained on silver AMRs and on out-of-domain, non-LDC data, the multilingual model compares well with previous work (see Table 2). When finetuned on the LDC2015E86 train set, our model improves on English by over 1 BLEU point, outperforming all previous work except Zhu et al. (2019). That work directly models the graph structure of AMR with structure-aware attention to improve Transformer architectures; this is orthogonal to our main aim of multilingual generation and can be incorporated in future work.
Impact of Modeling Improvements. For the multilingual model, we display the effect of incrementally adding modeling improvements (cf. Table 1). Each improvement is helpful across essentially all considered languages, though some have a greater impact on performance than others.
Comparison to the Hybrid NLG/MT Baseline. Compared to the NLG/MT baseline, our multilingual models provide comparable results while offering an arguably simpler approach (end-to-end rather than pipeline) and training on much smaller quantities of parallel data. On German and French (very high-resource languages with millions of training examples), there is slightly stronger performance, while on the other languages we compare to, the translation models perform a bit worse.
We further conduct a human evaluation study on Spanish, Italian, and German. We ask evaluators to assess the morphology, word order, and semantic accuracy of our multilingual AMR-to-Text system compared to the hybrid English AMR-to-Text + Machine Translation baseline. We show in Table 4 that the two models score very similarly in human evaluation, indicating the strength of the fully multilingual system in producing fluent output.

Analysis of Multilingual Generation
A core challenge for multilinguality is that languages differ with respect to word order and morphology, so models must learn this per language.We use automatic and human evaluation to investigate how these differences affect performance.
Morphology Instead of operating on words, our models use sentencepiece (Wu et al., 2016), a data-driven approach to break words into subwords. As shown in Wu and Dredze (2019), in transfer-based approaches to natural language understanding tasks, the proportion of subwords shared between the source and the transfer language impacts performance.

We further assess morphology by asking human evaluators to grade the morphology of sentences (Is the morphology correct? Are agreement constraints, e.g., verb/subject, noun/adjective, respected?) on a scale from 1 to 3, with 3 being the highest score. As Table 3 shows, there is not much difference in performance between languages, even though there is a marked difference in terms of agreement constraints between, e.g., Finnish and English. Between annotators, agreement was high: the standard deviation was low, with the exception of Romanian, Hungarian, and Spanish (as shown in Table 3). This demonstrates the surprisingly high ability of multilingual models to generalize across languages.
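The token-overlap statistic correlated with BLEU in Figure 4 can be computed, for instance, as the Jaccard overlap between the (sub)word vocabularies of two corpora. This particular formulation is our assumption for illustration, since the exact formula is not given here:

```python
def vocab_overlap(tokens_a, tokens_b):
    """Jaccard overlap between two vocabularies (word or subword types):
    |A intersect B| / |A union B|. Hypothetical formulation for illustration."""
    va, vb = set(tokens_a), set(tokens_b)
    return len(va & vb) / len(va | vb) if va | vb else 0.0
```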
Table 4: Human Evaluation of our approach compared to the hybrid English AMR-to-Text + Machine Translation baseline using gold AMR from LDC2015E86. Two native speakers per language assess fifty sentences each on a scale of 1 to 3, with 3 being the highest score.

Table 5 examples (generation vs. reference):
English. Generation: "This point will certainly be the subject of subsequent further debates in the council." Reference: "This is a point that will undoubtedly be discussed later in the Council."
French. Generation: "Je ne suis pas favorable à des exceptions à cette règle." Reference: "A mon avis, il n'est pas bon de faire des exceptions à cette règle."
Swedish. Reference: "Vi har därför inte röstat för detta betänkande."

Word Order To assess the impact of varied word orders by language, we ask human evaluators to judge if the word order is natural. As shown in Table 3, for all languages except Latvian and Romanian, the score is very high (close to 3), indicating that the model learns to decode into multiple languages even though word order differs. The agreement between annotators was high, with low standard deviation (see Table 3). Further, the attention pattern between the encoder English AMR and the decoder clearly reflects the word order of the various languages. This is illustrated in Figure 3, where the activation pattern mirrors the word order difference between French (1) and German (2).
(1) ont tenu (une réunion de groupe) OBJ (en Janvier 2020) TIME (à New York) LOC
(2) hielten (im Januar 2020) TIME (in New York) LOC (eine Gruppestreffen) OBJ

Training on Related Languages Multilingual models have the potential to benefit from similarities between languages. Languages of the same family often share morphological characteristics and vocabulary. First, we analyze the performance of training on languages within a family. Table 6 shows that a model trained on languages within a family has the strongest performance. Second, we analyze languages within the same family. For four families (Romance, Germanic, Uralic, and Slavic), we create multilingual models trained on pairs of languages. One pair contains the most closely related languages within that family (e.g., Spanish and Portuguese) and another pair contains the most distant languages within that family (e.g., Spanish and Romanian). We determine which pairs are close and far following Ahmad et al. (2019). Figure 5 shows that training on pairs of closely related languages yields better performance than training on pairs of less closely related languages, even within a family. Multilingual models thus appear to pick up on similarities between languages to improve performance.
Semantic Accuracy and Paraphrasing. We ask human evaluators to grade the faithfulness of the hypothesis compared to the reference on a scale of 1 to 3. As shown in Table 3, the overall semantic accuracy is very high (note that a score of 2 indicates minor differences). We also asked annotators to evaluate how different the generated sentence was from the reference. When coupled with the semantic accuracy score, this allows us to evaluate the generation of true paraphrases, i.e., sentences with the same meaning as the reference but a different surface form. In Table 3, Good Paraphrases indicates the percentage of cases that scored highly (2 or 3) with respect to both semantic adequacy and paraphrasing. A large majority of generated sentences are labeled as valid paraphrases by native speakers, indicating (i) that despite underspecified input, the generated sentence retains the meaning of the reference and (ii) that this underspecification allows for the generation of paraphrases. This also suggests that BLEU scores only partially reflect model performance, as good paraphrases typically differ from the reference and are likely to receive lower BLEU scores even though they may be semantically accurate. Table 5 shows some examples illustrating the paraphrasing potential of the approach.

Conclusion
Abstract Meaning Representations were designed to describe the meaning of English sentences. As such, they are heavily biased towards English. AMR concepts are either English words, PropBank framesets ("want-01") or special, English-based keywords (e.g., "date-entity"). The structure of AMRs is also influenced by English syntax. For instance, the main relation of "I like to eat" is the concept associated with its main verb ("like"), whereas given the corresponding German sentence "Ich esse gern" (lit. "I eat willingly"), the main predicate might have been chosen to be "eat" ("essen"). In other words, AMRs should not necessarily be viewed as an interlingua (Banarescu et al., 2013). Nonetheless, our work suggests that they can be used as one: given an English-centric AMR, it is possible to generate the corresponding sentence in multiple languages. This is in line with previous work by Damonte and Cohen (2019), which shows that despite translation divergences, AMR parsers can be learned for Italian, Chinese, German and Spanish, all mapping into an English-centric AMR.

Figure 1 :
Figure 1: Generating into Multiple Languages from English AMR.

Figure 2 :
Figure 2: One-to-Many Architecture for Multilingual AMR-to-Text Generation. The English-centric AMR input is linearized and modeled with graph embeddings with a pre-trained Transformer Encoder. Text is generated with a pre-trained Transformer Decoder initialized with cross-lingual embeddings.

Figure 3 :
Figure 3: Attention alignment when decoding in French and German from the same input AMR.

Figure 4 :
Figure 4: Relationship between BLEU Score and Token Overlap for all 21 languages. The correlation coefficient between word overlap and BLEU is 0.42, and the coefficient between subword overlap and BLEU is 0.26.

Figure 5 :
Figure 5: BLEU difference training on Close vs. Far Languages within One Family. Training on a close pair consistently improves performance compared to training on a far pair, even within a language family.

Table 1 :
Results on 21 Languages in EUROPARL. The English-XX baseline (generation into a single language) combines all modeling improvements. When training on multiple seeds, the standard deviation is around 0.1 to 0.3 BLEU, making the difference between the multilingual baseline and the addition of our modeling improvements statistically significant.

Table 2 :
Results on Gold AMR from LDC2015E86.

Table 3 :
Human Evaluation. Native speakers assess fifty sentences on a scale of 1 to 3, with 3 the highest score. Good Paraphrases are sentences with high scores (2 or 3) for both Semantic Accuracy and Paraphrasing.

Table 5 :
Example Paraphrases generated by our multilingual model.

Table 6 :
Performance when training with increasingly more languages. Training one multilingual AMR-to-Text model on languages within the same language family improves performance.