Linguistic realisation as machine translation: Comparing different MT models for AMR-to-text generation

In this paper, we study AMR-to-text generation, framing it as a translation task and comparing two different MT approaches (Phrase-based and Neural MT). We systematically study the effects of 3 AMR preprocessing steps (Delexicalisation, Compression, and Linearisation) applied before the MT phase. Our results show that preprocessing indeed helps, although the benefits differ for the two MT models.


Introduction
Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). While there is broad consensus among NLG scholars on the output of NLG systems (i.e., text), there is far less agreement on what the input should be; see Gatt and Krahmer (2017) for a recent review. Over the years, NLG systems have taken a wide range of inputs, including for example images (Xu et al., 2015), numeric data (Gkatzia et al., 2014) and semantic representations (Theune et al., 2001).
This study focuses on generating natural language based on Abstract Meaning Representations (AMRs) (Banarescu et al., 2013). AMRs encode the meaning of a sentence as a rooted, directed and acyclic graph, where nodes represent concepts, and labeled directed edges represent relations among these concepts. The formalism strongly relies on the PropBank notation. Figure 1 shows an example.1 AMRs have increased in popularity in recent years, partly because they are relatively easy to produce, to read and to process automatically. In addition, they can be systematically translated into first-order logic, allowing for a well-specified model-theoretic interpretation (Bos, 2016). Most earlier studies on AMRs have focused on text understanding, i.e. processing texts in order to produce AMRs (Flanigan et al., 2014; Artzi et al., 2015). However, recently the reverse process, i.e. the generation of texts from AMRs, has started to receive scholarly attention (Flanigan et al., 2016; Song et al., 2016; Pourdamghani et al., 2016; Song et al., 2017; Konstas et al., 2017).

1 https://github.com/ThiagoCF05/LinearAMR
We assume that in practical applications, conceptualisation models or dialogue managers (models which decide "what to say") output AMRs. In this paper we study different ways in which these AMRs can be converted into natural language (deciding "how to say it"). We approach this as a translation problem, automatically translating from AMRs into natural language, and the key contribution of this paper is that we systematically compare different preprocessing strategies for two different MT systems: Phrase-based MT (PBMT) and Neural MT (NMT).
We look at the potential benefits of three preprocessing steps on AMRs before feeding them into an MT system: delexicalisation, compression, and linearisation. Delexicalisation decreases the sparsity of an AMR by removing constant values, compression removes nodes and edges which are less likely to be aligned to any word on the textual side, and linearisation 'flattens' the AMR in a specific order. Combining all possibilities gives rise to 2^3 = 8 AMR preprocessing strategies, which we evaluate for two different MT systems: PBMT and NMT.

Figure 1: Example of an AMR
Following earlier work in AMR-to-text generation and the MT literature, we evaluate the system outputs in terms of fluency, adequacy and post-editing effort, using BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007) and TER (Snover et al., 2006) scores, respectively. We show that preprocessing helps, although the extent of the benefits differs for the two MT systems.

Related Studies
To the best of our knowledge, Flanigan et al. (2016) was the first study that introduced a model for natural language generation from AMRs. The model consists of two steps. First, the AMR-graph is converted into a spanning tree, and then, in a second step, this tree is converted into a sentence using a tree transducer.
In Song et al. (2016), the generation of a sentence from an AMR is addressed as an asymmetric generalised traveling salesman problem (AGTSP). For sentences shorter than 30 words, the model does not beat the system described by Flanigan et al. (2016). However, Song et al. (2017) treat the AMR-to-text task using a Synchronous Node Replacement Grammar (SNRG) and outperform Flanigan et al. (2016).
Although AMRs do not contain articles and do not represent inflectional morphology for tense and number (Banarescu et al., 2013), the formalism is relatively close to the (English) language. Motivated by this similarity, Pourdamghani et al. (2016) proposed an AMR-to-text method that organises some of these concepts and edges in a flat representation, a process commonly known as Linearisation. Once the linearisation is complete, Pourdamghani et al. (2016) map the flat AMR into an English sentence using a Phrase-Based Machine Translation (PBMT) system. This method yields better results than Flanigan et al. (2016) on the development and test sets of the LDC2014T12 corpus. Pourdamghani et al. (2016) train their system using a set of AMR-sentence pairs obtained by the aligner described in Pourdamghani et al. (2014). In order to decrease the sparsity of the AMR formalism, caused by the ratio of a broad vocabulary to a relatively small amount of data, this aligner drops a considerable amount of the AMR structure, such as the role edges :ARG0, :ARG1, :mod, etc. However, inspection of the gold-standard alignments provided in the LDC2016E25 corpus revealed that this rule-based compression can be harmful for the generation of sentences, since such role edges can actually be aligned to function words in English sentences. Having these roles available could thus arguably improve AMR-to-text translation. This indicates that a better comparison of the effects of different preprocessing steps is called for, which is what we do in this study.
In addition, Pourdamghani et al. (2016) use PBMT, which is devised for translation but has also been utilised in other NLP tasks, e.g. text simplification (Wubben et al., 2012; Štajner et al., 2015). However, these systems have the disadvantage of having many different feature functions, and finding optimal settings for all of them increases the complexity of the problem from an engineering point of view. An alternative MT model has been proposed: Neural Machine Translation (NMT). NMT models frame translation as a sequence-to-sequence problem (Bahdanau et al., 2015), and have shown strong results when translating between many different language pairs (Bojar et al., 2015). Recently, Konstas et al. (2017) introduced sequence-to-sequence models for parsing (text-to-AMR) and generation (AMR-to-text). They use a semi-supervised training procedure, incorporating 20M English sentences which do not have a gold-standard AMR, thus overcoming the limited amount of data available. They report state-of-the-art results for the task, which suggests that NMT is a promising alternative for AMR-to-text generation.

Models
We describe our AMR-to-text generation models, which rely on 3 preprocessing steps (delexicalisation, compression, and/or linearisation) followed by machine translation and realisation steps.

Delexicalisation
Inspection of the LDC2016E25 corpus reveals that on average 22% of the structure of an AMR consists of constant values, such as names, quantities, and dates. This information increases the sparsity of the data, arguably making it more difficult to map an AMR into a textual format. To address this, Pourdamghani et al. (2016) use special realisation components for names, dates and numbers found in the development and test sets and add them to the training set. Instead, similar to Konstas et al. (2017), we delexicalise these constants, replacing the original information with tags (e.g., name1, quant1). A list of tag-value pairs is kept, allowing us to identify the position of each tag and to re-insert the original information in the sentence after the translation step is completed. Figure 2 shows a delexicalised AMR.
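The delexicalisation step can be illustrated with a minimal sketch. This is not the paper's implementation: it operates on a string-serialised AMR and only handles numeric constants, whereas a real system would walk the AMR graph and treat names and dates analogously.

```python
import re

def delexicalise(amr_str):
    """Delexicalisation sketch: replace numeric constants in a
    string-serialised AMR with indexed quantN tags, keeping a
    tag-value map so the originals can be re-inserted after
    translation. Hypothetical: names and dates would be handled
    analogously with nameN / dateN tags."""
    mapping = {}

    def replace(match):
        tag = f"quant{len(mapping) + 1}"
        mapping[tag] = match.group(0)
        return tag

    # numbers that stand alone as constant values in the serialisation
    delex = re.sub(r"(?<=\s)\d+(?=[\s)])", replace, amr_str)
    return delex, mapping
```

Keeping the tag-value map is what makes the later Realisation step a simple substitution.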

Compression
Given the alignment between an AMR and a sentence, the nodes and edges in the AMR can either be aligned to words in the sentence or not. So before the linearisation step, we would like to know which elements of an AMR should actually be part of the 'flattened' representation.
Following the aligner of Pourdamghani et al. (2014), Pourdamghani et al. (2016) clean an AMR by removing some nodes and edges independent of the context. Instead, we use alignments that may relate a given node or edge to an English word depending on the context. In Figure 1 for instance, the first edge :ARG1 is aligned to the preposition to in the sentence, whereas the second edge with the same value is not aligned to any word in the sentence. Therefore, we need to train a classifier to decide which parts of an AMR should be in the flattened representation according to the context.
To solve the problem, we train a Conditional Random Field (CRF) which determines whether a node or an edge of an AMR should be included in the flattened representation. The classification process is sequential over a flattened representation of an AMR obtained by a depth-first search through the graph. Each element is represented by its name and its parent's name. We use CRFSuite (Okazaki, 2007) to implement our model.
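The feature extraction for this sequential classifier can be sketched as below. The feature set (element name, parent name) follows the text; the dictionary format is one that sequence labellers such as python-crfsuite accept, but the exact representation here is our assumption, not the paper's code.

```python
def crf_features(sequence):
    """Build one feature dict per element of a DFS-flattened AMR.
    Each (name, parent) pair comes from the depth-first traversal;
    the CRF then labels each element keep/drop. Sketch only: training
    would feed these dicts to e.g. python-crfsuite."""
    feats = []
    for name, parent in sequence:
        feats.append({
            "name": name,
            "parent": parent,
            "is_edge": name.startswith(":"),  # role edges start with ':'
        })
    return feats
```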

Linearisation
After Compression, we flatten the AMR to serve as input to the translation step, similarly to Pourdamghani et al. (2016). We perform a depth-first search through the AMR, printing the elements according to their visiting order. In a second step, also following Pourdamghani et al. (2016), we implemented a version of the 2-Step Classifier of Lerner and Petrov (2013) to preorder the elements of an AMR according to the target-side order.
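The depth-first flattening can be sketched as follows. The adjacency-list representation of the AMR graph is our assumption for illustration; any graph encoding with ordered children works the same way.

```python
def linearise(node, graph):
    """Depth-first linearisation of an AMR given as an adjacency list
    {node: [(edge, child), ...]} (hypothetical representation).
    Elements are emitted in visiting order, edges before their
    subtrees, as described in the text."""
    out = [node]
    for edge, child in graph.get(node, []):
        out.append(edge)
        out.extend(linearise(child, graph))
    return out
```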

2-Step Classifier
We implement the preordering method proposed by Lerner and Petrov (2013) in the following way. We define the order between a head node and its subtrees in two steps. In the first, we use a trained maximum entropy classifier to predict, for each subtree, whether it should occur before or after the head node. As features, we represent the head node by its frameset, whereas each subtree is represented by its head node's frameset and its parent edge.
Once we have divided the subtrees into those which should occur before and those which should occur after the head node, we use a maximum entropy classifier specific to the size of each subtree group to predict their order. For instance, for a group of 2 subtrees, a maximum entropy classifier trained specifically for groups of 2 subtrees predicts their permutation order (0-1 or 1-0). As features, the head node is again represented by its PropBank frameset, whereas the subtrees of a group are represented by their parent edges, their head node framesets, and the side of the head node on which they occur (before or after). We train classifiers for groups of 2 to 4 subtrees. For bigger groups, we use the depth-first search order.
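The two-step decision described above can be sketched with stand-in classifiers. The callables `side_clf` and `order_clfs` are placeholders for the trained maximum entropy models; only the control flow (side decision, then size-specific ordering, DFS fallback for large groups) mirrors the text.

```python
def preorder(head, subtrees, side_clf, order_clfs):
    """Two-step preordering sketch (after Lerner and Petrov, 2013).
    side_clf(head, subtree) -> 'before' | 'after' decides the side;
    order_clfs[k](head, group) permutes a group of k subtrees.
    Both are stand-ins for the trained maximum entropy classifiers."""
    before = [s for s in subtrees if side_clf(head, s) == "before"]
    after = [s for s in subtrees if side_clf(head, s) == "after"]

    def ordered(group):
        k = len(group)
        if 2 <= k <= 4:                 # trained classifiers for sizes 2-4
            return order_clfs[k](head, group)
        return group                    # DFS order for bigger/smaller groups

    return ordered(before) + [head] + ordered(after)
```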

Translation models
To map a flat AMR representation into an English sentence, we use phrase-based (Koehn et al., 2003) and neural machine translation (Bahdanau et al., 2015) models.

Phrase-Based Machine Translation
These models use Bayes' rule to formalise the problem of translating a text from a source language f into a target language e. In our case, we want to translate a flat amr into an English sentence e, as Equation 1 shows.
$\hat{e} = \arg\max_{e} P(amr \mid e)\, P(e)$ (1)

The prior $P(e)$ is usually represented by a language model trained on the target language. The posterior $P(amr \mid e)$ is computed by the log-linear model described in Equation 2:

$P(amr \mid e) \propto \exp \big( \textstyle\sum_{j} \lambda_j\, h_j(amr, e) \big)$ (2)

Each $h_j(amr, e)$ is an arbitrary feature function over AMR-sentence pairs, weighted by $\lambda_j$. To calculate it, the flat $amr$ is segmented into $I$ phrases $\overline{amr}_1^I$, such that each phrase $\overline{amr}_i$ is translated into a target phrase $\bar{e}_i$, as described by Equation 3:

$P(\overline{amr}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\overline{amr}_i \mid \bar{e}_i)\, d(s_i, f_{i-1})$ (3)
As feature functions, we used direct and inverse phrase translation probabilities and lexical weighting; word, unknown word and phrase penalties.
We also used models to reorder a flat amr according to the target sentence e at decoding time. These work at the word level (Koehn et al., 2003), at the level of adjacent phrases (Koehn et al., 2005) and beyond adjacent phrases, i.e. at the hierarchical level (Galley and Manning, 2008). Phrase- and hierarchical-level models are also known as lexicalised reordering models.
Following Koehn et al. (2003), given $s_i$ the start position of the source phrase $\overline{amr}_i$ translated into the English phrase $\bar{e}_i$, and $f_{i-1}$ the end position of the source phrase $\overline{amr}_{i-1}$ translated into the English phrase $\bar{e}_{i-1}$, a distortion model $\alpha^{|s_i - f_{i-1} - 1|}$ is defined as a distance-based reordering model. $\alpha$ is chosen by tuning the model.
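The distance-based distortion cost is simple enough to illustrate directly; this worked example just evaluates the formula above for a monotone step and a jump.

```python
def distortion_cost(alpha, start_i, end_prev):
    """Distance-based reordering cost alpha^{|s_i - f_{i-1} - 1|}
    (Koehn et al., 2003). A monotone step (s_i = f_{i-1} + 1) costs
    alpha^0 = 1; jumps are penalised exponentially in their distance."""
    return alpha ** abs(start_i - end_prev - 1)
```

For example, with alpha = 0.5, a monotone continuation costs 1.0, while skipping two positions costs 0.25.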
Lexicalised models are more complex than distance-based ones, but usually help the system obtain better results (Koehn et al., 2005; Galley and Manning, 2008). Given a possible set of target phrases $e = (\bar{e}_1, \ldots, \bar{e}_n)$ based on a source $amr$, and a set of alignments $a = (a_1, \ldots, a_n)$ that maps a source phrase $\overline{amr}_{a_i}$ onto a target phrase $\bar{e}_i$, a lexicalised model aims to predict a set of orientations $o = (o_1, \ldots, o_n)$, as Equation 4 shows:

$p(o \mid e, amr) = \prod_{i=1}^{n} p(o_i \mid \bar{e}_i, \overline{amr}_{a_i}, a_{i-1}, a_i)$ (4)
In the hierarchical model, we distinguish the discontinuous orientation by direction: discontinuous right ($a_i - a_{i-1} < 1$) and discontinuous left ($a_i - a_{i-1} > 1$). These models are important for our task, since the preordering method used in the Linearisation step may be insufficient to match the target sentence order.
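The orientation assignment can be sketched as a small function over consecutive alignment positions. The monotone/swap cases follow the standard lexicalised reordering definitions; the discontinuous split by direction follows the inequalities in the text.

```python
def orientation(a_i, a_prev):
    """Orientation of phrase i relative to phrase i-1 in a lexicalised
    reordering model: monotone (distance +1), swap (distance -1), and
    the direction-split discontinuous classes described in the text
    for the hierarchical model (Galley and Manning, 2008)."""
    d = a_i - a_prev
    if d == 1:
        return "monotone"
    if d == -1:
        return "swap"
    return "discontinuous-right" if d < 1 else "discontinuous-left"
```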

Neural Machine Translation
Following the attention-based Neural Machine Translation (NMT) model introduced by Bahdanau et al. (2015), given a flat amr = (amr 1 , amr 2 , · · · , amr N ) and its English sentence translation e = (e 1 , e 2 , · · · , e M ), a single neural network is trained to translate amr into e by directly learning to model p(e | amr). The network consists of one encoder, one decoder, and one attention mechanism.
The encoder is a bi-directional RNN with gated recurrent units (GRU) (Cho et al., 2014): a forward RNN $\overrightarrow{\Phi}_{enc}$ reads the $amr$ from left to right and generates a sequence of forward annotation vectors $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_N)$, one at each encoder time step $i \in [1, N]$, and a backward RNN $\overleftarrow{\Phi}_{enc}$ reads the $amr$ from right to left and generates a sequence of backward annotation vectors $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_N)$. The final annotation vector for time step $i$ is the concatenation of the forward and backward vectors, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, and $C = \{h_1, \ldots, h_N\}$ is the set of source annotation vectors.
The decoder is a neural LM conditioned on the previously emitted words and on the source sentence via an attention mechanism over $C$. A multilayer perceptron is used to initialise the decoder's hidden state $s_0$, where the input to this network is the concatenation of the last forward and backward vectors $[\overrightarrow{h}_N; \overleftarrow{h}_1]$. At each time step $t$ of the decoder, we compute a time-dependent context vector $c_t$ based on the annotation vectors $C$, the decoder's previous hidden state $s_{t-1}$ and the target English word $\tilde{e}_{t-1}$ emitted by the decoder in the previous time step. A single-layer feed-forward network computes an expected alignment $a_{t,i}$ between each source annotation vector $h_i$ and the target word to be emitted at the current time step $t$, as in Equation 6:

$a_{t,i} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_i)$ (6)

In Equation 7, these expected alignments are normalised and converted into probabilities:

$\alpha_{t,i} = \frac{\exp(a_{t,i})}{\sum_{j=1}^{N} \exp(a_{t,j})}$ (7)

where $\alpha_{t,i}$ are the model's attention weights, which are in turn used in computing the time-dependent context vector $c_t = \sum_{i=1}^{N} \alpha_{t,i} h_i$. Finally, the context vector $c_t$ is used in computing the decoder's hidden state $s_t$ for the current time step $t$, as shown in Equation 8:

$s_t = \Phi_{dec}(s_{t-1}, W_e[\tilde{e}_{t-1}], c_t)$ (8)

where $s_{t-1}$ is the decoder's previous hidden state, $W_e[\tilde{e}_{t-1}]$ is the embedding of the word emitted in the previous time step, and $c_t$ is the updated time-dependent context vector. Given a hidden state $s_t$, the probabilities for the next target word are computed using one projection layer followed by a softmax, as illustrated in Equation 9:

$p(e_t = k \mid e_{<t}, c_t) \propto \exp(L_o \tanh(L_s s_t + L_w W_e[\tilde{e}_{t-1}] + L_c c_t))$ (9)

where the matrices $L_o$, $L_s$, $L_w$ and $L_c$ are transformation matrices and $c_t$ is the time-dependent context vector.
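A single attention step (the score, softmax and context-vector computations) can be sketched numerically. The matrix names `W_a`, `U_a`, `v_a` are our own labels for the feed-forward alignment network's parameters, not identifiers from the paper.

```python
import numpy as np

def attention_step(s_prev, H, W_a, U_a, v_a):
    """One Bahdanau-style attention step over N annotation vectors.
    s_prev: previous decoder state (d_s,); H: annotation vectors (N, d_h);
    W_a (d_s, d_a), U_a (d_h, d_a), v_a (d_a,) parameterise the
    single-layer alignment network. Returns the attention weights
    alpha_{t,i} and the context vector c_t."""
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # expected alignments (N,)
    weights = np.exp(scores - scores.max())          # stable softmax
    weights /= weights.sum()                         # alpha_{t,i}, sums to 1
    c_t = weights @ H                                # weighted sum of annotations
    return weights, c_t
```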

Realisation
Since we delexicalise names, dates, quantities and values in AMRs, we need to textually realise this information once we obtain the results from the translation step. As we keep all the original information and its relation to the tags, we just need to replace one with the other. We implement some rules to adapt our generated texts to the ones seen in the training set. Unlike in the AMRs, we represent months nominally rather than numerically: month 5 becomes May, for example. Values and quantities bigger than a thousand are also partly realised nominally; the value 8500000000 would be realised as 8.5 billion, for instance. Names, on the other hand, are realised as they are.
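These realisation rules can be sketched as below. The tag naming scheme (monthN, quantN, valueN, nameN) and the token-level substitution are assumptions for illustration; the paper's actual rule set is not published in this section.

```python
def realise(translated, mapping):
    """Re-insert delexicalised values after translation, applying the
    rules in the text: months nominally, values >= 1000 scaled to
    thousand/million/billion, names verbatim. Sketch only; the
    tag conventions are assumed."""
    months = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December"]
    tokens = []
    for tok in translated.split():
        if tok not in mapping:
            tokens.append(tok)
            continue
        value = mapping[tok]
        if tok.startswith("month"):
            tokens.append(months[int(value) - 1])
        elif tok.startswith(("quant", "value")) and value.isdigit() and int(value) >= 1000:
            n = int(value)
            for scale, word in [(10**9, "billion"), (10**6, "million"), (10**3, "thousand")]:
                if n >= scale:
                    tokens.append(f"{n / scale:g} {word}")
                    break
        else:
            tokens.append(value)  # names etc. are realised as they are
    return " ".join(tokens)
```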

Data
We used the LDC2016E25 corpus provided by SemEval 2017 Task 9 in our evaluation. This corpus consists of aligned AMR-sentence pairs, mostly newswire. We adopted the train/dev/test split proposed in the original setting, totaling 36,521, 1,368 and 1,371 AMR-sentence pairs, respectively. The Compression and Linearisation methods, as well as the Phrase-based Machine Translation models, were trained on the gold-standard alignments between AMRs and sentences in the training set of the corpus.

Evaluated Models
We test models with and without the Delexicalisation/Realisation (-Delex and +Delex) and Compression (-Compress and +Compress) steps. In models without the Compression step, we include all the elements from an AMR in the flattened representation. For the Linearisation step, we flatten the AMR structure based on a depth-first search (-Preorder) or preorder it with our 2-step classifier (+Preorder). Finally, we translate a flattened AMR into text using a Phrase-based (PBMT) and a Neural Machine Translation model (NMT). In total, we evaluated 16 models.
Phrase-based Machine Translation We used a standard PBMT system built with the Moses toolkit (Koehn et al., 2007). At training time, we extract and score phrases of up to 9 tokens. All the feature functions were trained using the gold-standard alignments from the training set, and their weights were tuned on the development data using k-batch MIRA with k = 60 (Cherry and Foster, 2012), with BLEU as the evaluation metric. A distortion limit of 6 was used for the reordering models. Lexicalised reordering models were bidirectional. At decoding time, we use a stack size of 1000.
Our language model P (e) is a 5-gram LM trained on the Gigaword Third Edition corpus using KenLM (Heafield et al., 2013). For the models with the Delexicalisation step, we trained the language model with a delexicalised version of Gigaword by parsing the corpus using the Stanford Named Entity Recognition tool (Finkel et al., 2005). All the entities labeled as LOCATION, PERSON, ORGANISATION or MISC were replaced by the tag nameX . Entities labeled as NUMBER or MONEY were replaced by the tag quantX . Finally, entities labeled as PERCENT or ORDINAL were replaced by valueX . In the tags, X is replaced by the ordinal position of the entity in the sentence.

Neural Machine Translation
The encoder is a bidirectional RNN with GRU, each direction with 1024D hidden states. Source and target word embeddings are 620D each and are both trained jointly with the model. All non-recurrent matrices are initialised by sampling from a Gaussian (µ = 0, σ = 0.01), recurrent matrices are random orthogonal and bias vectors are all initialised to zero. The decoder RNN also uses GRU and is a neural LM conditioned on its previous emissions and on the source sentence by means of the attention mechanism.
We apply dropout with a probability of 0.3 in both source and target word embeddings, in the encoder and decoder RNNs inputs and recurrent connections, and before the readout operation in the decoder RNN. We follow Gal and Ghahramani (2016) and apply dropout to the encoder and decoder RNNs using the same mask in all time steps.
Models are trained using stochastic gradient descent with Adadelta (Zeiler, 2012) and minibatches of size 40. We apply early stopping for model selection based on BLEU scores, so that if a model does not improve on the validation set for more than 20 epochs, training is halted.
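The early-stopping criterion described above can be sketched as a small training loop. The `train_epoch` and `validate_bleu` callables are stand-ins for the actual training and validation routines, which are not shown in the paper.

```python
def train_with_early_stopping(train_epoch, validate_bleu, patience=20):
    """Early stopping on validation BLEU: halt once the score has not
    improved for more than `patience` epochs, returning the best BLEU
    observed. train_epoch() runs one epoch; validate_bleu() scores the
    current model on the validation set (both are placeholders)."""
    best, best_epoch, epoch = -1.0, 0, 0
    while epoch - best_epoch <= patience:
        train_epoch()
        epoch += 1
        bleu = validate_bleu()
        if bleu > best:
            best, best_epoch = bleu, epoch
    return best
```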

Models for Comparison
We compare BLEU scores for some of the AMR-to-text systems described in the literature (Flanigan et al., 2016; Song et al., 2016; Pourdamghani et al., 2016; Song et al., 2017; Konstas et al., 2017). Since the models of Flanigan et al. (2016) and Pourdamghani et al. (2016) are publicly available, we also run them with the same training data as our models. For Flanigan et al. (2016), we specifically use the version available on GitHub 2 .
For Pourdamghani et al. (2016), we use the version available at the first author's website 3 . The rules used for the preordering model and the feature functions of the PBMT system are trained using alignments over AMR-sentence pairs from the training set, obtained with the aligner described by Pourdamghani et al. (2014). As in Pourdamghani et al. (2016), we do not use lexicalised reordering models. Moreover, we tune the weights of the feature functions with MERT (Och, 2003).
Both models make use of a 5-gram language model trained on Gigaword Third Edition corpus with KenLM.

Results
Table 1 shows the scores of the different models, along with the size of the data they were trained on. For illustration, we also include the BLEU scores of all the AMR-to-text systems described in the literature. The models of Flanigan et al. (2016) and Pourdamghani et al. (2016) were originally trained with 10,313 AMR-sentence pairs from the LDC2014T12 corpus and, like our models, with 36,521 AMR-sentence pairs from LDC2016E25 in our study. Those of Song et al. (2016) and Song et al. (2017) were trained with 16,833 pairs from the LDC2015E86 corpus. Konstas et al. (2017), who present the highest quantitative result in the task so far, also used the LDC2015E86 corpus, plus 20 million English sentences from the Gigaword corpus in a semi-supervised approach. We report their results both when their model was trained only with AMR-sentence pairs from the corpus and when it was improved with the 20 million additional sentences.
In our NMT models, the Compression step appears to be harmful to the task, whereas Delexicalisation and preordering in the Linearisation step lead to better results. However, none of the NMT models outperforms either the PBMT models or the baselines.

Discussion
In this paper, we studied models for AMR-to-text generation using machine translation. We systematically analysed the effects of 3 preprocessing strategies on AMRs before feeding them either to a Phrase-based or a Neural MT system. The evaluation was performed on the LDC2016E25 corpus, provided by SemEval 2017 Task 9. For all models, the fluency, adequacy and post-editing effort of the produced sentences were measured by BLEU, METEOR and TER, respectively. In general, we found that preprocessing AMRs helps, although the effects differ for the different systems.
Phrase-based MT Delexicalisation (+Delex) does not seem to play a role in obtaining better sentences from AMRs using PBMT. Our best model (PBMT-Delex+Compress+Preorder) presents results competitive with Pourdamghani et al. (2016), with the advantage that no additional technique is necessary to overcome data sparsity.
Compressing an AMR graph with a classifier shows improvements over a comparable model without compression, but not as strong as preordering the elements in the Linearisation step. In fact, preordering seems to be the most important preprocessing step across all three evaluation metrics. We note that the success of preordering was expected, based on previous results (Pourdamghani et al., 2016).
Neural MT The first impression from our NMT experiments is that using Compression consistently deteriorates translations according to all the metrics evaluated. Delexicalisation, in contrast, seems to improve results, corroborating the findings of Konstas et al. (2017). So while Delexicalisation is harmful and Compression is beneficial for PBMT, we see the opposite in the NMT models. Despite the differences between these two MT architectures, applying preordering in the Linearisation step improves results in both cases. This seems to contradict the findings of Konstas et al. (2017) regarding neural models. We conjecture that the additional training data used by Konstas et al. (2017) may have decreased the gap between using and not using preordering (see also below). More research is necessary to settle this point.
PBMT vs. NMT PBMT models generate much better sentences from AMRs than NMT models in terms of fluency, adequacy and post-editing effort. We believe that the lower performance of the NMT models is due to the small size of the training set (36,521 AMR-sentence pairs). Neural models are known to perform well when trained on much larger data sets, e.g. in the order of millions of entries, as exemplified by Konstas et al. (2017). PBMT models trained on small data sets clearly outperform NMT ones, as the results of Konstas et al. (2017) also suggest.

Model comparison While the best PBMT models are comparable to the state-of-the-art AMR-to-text systems, the current best results are reported by Konstas et al. (2017), showing the potential of applying deep learning to large amounts of training data, with a 33.8 BLEU score. However, this result crucially relies on the existence of a very large dataset. Interestingly, when applied in a situation with limited amounts of data, Konstas et al. (2017) report substantially lower performance scores. In such situations, our PBMT models, like that of Pourdamghani et al. (2016), appear to be a good alternative.

Conclusion
In this work, we systematically studied different MT models to translate AMRs into natural language. We observed that the Delexicalisation, Compression, and Linearisation steps have different impacts on AMR-to-text generation depending on the MT architecture used. We observed that delexicalising AMRs yields the best results in NMT models, in contrast to PBMT models. On the other hand, for both PBMT and NMT models, preordering the AMR during Linearisation yields better results.
Among our models, PBMT generally outperforms NMT. Finally, the literature suggests that the improvements obtained by having more data are larger than those obtained with improved preprocessing strategies. Nonetheless, combining the right preprocessing strategy with large volumes of training data should lead to further improvements.