Surface Realisation Using Full Delexicalisation

Surface realisation (SR) maps a meaning representation to a sentence and can be viewed as consisting of three subtasks: word ordering, morphological inflection and contraction generation (e.g., clitic attachment in Portuguese or elision in French). We propose a modular approach to surface realisation which models each of these components separately, and evaluate our approach on the 10 languages covered by the SR’18 Surface Realisation Shared Task shallow track. We provide a detailed evaluation of how word order, morphological realisation and contractions are handled by the model and an analysis of the differences in word ordering performance across languages.


Introduction
Surface realisation maps a meaning representation to a sentence. In data-to-text generation, it is part of a complex process aiming to select, compress and structure the input data into a text. In text-to-text generation, it can be used as a means to rephrase part or all of the input content. For instance, Takase et al. (2016) used surface realisation to generate a summary based on the meaning representations of multiple input documents, and Liao et al. (2018) used it to improve neural machine translation.
By providing parallel data of sentences and their meaning representation, the SR'18 Surface Realisation shared task (Mille et al., 2018) allows for a detailed evaluation and comparison of surface realisation models. Moreover, as it provides training and test data for multiple languages, it also allows for an analysis of how well these models handle languages with different morphological and topological properties.
The SR'18 shared task includes two tracks: a shallow track where the input is an unordered, lemmatised dependency tree and a deep track where function words are removed and syntactic relations are replaced with semantic ones. In this paper, we focus on the shallow track of the SR'18 Shared Task and we propose a neural approach which decomposes surface realisation into three subtasks: word ordering, morphological inflection and contraction generation (e.g., clitic attachment in Portuguese or elision in French). We provide a detailed analysis of how each of these phenomena (word order, morphological realisation and contraction) is handled by the model, and we discuss the differences between languages.

Related Work
Early approaches to surface realisation adopted statistical methods, including both pipelined (Bohnet et al., 2010) and joint (Song et al., 2014; Puduppully et al., 2017) architectures for word ordering and morphological generation.
Multilingual SR'18 was preceded by the SR'11 surface realisation task, which covered English only (Belz et al., 2011). The systems submitted in 2011 were grammar-based and statistical in nature, mostly relying on pipelined architectures.
Recently, Marcheggiani and Perez-Beltrachini (2018) proposed a neural end-to-end approach based on graph convolutional encoders for the SR'11 deep track.
The SR'18 shallow track received submissions from eight teams, with seven of them dividing the task into two subtasks: word ordering and inflection. Only Elder and Hokamp (2018) developed a joint approach; however, they participated only in the English track.
For word ordering, five teams chose an approach based on neural networks, two used a classifier, and one team resorted to a language model. As for the inflection subtask, five teams applied neural techniques, two used lexicon-based approaches, and one used an SMT system (Basile and Mazzei, 2018; Castro Ferreira et al., 2018; Elder and Hokamp, 2018; King and White, 2018; Madsack et al., 2018; Puzikov and Gurevych, 2018; Singh et al., 2018; Sobrevilla Cabezudo and Pardo, 2018). Overall, neural components were dominant across all the participants. However, the official scores of the teams that went neural differ greatly. Furthermore, two teams (Elder and Hokamp, 2018; Sobrevilla Cabezudo and Pardo, 2018) applied data augmentation, which makes their results not strictly comparable to others.
One of the interesting findings of the shared task is reported by Elder and Hokamp (2018), who showed that applying standard neural encoder-decoder models to jointly learn word ordering and inflection is highly challenging; their sequence-to-sequence baseline without data augmentation achieved 43.11 BLEU points on English.
Our model differs from previous work in three main ways. First, it performs word ordering on fully delexicalised data. Delexicalisation has been used previously, but mostly to handle rare words, e.g. named entities. Here we argue that surface realisation and, in particular, word ordering works better when delexicalising all input tokens. This captures the intuition that word ordering is mainly determined by the syntactic structure of the input. Second, we provide a detailed evaluation of how our model handles the three subtasks underlying surface realisation. While all SR'18 participants provided descriptions of their models, not all of them performed an in-depth analysis of model performance. Exceptions are the works of King and White (2018), who provided a separate evaluation for the morphological realisation module, and Puzikov and Gurevych (2018), who evaluated both the word ordering and the inflection modules. However, it is not clear how each of those modules affects the global performance when merged in the full pipeline. In contrast, we propose a detailed incremental evaluation of each component of the full pipeline and show how each component impacts the final scores. Third, we introduce a linguistic analysis, based on the dependency relations, of the word ordering component, allowing for deeper error analysis of the developed systems. Furthermore, our model explicitly integrates a module for contraction handling, as done before by Basile and Mazzei (2018). We also address all ten languages proposed by the shared task and outline the importance of handling contractions.

Data
The SR'18 data (shallow track) is derived from the ten Universal Dependencies (UD) v2.0 treebanks (Nivre et al., 2017) and consists of (T, S) pairs where S is a sentence, and T is the UD dependency tree of S after word information has been removed and tokens have been lemmatised. The languages are those shown in Table 1 and the size of the datasets (training, dev and test) varies between 7,586 (Arabic) and 85,377 (Czech) instances with most languages having around 12K instances (for more details about the data see Mille et al. (2018)).

Model
As illustrated by Example 1, surface realisation from SR'18 shallow meaning representations can be viewed as consisting of three main steps: word ordering, morphological inflection and contraction generation. For instance, given an unordered dependency tree whose nodes are labelled with lemmas and morphological features (1a) 1 , the lemmas must be assigned the appropriate order (1b), they must be inflected (1c) and contractions may take place (1d).
(1) a. the find be not meaning of life it about
    b. it be not about find the meaning of life
    c. It is n't about finding the meaning of life
    d. It isn't about finding the meaning of life

We propose a neural architecture which explicitly integrates these three subtasks as three separate modules in a pipeline: word ordering (WO) is applied first, then morphological realisation (WO+MR) and, finally, contractions (WO+MR+C) are handled.
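The three-stage pipeline can be sketched as function composition. All three functions below are toy stand-ins for the neural and rule-based modules described in the following sections; the lexical entries and the fixed word order are invented purely to reproduce Example 1.

```python
def order_words(tree):
    # Stand-in for the word-ordering model (1a -> 1b): here we
    # simply return the lemmas in a fixed illustrative order.
    return ["it", "be", "not", "about", "find", "the", "meaning", "of", "life"]

def inflect(lemmas):
    # Stand-in for the morphological realiser (1b -> 1c):
    # a toy lemma -> form lookup.
    forms = {"it": "It", "be": "is", "not": "n't", "find": "finding"}
    return [forms.get(lemma, lemma) for lemma in lemmas]

def contract(tokens):
    # Stand-in for contraction generation (1c -> 1d):
    # attach the clitic "n't" to the token on its left.
    out = []
    for token in tokens:
        if token == "n't" and out:
            out[-1] += token        # "is" + "n't" -> "isn't"
        else:
            out.append(token)
    return out

def realise(tree):
    return " ".join(contract(inflect(order_words(tree))))

print(realise(None))  # It isn't about finding the meaning of life
```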

Word Ordering
For word ordering, we combine a factored sequence-to-sequence model with an "extreme delexicalisation" step which replaces matching source and target tokens with an identifier.

Delexicalisation. Delexicalisation has frequently been used in neural NLG to help handle unknown or rare items (Wen et al., 2015; Dušek and Jurcicek, 2015; Chen et al., 2018). Rare items are replaced by placeholders both in the input and in the output; models are trained on the delexicalised data; and a post-processing step ensures that the generated text is relexicalised using the placeholders' original values. In these approaches, delexicalisation is restricted to rare items (named entities). In contrast, we apply delexicalisation to all input lemmas. Abstracting away from specific lemmas reduces data sparsity, allows for the generation of rare or unknown words and, last but not least, captures the linguistic intuition that word ordering mainly depends on syntactic information (e.g., in English, the subject generally precedes the verb).
To create the delexicalised data, we need to identify matching input and output elements and to replace them by the same identifier. We also store a mapping (id, L, F) specifying which identifier id refers to which (L, F) pair, where L is a lemma and F is its set of morpho-syntactic features.
We identify matching input and output elements by comparing the unordered input tree provided by the SR'18 task with the parse tree of the output sentence provided by the UD treebanks (cf. Figure 1). Source and target nodes which share the same path to the root are then mapped to the same identifier. For instance, in Figure 1, the lemma "apple" has the same path to the root (obj:eat:root) in both the input and the output tree. Hence the same identifier is assigned to the nodes. More generally, after linearisation through depth-first, left-to-right traversal of the input tree, each training instance captures the mapping between lemmas in the input tree and the same lemmas in the output sequence. For instance, given the example shown in Figure 1, delexicalisation will yield the training instance:

Input: tkn2 tkn3 tkn4 tkn1
Output: tkn4 tkn1 tkn3 tkn2

where tkn_i is the factored representation (see below) of each delexicalised input node.
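The path-based matching can be sketched as follows. This is a simplified variant that keys each node on the sequence of lemmas on its path to the root; the tree encoding (a dict of node id to (lemma, parent id)) and identifier naming are our own illustration, not the shared-task format.

```python
def paths(tree):
    # tree: {node_id: (lemma, parent_id)}; parent_id is None for the root.
    def path(nid):
        parts = []
        while nid is not None:
            lemma, nid = tree[nid]
            parts.append(lemma)
        return ":".join(parts)          # e.g. "apple:eat"
    return {nid: path(nid) for nid in tree}

def delexicalise(src_tree, tgt_tree):
    # Nodes sharing the same lemma path to the root receive
    # the same placeholder identifier in source and target.
    src_paths, tgt_paths = paths(src_tree), paths(tgt_tree)
    ids = {p: "tkn%d" % (i + 1)
           for i, p in enumerate(sorted(set(src_paths.values())))}
    src = {nid: ids[p] for nid, p in src_paths.items()}
    tgt = {nid: ids.get(p) for nid, p in tgt_paths.items()}
    return src, tgt

# Toy instance loosely modelled on Figure 1 ("I eat an apple"):
src_tree = {1: ("eat", None), 2: ("I", 1), 3: ("apple", 1), 4: ("an", 3)}
tgt_tree = {10: ("eat", None), 11: ("I", 10), 12: ("apple", 10), 13: ("an", 12)}
src, tgt = delexicalise(src_tree, tgt_tree)
print(src, tgt)
```

At relexicalisation time the inverse of the `ids` mapping, together with the stored (id, L, F) records, restores lemmas and features from the generated placeholders.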
Factored Sequence-to-Sequence Model. Following Elder and Hokamp (2018), we use a factored model (Alexandrescu and Kirchhoff, 2006) as a means of enriching the node representations input to the neural model. Each delexicalised tree node is modelled by a sequence of features. Separate embeddings are learned for each feature type and the feature embeddings of each input node are concatenated to create its dense representation. As exemplified in Figure 1, we model each input placeholder as a concatenation of four features: the node identifier, its POS tag, its dependency relation to the parent node and its parent identifier 2 .
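The factored input representation can be sketched as below. The embedding tables here are toy random vectors with invented dimensions; in the actual model, per-feature embeddings are learned jointly with the network (with a total embedding size of 300).

```python
import random
random.seed(0)

FEATURES = ["id", "pos", "deprel", "parent"]           # the four factors
DIMS = {"id": 4, "pos": 3, "deprel": 3, "parent": 4}   # toy embedding sizes

# One embedding table per feature type, filled lazily with toy vectors.
tables = {f: {} for f in FEATURES}

def embed(feature, value):
    table = tables[feature]
    if value not in table:
        table[value] = [random.uniform(-0.1, 0.1) for _ in range(DIMS[feature])]
    return table[value]

def node_vector(node):
    # node: one value per factor; the factor embeddings are concatenated
    # to form the dense representation of the delexicalised node.
    vec = []
    for f in FEATURES:
        vec.extend(embed(f, node[f]))
    return vec

v = node_vector({"id": "tkn3", "pos": "NOUN", "deprel": "obj", "parent": "tkn2"})
print(len(v))  # 4 + 3 + 3 + 4 = 14
```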
Sequence-to-sequence model. We use the OpenNMT-py framework (Klein et al., 2017) 3 to train factored sequence-to-sequence models with attention (Luong et al., 2015) and the copy and coverage mechanisms described in See et al. (2017). A single-layer LSTM is used for both the encoder and the decoder. We train with the full vocabulary and without source or target length truncation for both the baseline and the proposed model. Models were trained for 20 epochs, with a mini-batch size of 64, a word embedding size of 300, and a hidden unit size of 450. They were optimised with SGD with an initial learning rate of 1.0; the learning rate is halved when perplexity does not decrease on the development set. Preliminary experiments showed that the lowest perplexity was reached on average at epoch 17, so this model was kept for decoding. Decoding uses beam search with a beam size of 5. For each language, we train three models with different random seeds and report the average performance and standard deviation.
The model is trained on delexicalised data. At test time, the token/identifier mapping is used to relexicalise the model output.

Morphological Realisation
The morphological realisation (MR) module produces inflected word forms from lemmas coupled with morphological features. For this module, we used a model recently proposed by Aharoni and Goldberg (2017), which achieves state-of-the-art results on several morphological inflection datasets: the CELEX dataset (Baayen et al., 1993; Dreyer et al., 2008), the Wiktionary dataset (Durrett and DeNero, 2013) and the SIGMORPHON 2016 dataset (Cotterell et al., 2016). Their model is based on a neural encoder-decoder architecture with hard monotonic attention and performs out-of-context morphological realisation: given a lemma and a set of morpho-syntactic features, it produces the corresponding word form.
We trained the model of Aharoni and Goldberg (2017) on (lemma+morpho-syntactic features 4 , form) pairs extracted from the SR'18 training data. We trained the model for 20 epochs with the default parameters provided in the implementation 5 .
In our pipeline architecture, morphological realisation is applied to the output of our word ordering model using the (id, L, F) mapping mentioned above. For each delexicalised token produced by the word ordering component, we retrieve the corresponding lemma and morpho-syntactic features (L, F) and apply our MR model to it so as to produce the corresponding word form.
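This lookup-and-inflect step can be sketched as follows. The `inflect` function is a toy lookup standing in for the Aharoni and Goldberg (2017) model, and all entries in `mapping` and `toy` are invented for illustration.

```python
# (id, L, F) mapping built during delexicalisation:
# placeholder identifier -> (lemma, morpho-syntactic features).
mapping = {
    "tkn1": ("it",   {"Case": "Nom"}),
    "tkn2": ("be",   {"Tense": "Pres", "Person": "3"}),
    "tkn3": ("find", {"VerbForm": "Ger"}),
}

def inflect(lemma, feats):
    # Stand-in for the neural inflection model: a toy (lemma, features)
    # -> form lookup that falls back to the bare lemma.
    toy = {("be", "Pres-3"): "is", ("find", "Ger"): "finding"}
    return toy.get((lemma, "-".join(feats.values())), lemma)

def realise_morphology(ordered_ids):
    # Apply MR to each delexicalised token produced by the WO component.
    forms = []
    for tid in ordered_ids:
        lemma, feats = mapping[tid]
        forms.append(inflect(lemma, feats))
    return forms

print(realise_morphology(["tkn1", "tkn2", "tkn3"]))  # ['it', 'is', 'finding']
```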
While associating a lemma and its features to a corresponding form, the MR module operates without taking context into account, so it cannot perform some finer grained operations, such as contraction, elision, and clitic attachment. We address that issue in the following section.

Contraction Generation
Contraction handling is the last step of our surface realisation pipeline. Example 2 shows some types of contractions.
The sequence-to-sequence model is trained on pairs of sentences without and with contractions. The sentence with contraction (S +c ) is the final sentence, i.e., the reference sentence in the SR'18 data. The sentence without contraction (S −c ) is the corresponding sequence of word forms extracted from the UD CoNLL data.
The regular expression module is inspired by the decomposition of multi-word expressions, such as contractions, which is applied during the tokenisation step in parsing (Martins et al., 2009). We reversed the regular expressions provided with TurboParser 6 for the surface realisation task, and added our own to tackle, for example, elision in French. The C_s2s and C_reg modules were created for three languages: French, Italian, and Portuguese 7 .
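A few reversed-tokenisation rules in the spirit of C_reg can be sketched as below. These are our own simplified examples (French elision and Portuguese preposition-article fusion), not the actual rule set derived from TurboParser, and they ignore many exceptions (e.g., aspirated h in French).

```python
import re

# Illustrative reversed-tokenisation rules: each maps a detokenised
# word sequence back to its contracted surface form.
RULES = [
    (r"\b([Ll])[ea]\s(?=[aeiouhéè])", r"\1'"),   # French elision: le/la X -> l'X
    (r"\bde o(s?)\b", r"do\1"),                  # Portuguese: de + o(s) -> do(s)
    (r"\bde a(s?)\b", r"da\1"),                  # Portuguese: de + a(s) -> da(s)
    (r"\bem o(s?)\b", r"no\1"),                  # Portuguese: em + o(s) -> no(s)
]

def contract(sentence):
    for pattern, repl in RULES:
        sentence = re.sub(pattern, repl, sentence)
    return sentence

print(contract("la école"))          # l'école
print(contract("a data de o exame")) # a data do exame
```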

Evaluation
We evaluate each component of our approach separately. We start by providing a detailed evaluation of how the model handles word ordering (Section 5.1). We then go on to analyse the respective contributions of morphological realisation (Section 5.2) and contraction generation (Section 5.3). Finally, we discuss the performance of the overall surface realisation model (Section 5.4).
Throughout the evaluation, we used the SR'18 evaluation scripts to compute automatic metrics 8 .

BLEU scores
We evaluate our word ordering component by computing the BLEU-4 score (Papineni et al., 2002) between the sequence of lemmas it produces and the lemmatised reference sentence extracted from the UD files. The baseline is the same model without delexicalisation. As Table 1 shows, there is a marked, statistically significant difference between the baseline and our approach, which indicates that delexicalisation does improve word ordering.
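For reference, sentence-level BLEU-4 over lemma sequences can be sketched as below. This is a deliberately simplified, smoothed implementation for illustration only; the reported scores were computed with the official SR'18 evaluation scripts.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    # Sentence-level BLEU-4 with brevity penalty; near-zero smoothing
    # avoids log(0) when an n-gram order has no overlap.
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())       # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "it be not about find the meaning of life".split()
print(bleu4(ref, ref))  # 1.0 for a perfect ordering
```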

Word Ordering Constraints
We also investigate the degree to which our results conform with the word ordering constraints of the various languages focusing on the following dependency relations: DET (determiner), NSUBJ (nominal subject), OBJ (object), AMOD (adjectival modifier) and ACL (nominal clausal modifier). For each of these dependency relations, we compare the relative ordering of the corresponding (head, dependent) pairs in the reference data and in our system predictions.
To determine whether the dependent should precede or follow its head, we use the gold standard dependency tree of the UD treebanks. Since for the system predictions we do not have a parse tree, we additionally record the distance between head and dependent (in the reference data) and we compare it with the distance between the same two items in the system output. For instance, for the DET relation, given the gold sentence (3a) and the generated sentence (3b), we extract (3c) from the UD parse tree and (3d) from the predicted sentence where each triple is of the form either (dep, head, distance) or (head, dep, distance) and distance is the distance between head and dependent.
(3) a. GOLD: The yogi tried the advanced asana
    b. PRED: The yogi tried the asana advanced
    c. G-triples: (the_dep, yogi_head, 1), (the_dep, asana_head, 2)
    d. P-triples: (the, yogi, 1), (the, asana, 1)
    Exact match: 1; Approximate match: 2

We then compute exact matches (the order and the distance to the head are exactly the same) and approximate matches (the order is preserved but the distance differs by 1 token 9 ). Table 2 shows the results and compares them with a non-delexicalised approach.
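The match computation can be sketched as follows, assuming the head/dependent order is encoded by the word order within each triple; the function and variable names are our own.

```python
def match_scores(gold_triples, pred_triples, slack=1):
    # Each triple is (first word, second word, distance), with the
    # dep/head order encoded by the pair itself. An exact match requires
    # the same distance; an approximate match tolerates `slack` tokens.
    pred = {(a, b): dist for a, b, dist in pred_triples}
    exact = approx = 0
    for a, b, dist in gold_triples:
        if (a, b) in pred:                 # same relative order predicted
            diff = abs(pred[(a, b)] - dist)
            exact += diff == 0
            approx += diff <= slack
    return exact, approx

# Example (3): gold "The yogi tried the advanced asana" vs.
# predicted "The yogi tried the asana advanced".
gold = [("the", "yogi", 1), ("the", "asana", 2)]
pred = [("the", "yogi", 1), ("the", "asana", 1)]
print(match_scores(gold, pred))  # (1, 2)
```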
Global Score. The all deprels column summarises the scores for all dependency relations present in the treebanks (not just DET, NSUBJ, OBJ, AMOD and ACL). For the exact match, most languages score above average (from 0.51 to 0.71). That is, the relative word order and the position of the dependent with respect to the head are correctly predicted in more than half of the cases. Approximate match yields higher scores, with most languages scoring between 0.65 and 0.80, suggesting that a higher proportion of correct relative orderings is achieved (modulo mispositioning and false positives).
Long Range Dependencies. It is noticeable that, for all languages, accuracy drops for the ACL relation. We conjecture that two factors make it difficult for the model to make the correct prediction: heterogeneity and long range dependencies. As the ACL relation captures different types of clausal modifiers (finite and non-finite), it is harder for the model to learn the corresponding patterns. As the modifier is a clause, the distance between the head (the nominal being modified) and the dependent (the verb of the clause modifier) can be long, which again is likely to impede learning.
Irregular Order. For cases where the head/dependent order is irregular, the scores are lower. For instance, in Dutch the object may occur either before (46.9% of the cases in the test data) or after the verb, depending on whether it occurs in a subordinate or a main clause. Relatedly, the OBJ exact match score is the lowest (0.38) for this language. Similarly, in Romance languages, where the adjective (AMOD relation) can be either pre- (head-final construction, HF) or post-posed (head-initial construction, HI), exact match scores are lower for this relation than for the others. For instance, the Portuguese test data contains 71% HF and 29% HI occurrences of the AMOD relation and, correspondingly, the scores for that relation are much lower than for the DET, NSUBJ and OBJ relations in that language. A similar pattern can be observed for Spanish, French and Italian. More detailed statistics, including other relations and performance with respect to head-directionality, can be found in the supplementary material.
Non-delexicalised Baseline. We also compare our delexicalised model with the non-delexicalised baseline: ∆ in Table 2 shows the difference in performance between the two models.
Overall, the scores favour the delexicalised approach (negative delta in the all deprels column for all languages), supporting the results given by the automatic metric. However, for some dependency relations, the lexicalised baseline shows the usefulness of word information, for example, when predicting AMOD relations for Romance languages (positive delta for French, Italian, Spanish, and Portuguese). Indeed, preposed adjectives in those languages constitute a limited lexical group.

Table 4: Contraction Generation Results (BLEU scores). S−c/S+c: a sentence without contractions vs. a reference sentence including contractions; S−c: BLEU with respect to sentences before contractions; S+c: BLEU with respect to a reference sentence. The scores were computed on detokenised sequences.

Morphological Realisation
Table 3 shows the results for the WO+MR model. The top line (MR Accuracy) indicates the accuracy of the MR model on the SR'18 test data, which is computed by comparing its output with gold word forms. As the table shows, the accuracy is very high overall, ranging from 87.6 to 99.07, with 9 of the 10 languages having an accuracy above 90. This confirms the high accuracy of the model when performing morphological inflection out of context.
The third line (WO+MR (S−c)) shows the BLEU scores for our WO+MR model, i.e., when the MR model is applied to the output of the WO model. Here we use an oracle setting which ignores contractions. That is, we compare the WO+MR output not with the final sentence but with the sentence before contractions apply (the ability to handle contractions is investigated in the next section).
As the table shows, the delta in BLEU scores between the model with (WO+MR) and without (WO) morphological realisation mirrors the accuracy of the morphological realisation model: as the accuracy of the morphological inflection model decreases, the delta increases. For instance, for Arabic, the MR accuracy is among the lowest (91.05) and, correspondingly, the decrease in BLEU score when going from word ordering to word ordering with morphological realisation is the largest (-6.3).

Contraction Generation
To assess the degree to which contractions are used, we compute BLEU-4 between the gold sequence of word forms from the UD treebanks and the reference sentence (Table 4, line S−c/S+c). As the table shows, this BLEU score is very low for some languages (Arabic, French, Italian, Portuguese), indicating a high level of contractions.
These differences are reflected in the results of our WO+MR model: the higher the level of contractions, the larger the delta between the BLEU score on the reference sentence without contractions (WO+MR, S−c) and the reference sentence with contractions (WO+MR, S+c).
This shows the limits of out-of-context morphological realisation. While the model is good at producing a word form given its lemma and a set of morpho-syntactic features, the lack of contextual information means that contractions cannot be handled.
Adding a contraction module improves results for those languages where contraction is frequent (Table 4, lines WO+MR+C_reg and WO+MR+C_s2s). Gains range from +5 points for French to +12 points for Portuguese when compared to WO+MR. We achieved better results with the contraction module based on regular expressions (C_reg) than with the neural module (C_s2s). In a relatively simple task such as contraction generation, rule-based methods are more reliable and, overall, preferable due to their robustness and ease of repair compared to neural models, which may, for instance, hallucinate incorrect content.

Global Evaluation
Finally, we compare our approach with the best results obtained by the SR'18 participants and with OSU's results (King and White, 2018) using BLEU-4, DIST and NIST scores. OSU results are treated separately, since some of their scores were published after the shared task had ended.

Table 5: BLEU, DIST and NIST scores on the SR'18 test data (shallow track). SR'18 gives the official results of the shared task, excluding the OSU scores, which are given in the line below. We also excluded the ADAPT and NILC scores, as they were obtained using data augmentation. OSU is the submission of King and White (2018).
The languages for which our model outperforms the state of the art are languages for which the WO model performs best, the accuracy of the morphological realiser is high and the level of contractions is low. For those languages, improving the accuracy of the word ordering model would further improve results.
For four of the languages where the model underperforms (namely, Arabic, French, Portuguese and Italian), the level of contraction is high. This indicates that improvements can be gained by improving the handling of contractions, e.g., by learning a joint model that would take into account both morphological inflection and contraction.

Conclusion
While surface realisation is a key component of NLG applications, most work in this domain has focused on the development of language-specific models. By providing multilingual training and test sets, the SR'18 shared task opens up the possibility to investigate how language-specific properties such as word order and morphological variation impact performance.
In this paper, we presented a modular approach to surface realisation and applied it to the ten languages of the SR'18 shallow track.
For word ordering, we proposed a simple approach where the data is delexicalised, the input tree is linearised using depth-first search, and the mapping between the input tree and the output lemma sequence is learned using a factored sequence-to-sequence model. Experimental results show that full delexicalisation markedly improves performance. Linguistically, this confirms the intuition that the mapping between shallow dependency structure and word order can be learned independently of the specific words involved.
We further carried out a detailed evaluation of how our word ordering model performs on the ten languages of the SR'18 shallow track. While differences in annotation consistency, the number of dependency relations 10 and the frequency counts for each dependency relation in each dataset make it difficult to conclude anything from the differences in overall scores between languages, the evaluation of head/dependent word ordering constraints highlighted the fact that long-distance relations, such as ACL, and irregular word ordering constraints (e.g., the position of the verb in Dutch main and subordinate clauses) negatively impact results.
For morphological realisation and contractions, we showed that applying morphological realisation out of context, as is done by most of the SR'18 participating systems 11 , yields poor results for those languages (Portuguese, French, Arabic, Italian) where contractions are frequent. We explored two ways of handling contractions (a neural sequence-to-sequence model and a rule-based model) and showed that adding contraction handling strongly improves performance (a +5.57 to +12.13 increase in BLEU score for the rule-based model, depending on the language). More generally, our work on contractions points to the need for SR models to better take into account the fine-grained structure of words. For instance, in French, the article is elided (le → l') when the following word starts with a vowel. In future work, we plan to explore the development of a joint model that simultaneously handles morphological realisation and word ordering while using finer-grained word representations, such as fastText embeddings (Bojanowski et al., 2017) or byte pair encoding (BPE; Gage, 1994; Sennrich et al., 2016).