Surface Realization Shared Task 2019 (SR'19): The Team 6 Approach

This study describes the approach developed by the Tilburg University team for the shallow track of the Multilingual Surface Realization Shared Task 2019 (SR'19) (Mille et al., 2019). Based on Ferreira et al. (2017) and on our 2018 submission (Ferreira et al., 2018), the approach generates texts by first preprocessing an input dependency tree into an ordered linearized string, which is then realized using a rule-based and a statistical machine translation (SMT) model. This year our submission realizes texts in all 11 languages proposed for the task, in contrast to last year's submission, which covered only 6 Indo-European languages. The model is publicly available.


Introduction
This study presents the approach developed by the Tilburg University team for the shallow track of the Multilingual Surface Realization Shared Task 2019 (SR'19) (Mille et al., 2019). Given a lemmatized dependency tree without word order information, the goal of the task is to linearize the lemmas in the correct order and to realize them as a surface string with the proper morphological forms.
Our approach is similar to our submission for the 2018 version of the shared task (Ferreira et al., 2018). It is based on the surface realization approach described in Ferreira et al. (2017), where a semantic graph structure is first preprocessed into a preordered linearized form, which is subsequently converted into text using a Statistical Machine Translation (SMT) model implemented in Moses (Koehn et al., 2007). The difference is that, instead of a semantic structure, our approach preprocesses the lemmas of the dependency tree into an ordered linearized version, which is then converted into text using rules and an SMT model. In contrast to our last submission, which covered only some of the proposed languages (6 out of 10), this year our approach generates text in all 11 languages proposed in the shared task: Arabic, Chinese, English, French, Hindi, Indonesian, Japanese, Korean, Portuguese, Russian and Spanish. For these languages, parallel datasets were provided with alignment information between source and target sides.
Regarding the languages covered in the previous version of the shared task, our submission yielded promising results for English, French, Portuguese and Spanish, with BLEU scores above 40. For the newly covered languages, results are promising for Hindi and Indonesian, with BLEU scores above 50. However, the approach performed poorly for Arabic and Russian, and had difficulty generating texts in the Asian languages Chinese, Japanese and Korean.
In the remainder of this paper, we describe our method in more detail: Section 2 presents the general approach, Section 3 reports and discusses the results, and Section 4 concludes the study and outlines future work to improve the model.

Model
Following last year's submission (Ferreira et al., 2018), our model is based on the NLG approach introduced in Ferreira et al. (2017), where a semantic graph structure is first preprocessed into a preordered linearized form, which is then converted into its textual counterpart using an SMT model implemented with Moses. However, for this task, instead of a semantic structure, our approach takes as input a lemmatized dependency tree, which is linearized and then realized by a rule-based and an SMT model. In the next sections, we explain the linearization and realization phases in more detail.

Linearization
This method linearizes a dependency tree input without punctuation nodes into an ordered string. Our approach is similar to the 2-step classifier introduced in Ferreira et al. (2017) and is depicted in Algorithm 1.
The approach starts by deciding which first-order child nodes are most likely to occur before and which after their head node (lines 1-13). It uses a maximum entropy classifier φ1, trained for each language on the corresponding aligned training set. As features, this classifier uses the lemmas as well as the dependency and part-of-speech tags of the head and child nodes.
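The before/after decision can be sketched as follows. This is a minimal illustration, not the submitted system: the feature names (`head_lemma`, `deprel`, etc.) and the toy training pairs are assumptions, and a maximum entropy classifier is instantiated here as scikit-learn's multinomial logistic regression.

```python
# Sketch of the before/after classifier phi_1 over (head, child) pairs.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training instances: one per (head, child) pair, labelled with
# whether the child precedes ("before") or follows ("after") its head.
train_pairs = [
    {"head_lemma": "eat", "head_pos": "VERB", "child_lemma": "dog",
     "child_pos": "NOUN", "deprel": "nsubj"},
    {"head_lemma": "eat", "head_pos": "VERB", "child_lemma": "bone",
     "child_pos": "NOUN", "deprel": "obj"},
]
labels = ["before", "after"]

# Maximum entropy classifier = multinomial logistic regression
# over one-hot encoded lemma/POS/deprel features.
phi1 = make_pipeline(DictVectorizer(), LogisticRegression())
phi1.fit(train_pairs, labels)

new_pair = {"head_lemma": "eat", "head_pos": "VERB",
            "child_lemma": "cat", "child_pos": "NOUN", "deprel": "nsubj"}
print(phi1.predict([new_pair])[0])
```

In the real model, one such classifier is trained per language on the aligned training set.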
Once the nodes are split into a group of nodes before and another group of nodes after their heads, each of these groups is ordered with an algorithm similar to MergeSort (lines 14-24 and function SORT). To decide the order of two child nodes of the same group, we use a second maximum entropy classifier φ2, also trained for each language on the corresponding aligned training set. As features (line 44), it uses the lemmas as well as the dependency and part-of-speech tags of the head and of the two child nodes involved in each comparison.
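The group ordering described above can be illustrated as a merge sort whose comparison is delegated to the pairwise classifier. In this sketch φ2 is stubbed as an alphabetical comparison on lemmas; in the actual model it is a trained maximum entropy classifier over lemma/POS/deprel features of the head and the two children.

```python
def phi2(head, a, b):
    """Return True if child `a` should precede child `b` under `head`.
    Stand-in for the trained pairwise classifier."""
    return a["lemma"] <= b["lemma"]

def merge_sort(head, nodes):
    # Standard merge sort, except the comparison is a classifier call.
    if len(nodes) <= 1:
        return nodes
    mid = len(nodes) // 2
    left = merge_sort(head, nodes[:mid])
    right = merge_sort(head, nodes[mid:])
    merged = []
    while left and right:
        if phi2(head, left[0], right[0]):
            merged.append(left.pop(0))
        else:
            merged.append(right.pop(0))
    return merged + left + right

head = {"lemma": "eat"}
children = [{"lemma": "quickly"}, {"lemma": "bone"}, {"lemma": "dog"}]
print([c["lemma"] for c in merge_sort(head, children)])
# -> ['bone', 'dog', 'quickly']
```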

Realization
Once the dependency trees are linearized, two methods are used to realize the lemmas as surface forms: a rule-based model and a statistical machine translation (SMT) model.

Rule-based
For all 11 covered languages, this approach uses a lexicon built from the aligned information extracted from the datasets. Given a lemma and its features, the approach looks up the most frequent morphological form in the lexicon.
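The frequency-based lexicon can be sketched as a counter over (lemma, features) keys; the example entries and the back-off-to-lemma behaviour for unseen keys are illustrative assumptions.

```python
from collections import Counter, defaultdict

# (lemma, features) -> Counter of observed surface forms
lexicon = defaultdict(Counter)

def observe(lemma, feats, form):
    """Record one aligned (lemma, features) -> surface form occurrence."""
    lexicon[(lemma, feats)][form] += 1

def realize(lemma, feats):
    """Return the most frequent observed form; fall back to the lemma."""
    forms = lexicon.get((lemma, feats))
    if forms:
        return forms.most_common(1)[0][0]
    return lemma

observe("be", "Mood=Ind|Number=Sing|Tense=Pres", "is")
observe("be", "Mood=Ind|Number=Sing|Tense=Pres", "is")
observe("be", "Mood=Ind|Number=Sing|Tense=Pres", "'s")
print(realize("be", "Mood=Ind|Number=Sing|Tense=Pres"))  # -> is
```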

Statistical machine translation
The settings were copied from the statistical MT system introduced in Ferreira et al. (2017). At training time, we extract and score phrases up to a length of nine tokens. As feature functions, we used direct and inverse phrase translation probabilities and lexical weighting, as well as word, unknown word and phrase penalties. These feature functions were trained using alignments from the training set obtained with MGIZA (Gao and Vogel, 2008). Model weights were tuned on the development data using batch MIRA (Cherry and Foster, 2012) with BLEU as the evaluation metric. A distortion limit of 6 was used for the reordering models. We used two lexicalized reordering models: a phrase-level one (phrase-msd-bidirectional-fe) and a hierarchical one (hier-mslr-bidirectional-fe) (Galley and Manning, 2008). At decoding time, we used a stack size of 1000. To rerank the candidate texts, we used a 5-gram language model trained on the EuroParl corpus (Koehn, 2005) using KenLM (Heafield, 2011).
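The final reranking step can be illustrated with a toy language model. The submission used a 5-gram KenLM model; this stand-in uses a pure-Python add-alpha smoothed bigram model (the toy corpus and candidates are assumptions) and only shows the mechanics of picking the highest-scoring candidate.

```python
import math
from collections import Counter

# Toy "training corpus" for the language model.
corpus = ["the dog eats the bone", "the cat sleeps"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def logprob(sentence, alpha=0.1):
    """Add-alpha smoothed bigram log probability of a sentence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams)
    score = 0.0
    for prev, cur in zip(toks, toks[1:]):
        score += math.log((bigrams[(prev, cur)] + alpha)
                          / (unigrams[prev] + alpha * vocab))
    return score

# Rerank candidate realizations: keep the most fluent one.
candidates = ["the dog eats the bone", "dog the the eats bone"]
best = max(candidates, key=logprob)
print(best)  # -> the dog eats the bone
```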

Results and Discussion
Concerning the languages covered in the previous version of the shared task, our approach yielded promising results for English, French, Portuguese and Spanish, with BLEU scores above 40. For the newly covered languages, results were promising for Hindi and Indonesian, with BLEU scores above 50. On the other hand, our approach obtained low scores for Arabic and Russian, and had difficulty generating texts in the Asian languages Chinese, Japanese and Korean. For Chinese and Japanese, the problem arose because we did not handle the tokenization/detokenization process well, which had a drastic negative influence on the final results.

Conclusion
This study described a shallow surface realizer for the 11 target languages in the Surface Realization Shared Task 2019 (SR'19). In future work, we aim to fix the problems with the Asian languages Chinese, Japanese and Korean. Specifically, for Chinese and Japanese, we need a proper method to tokenize/detokenize the output produced by our approach. Moreover, we aim to address the task with novel pipeline architectures for Natural Language Generation (Ferreira et al., 2019).