AX Semantics’ Submission to the Surface Realization Shared Task 2018

In this paper we describe our system and experimental results on the development set of the Surface Realisation Shared Task. Our system is an entry for the Shallow Task; it combines two different deep-learning models for ordering the words of a sentence with a rule-based morphology component.


Introduction
This paper describes our approach for the First Multilingual Surface Realisation Shared Task (Mille et al., 2018). For the shallow task, the dependency parse trees were given unordered and the words lemmatized. The objective was to order the words of each sentence and to inflect the given lemmas. The data was provided in 10 languages: English, Spanish, French, Portuguese, Italian, Dutch, Czech, Russian, Arabic, and Finnish. Our aim was to build new deep-learning-based ordering systems, augmented by our already implemented (rule-based) morphology for the inflection part. System 1 implements our initial idea; System 2 was developed after System 1 yielded mediocre results.
Final scoring for the MSR shared task used System 2.

Linearization
We propose two systems. Both are implemented using Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2016), and both are trained on each language from the CoNLL data sets separately as well as on all languages combined. The two systems differ in their internal models (see the following two sections).
To generate training data, the given CoNLL training data sets were matched to their corresponding original data using tree-based matching. Each node was compared based on deprel, lemma/form, upostag, and number of children, traversing the tree recursively from top to bottom.
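The matching step described above can be sketched as follows; this is a minimal illustration with an assumed node structure, not our actual preprocessing code:

```python
# Hypothetical sketch of the tree-based matching described above: two
# dependency nodes match when deprel, lemma, and upostag agree and they
# have the same number of children, checked recursively from the root
# down. `Node` is an assumed minimal structure, not the real CoNLL record.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    deprel: str
    lemma: str
    upostag: str
    children: List["Node"] = field(default_factory=list)

def match(a: Node, b: Node) -> bool:
    """Recursively match two dependency subtrees top-down."""
    if (a.deprel, a.lemma, a.upostag) != (b.deprel, b.lemma, b.upostag):
        return False
    if len(a.children) != len(b.children):
        return False
    # Children in the unordered tree may appear in any order, so pair
    # each child of `a` with some as-yet-unmatched child of `b`.
    unmatched = list(b.children)
    for ca in a.children:
        for i, cb in enumerate(unmatched):
            if match(ca, cb):
                del unmatched[i]
                break
        else:
            return False
    return True
```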

System 1: Sequence-to-Sequence Model
System 1 is a new approach using sequence-to-sequence models (Vinyals et al., 2016) with an encoder-decoder architecture and attention as described in Bahdanau et al. (2014) for machine translation. Instead of LSTM cells we used bidirectional GRU cells, since some early-stage evaluations showed that GRU converges better than LSTM for this task.
The input sequence is an unordered list of words and their features; the features for each word consist of: id, upostag, deprel, head-id, head-upostag, head-deprel, and level in the syntax-tree. All features are encoded in embeddings. The embeddings are shared between the two matching fields (i.e. deprel and head-deprel). Figure 1 shows a visualization of the model.
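The embedding sharing between matching fields can be illustrated in plain NumPy (not our actual Keras code): both deprel and head-deprel index into the same table, so a label such as "nsubj" maps to the same vector regardless of which field it appears in.

```python
# Plain-NumPy illustration of sharing one embedding table between the
# matching fields deprel and head-deprel. Vocabulary and dimensions are
# made up for the example; the real model uses Keras embedding layers.
import numpy as np

rng = np.random.default_rng(0)
deprel_vocab = {"root": 0, "nsubj": 1, "obj": 2, "det": 3}
emb_dim = 8
deprel_emb = rng.normal(size=(len(deprel_vocab), emb_dim))  # one shared table

def encode_word(deprel: str, head_deprel: str) -> np.ndarray:
    """Concatenate the shared embeddings of the word's own deprel and
    its head's deprel into one feature vector."""
    return np.concatenate([deprel_emb[deprel_vocab[deprel]],
                           deprel_emb[deprel_vocab[head_deprel]]])

vec = encode_word("nsubj", "root")  # shape (2 * emb_dim,)
```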
The result of the sequence model is a sequence of correct positions of the words for a complete sentence. This order, together with the given lemma and features from the data set, is then processed by a morphology component, which also takes care of building the "final readable sentence" including e.g. capitalization.
We trained two sub-models for each language, with sequence lengths of 25 and 400. We chose these values based on the length of the sentences in the training data set: the 75% quantile is at length 25, which covers most of the sentences, and 400 is the absolute maximal length (the longest sentence, in Arabic, has 398 words). Sequences shorter than the maximum of a model were zero-padded. Both sub-models are then available for the prediction phase, during which the model is chosen as the smallest one whose sequence length still fits the input sentence. The predicted sequence probabilities are selected so that every word appears only once in the final sentence.
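One simple way to realize the uniqueness constraint on the predicted sequence is a greedy pass over the position probabilities; this is a minimal sketch of such a decoding step, not necessarily our exact decoding code:

```python
# Greedy decoding sketch: at every output position, emit the
# highest-probability word that has not been emitted yet, so that each
# word appears exactly once in the final order.
import numpy as np

def decode_unique(probs: np.ndarray) -> list:
    """probs[t, w] = predicted probability of word w at output position t."""
    n_steps, _ = probs.shape
    used, order = set(), []
    for t in range(n_steps):
        # Rank candidates for this step, best first, and skip used words.
        for w in np.argsort(-probs[t]):
            if int(w) not in used:
                used.add(int(w))
                order.append(int(w))
                break
    return order

probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],   # word 0 already taken -> word 1
                  [0.5, 0.4, 0.1]])  # words 0, 1 taken -> word 2
assert decode_unique(probs) == [0, 1, 2]
```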
Automatic evaluation on the dev set resulted in the BLEU and DIST scores given in table 1. We used the evaluation code provided by the shared task organizers. This evaluation step includes the morphology described in section 3. We evaluated both the model matching the language and a model trained with all languages.

System 2: Pairwise Classification
The second system is a classification model that derives the word order by estimating whether word1 is to the right of word2. Each word of a sentence is compared against every other word in the same sentence. The features used in training for each of the two words are upostag, deprel, head-upostag, head-deprel, and level in the syntax tree. As in System 1, the embeddings are shared between the two matching fields. The predicted word1-is-right-of-word2 probabilities are used to find the order within each subtree. On the next level, the subtree is ordered by the probability of its head node.
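One way to turn such pairwise probabilities into an order for a subtree is to score each word by how many of its siblings it is predicted to stand to the right of, then sort by that score. This is a hedged sketch of the idea, not necessarily the exact procedure we used:

```python
# Sketch: convert a matrix of pairwise "word i is right of word j"
# probabilities into a left-to-right order for one subtree by sorting
# words on their summed rightness score (few words to the left ->
# early position).
import numpy as np

def order_subtree(p_right: np.ndarray) -> list:
    """p_right[i, j] = P(word i is right of word j); diagonal ignored."""
    n = p_right.shape[0]
    scores = [sum(p_right[i, j] for j in range(n) if j != i)
              for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i])

# Word 2 is predicted left of both others, word 0 right of both.
p = np.array([[0.0, 0.8, 0.9],
              [0.2, 0.0, 0.7],
              [0.1, 0.3, 0.0]])
assert order_subtree(p) == [2, 1, 0]
```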

Morphologization
The morphology step employs the NLG system from AX Semantics (Weißgraeber and Madsack, 2017). That system is rule-based and for each inflection request it runs through a decision chain, in which all parts of speech and corresponding grammatical features of the specific languages are implemented.
For irregular words the AX Semantics NLG system uses lexicon entries, which always supersede the rule-based inflection. Grammatical features like number, case, animacy, and tense are implemented in a general way and then added to each language alongside its individual configuration.
Since the CoNLL features differ from our usual input parameters, some preprocessing was necessary to map the terms accordingly. The words were also cleaned of special characters such as hashtags or diacritics before being processed by the NLG morphology component.
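The preprocessing amounts to a feature translation plus a cleanup pass. The sketch below is purely illustrative: the mapping table and parameter names are hypothetical, not the actual AX Semantics interface.

```python
# Illustrative mapping from CoNLL morphological features to the kind of
# parameter names an NLG morphology component might expect, plus the
# sort of cleaning applied before inflection. All names are hypothetical.
import unicodedata

FEATURE_MAP = {  # hypothetical correspondence table
    "Number=Sing": {"number": "singular"},
    "Number=Plur": {"number": "plural"},
    "Case=Nom":    {"case": "nominative"},
    "Case=Acc":    {"case": "accusative"},
}

def map_features(conll_feats: str) -> dict:
    """Translate a CoNLL FEATS string like 'Case=Nom|Number=Sing'."""
    params = {}
    for feat in conll_feats.split("|"):
        params.update(FEATURE_MAP.get(feat, {}))
    return params

def clean_lemma(lemma: str) -> str:
    """Strip leading hashtags and combining diacritics before inflection."""
    lemma = lemma.lstrip("#")
    decomposed = unicodedata.normalize("NFD", lemma)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
```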
The accuracy of the morphology component was tested separately on the dev set for each language. Results are summarized in table 4. Most of the languages show a decent accuracy of over 90%, whereas Arabic and Finnish, with their more complicated morphology, still achieve around 80%.
The table also shows that for some languages the accuracy scores for verbs are significantly lower than for nouns or adjectives. In the case of Dutch, for example, this happens mainly because a given lemma is not the infinitive form expected by our system but a finite verb form (3rd person singular), and first has to be transformed into the infinitive. This can largely be attributed to the specialization of the system for the language of commerce, which results in a partial under-coverage of certain language features for edge cases. We expect coverage to increase as usage expands to more fields. Furthermore, some of the errors are due to the data being erroneous or incomplete (e.g., only case is given, when number and animacy would also be needed).

Conclusion and Future Work
On the whole, none of the systems solve the task satisfactorily.
System 2 shows better scores and somewhat improved readability compared to System 1. See table 3 for illustration.
In both linearization systems, we use neither the lemma nor an embedding of the lemma, to allow a comparison between the per-language models and the ALL-language model. This serves as a baseline for comparison against systems where language-specific features can be added.
Our focus for this workshop was to build a linearization system that is simple, receives no topic-specific or language-specific input data or configuration, and does not require a neural network for morphologization. For pure morphologization tasks, especially for Finnish, Arabic, and Hungarian with their large number of very rare cases, we will improve inflection by adding an NN-based morphology component as well.