BinLin: A Simple Method of Dependency Tree Linearization

The Surface Realization Shared Task 2018 is a workshop on generating sentences from lemmatized sets of dependency triples. This paper describes the results of our participation in the challenge. We develop a data-driven pipeline system which first orders the lemmas and then inflects the words to finish the surface realization process. Our contribution is a novel sequential method of ordering lemmas which, despite its simplicity, achieves promising results. We demonstrate the effectiveness of the proposed approach, describe its limitations and outline ways to improve it.


Introduction
Natural Language Generation (NLG) is the task of generating natural language utterances from textual inputs or structured data representations. For many years, one of the research foci in the NLG community has been Surface Realization (SR): the process of transforming a sentence plan into a linearly ordered, grammatical string of morphologically inflected words (Langkilde-Geary, 2002).
The SR Shared Task is aimed at developing a common input representation that could be used by a variety of NLG systems to generate realizations from (Belz et al., 2011). In the case of the Surface Realization Shared Task 2018 (Mille et al., 2018), there are two different representations the contestants can use, depending on the track they participate in:
• Shallow Track: unordered dependency trees consisting of lemmatized nodes with part-of-speech (POS) tags and morphological information as found in the Universal Dependencies (UD) annotations, version 2.0 (http://universaldependencies.org/).
• Deep Track: same as above, but with functional words and morphological features removed.
We participated in the shallow track, and therefore our task was to generate a sentence by ordering the lemmas and inflecting them to the correct surface forms. The outputs of the participating systems are assessed using both automatic and manual evaluation. The former is performed by computing BLEU (Papineni et al., 2002), NIST (Doddington, 2002), CIDEr (Vedantam et al., 2015) scores and normalized string edit distance (EDIST) between the reference sentence and a system output. Manual evaluation is based on preference judgments: third-year undergraduate students from Cambridge, Oxford and Edinburgh rate pairs of candidate outputs (including the target sentence), scoring them for Clarity, Fluency and Meaning Similarity.
The data used for the task is the UD treebanks distributed in the 10-column CoNLL-U format. The data is available for Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish. According to the requirements of the Shallow Track, the information on word order was removed by randomly scrambling the token sequence; the words were also replaced by their lemmas.
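A minimal reading of the CoNLL-U format can be sketched as follows. The field names follow the CoNLL-U specification; the two-token sample sentence is a hypothetical illustration, not shared-task data:

```python
# Minimal sketch: parse 10-column CoNLL-U lines into token dicts.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Yield one list of token dicts per sentence block."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            if sentence:
                yield sentence
                sentence = []
            continue
        token = dict(zip(FIELDS, line.split("\t")))
        # Multi-word token ranges ("3-4") and empty nodes ("3.1")
        # are skipped, mirroring the filtering described in Section 3.
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:
        yield sentence

# Hypothetical two-token sentence for illustration.
sample = ("1\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
          "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n")
tokens = next(parse_conllu(sample))
```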
Our contribution is a simple method of dependency tree linearization which orders a bag of lemmas based on the available syntactic information. The major limitation of the method is its input order sensitivity; solving this problem is reserved for future work.
Our paper has the following structure. Section 2 describes related work done in the past. Section 3 presents the results of the exploratory data analysis conducted prior to system development. The details of our system architecture are specified in Section 4 which is followed by the description of the experimental setup and evaluation (Section 5). Section 6 mentions the limitations of the proposed surface realization method and outlines future work directions.

Related Work
As mentioned in Section 1, the task at hand is to generate a sentence by ordering the lemmas and inflecting them to the correct surface forms. Past research work proposed both joint and pipeline solutions for the problem. Taking into consideration the pipeline nature of our system, we separate the related work stage-wise.

Syntactic Ordering
Given a bag of input words, a syntactic ordering algorithm constructs an output sentence. Prior work explored a range of approaches to syntactic ordering: grammar-based methods (Elhadad and Robin, 1992; Carroll et al., 1999; White et al., 2007), generate-and-rerank approaches (Bangalore and Rambow, 2000; Langkilde-Geary, 2002), tree linearization using probabilistic language models (Guo et al., 2008), inter alia. Depending on how much syntactic information is available as input, the research on syntactic ordering can be categorized into (1) free word ordering, (2) full tree linearization and (3) partial tree linearization (Liu et al., 2015). The setup of the Surface Realization Task corresponds to the full tree linearization case, since the dependency tree information is provided.
Conceptually, the problem of tree linearization is simple. However, given no constraints, the search space is exponential in the number of tokens, which makes exhaustive search intractable. This stimulated a line of research focusing on the development of approximate search methods. The current state of the art (evaluated on the English data only) belongs to the system of Puduppully et al. The authors treated the language generation process as a generalized form of dependency parsing with unordered token sequences, and used the learning and search framework of Zhang and Clark (2011) to keep the decoding process tractable. A similar approach to dependency tree linearization was explored by Bohnet et al. (2010), who approximated exact decoding with a beam search. Our method of syntactic ordering is also based on search approximation, but follows a different approach: we use a greedy search strategy, but restrict the scoring procedure to a smaller set of plausible candidate pairs, which speeds up the search procedure and reduces the number of mistakes the system might make.

Word Inflection
Word inflection in the context of the Surface Realization Task can be defined as the subtask of generating a surface form (was) from a given source lemma (be) and additional morphological/syntactic attributes (Number=Sing,Person=3,Tense=Past).
Early work proposed to approach the task with finite state transducers (Koskenniemi, 1983; Kaplan and Kay, 1994). While being accurate, these systems require a lot of time and linguistic expertise to construct and maintain. With the advance of machine learning, the community mostly shifted towards data-driven methods of automatic morphological paradigm induction and string transduction as the method of morphological inflection generation (Yarowsky and Wicentowski, 2000; Wicentowski, 2004; Dreyer and Eisner, 2011; Durrett and DeNero, 2013; Ahlberg et al., 2015). In comparison with their rule-based counterparts, these approaches scale better across languages and domains, but require comprehensive, manually defined feature representations of the inputs.
Current research focuses on data-driven models which learn a high-dimensional feature representation of the input data during the optimization procedure in an end-to-end fashion. Recent work (Faruqui et al., 2016) proposed to model the problem as a sequence-to-sequence learning task, using the encoder-decoder neural network architecture developed in the machine translation community (Cho et al., 2014; Sutskever et al., 2014). This approach showed an improvement over conventional machine learning models, but failed to address the issue of poor sample complexity of complex neural networks: in practice, the approach did not perform well on low-resource or morphologically rich languages.
An attempt to address this issue was made by Aharoni and Goldberg (2017), who proposed to directly model an almost monotonic alignment between the input and output character sequences by using a controllable hard attention mechanism which allows the network to jointly align and transduce, while maintaining a focused representation at each step.

Table 1: Per-language counts of unique morphological features and out-of-vocabulary (OOV) lemmas, forms and characters.

Property          ar    cs    en    es    fi    fr    it   nl    pt   ru
unique features   37    112   36    56    89    35    41   66    48   40
OOV lemmas        1056  3299  1180  1368  1598  1895  439  973   535  2723
OOV forms         1745  8070  1313  2131  3666  2387  683  1131  785  8190
OOV chars         0     2     3     1     5     12    2    0     0    0

Data Analysis
For the input to the shallow track, the organizers separated the reference sentences from the respective structures. Although the one-to-one correspondence between sentences and dependency trees was preserved, the alignment between the lemmas in the trees and the word forms in the sentences was lost. To circumvent this issue and ease the burden of aligning lemmas with the corresponding surface forms, we decided to use the original UD data files for all our experiments: they contain the same dependency trees as the shared task data, but the order of the tokens is not scrambled and each surface form is aligned with the respective lemma.
Prior to system development, we analyzed the data along the dimensions which we deemed relevant for the task. Due to space constraints here we show figures and numbers mainly for English; the analysis results for other languages can be found in Appendix A.1.
First, we examined the lemma-to-form ratio (Figure 1). The majority of lemmas have only one surface form, which suggests a strong majority baseline for the morphological inflection subtask. However, languages with rich morphology (Czech, Finnish, Russian) pose a challenge in this regard and call for a more elaborate approach which takes into account complex grammatical inflection paradigms. The number of unique features (values in the FEAT column of the input data) served as a rough estimate of the latter (Table 1). We have not performed any language-specific engineering to address these linguistic properties, but took them into consideration for future work.

Figure 1: Lemma-to-form ratio (English); min = 1, max = 18, mean = 1.24, std = 0.64.
Another important data property is the length distribution of lemmas, surface forms and sentences. We computed the training data statistics and used the obtained estimates to establish cut-off thresholds for filtering out outlier lemmas and forms from the training data.
The number of out-of-vocabulary (OOV) language units can be viewed as a crude measure of the expected difference between training and development data distributions. Table 1 shows the number of OOV lemmas, surface forms and characters for each of the languages. Some of the datasets included foreign names and terms which are used in their original language forms. For example, out of 356464 French data tokens, 419 include characters that are not digits, punctuation signs or letters of the French alphabet. Since such words are usually not conjugated, but copied verbatim, we consider them as outliers and exclude them from the training procedure. Finally, tokens defined in the UD annotation guidelines as multi-word expressions (MWE) and empty nodes were excluded from the training data, because they require language-specific treatment (e.g., the French data includes 9750 tokens which were identified as MWE; out of 870033 tokens in the Russian dataset, 1092 correspond to empty nodes).
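The foreign-character filter described above can be sketched as follows; the letter set is an illustrative assumption rather than the exact alphabet definition used in our experiments:

```python
import string

# Sketch: flag tokens containing characters outside digits,
# punctuation and the (assumed) French letter inventory.
FRENCH_LETTERS = set("abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ")
ALLOWED = FRENCH_LETTERS | set(string.digits) | set(string.punctuation)

def is_outlier(token):
    # Case-insensitive check: any character outside the allowed set
    # marks the token as a foreign-form outlier.
    return any(ch.lower() not in ALLOWED for ch in token)

flags = [is_outlier(t) for t in ["château", "1998", "Škoda"]]  # last is flagged
```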
When approaching the task of syntactic ordering, one needs to take into account the complexity of the tree structures. We found the branching factor to be very informative in this regard: for each node in each tree we counted the number of children the node has. Most nodes in the dependency trees of all examined languages have one to three children (Figure 2 shows the distribution of branching factor values for English; min = 0, max = 19, mean = 0.94, std = 1.67). This solicits decomposition of the syntactic ordering procedure over subtrees, similar to what was done in (He et al., 2009).
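Counting children per node from CoNLL-U head indices can be sketched as follows; the hand-built head list is a hypothetical parse used only for illustration:

```python
from collections import Counter

# Sketch: branching factor = number of children of each node.
# heads[i] is the 1-based head index of token i+1; 0 marks the root.
def branching_factors(heads):
    counts = Counter(h for h in heads if h != 0)
    # Nodes that never appear as a head have zero children.
    return [counts.get(i, 0) for i in range(1, len(heads) + 1)]

# Hypothetical parse of a 9-token sentence.
heads = [2, 5, 4, 2, 0, 8, 8, 5, 5]
factors = branching_factors(heads)   # most nodes have 0-3 children
```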

Our Approach
This section describes the approach we developed for the shared task.
Given a dependency tree, we first decompose it into subtrees, each having one head and an arbitrary number of children. Each subtree is linearized as follows: for each of the children nodes we predict whether it should be positioned to the left or to the right of the head node, and store this positional information in a binary tree structure. We move up the original tree, linearizing subtrees until we reach the root node. At this point we have processed all nodes from the original dependency tree; it can now be completely linearized by traversing the binary tree with the root as a head node.
Since each dependency node is labeled with the corresponding lemma, it is trivial to obtain a lemma sequence from the linearized dependency tree. We further use the morphological inflection generator component to predict a surface form for each lemma in the sequence and in this way generate a sentence.

Syntactic Component
The first step of the proposed pipeline orders the nodes of the dependency tree into a sequence which ideally mirrors the order of words in the reference sentence. The main difficulty of this step is finding a sorting or ranking method which avoids making many node comparisons or scoring decisions. We propose an ordering procedure which uses a given dependency tree and constructs a binary tree storing the original dependency nodes (lemmas) in sorted order (Algorithm 1). As input, the algorithm takes a dependency tree and a classifier trained to make binary decisions of positioning child nodes to the right/left of the head node. First, we decompose the tree into local subtrees, represented by (head, children) node groups. This is achieved by running a breadth-first search (BFS) algorithm on the input dependency tree (line 4 of the pseudocode). For each (head, children) group, we further apply the following steps:
• initialize a binary tree with the head node (line 5);
• iterate over the child nodes and use the classifier to predict whether each child should be inserted to the left or to the right of the head node (lines 6-7).
When the binary tree construction is finished, we can obtain a sorted lemma sequence by performing an in-order traversal of the resulting binary tree.
The core of the procedure is the insertion of a new node into the binary tree (Algorithm 2). Given a node pair (n_i, n_j), a classifier is used to predict whether n_j should be positioned to the left or to the right of n_i. The decision is made based on the feature representation of the two nodes.

Algorithm 1: Given a dependency tree dg and a binary classifier clf, construct a binary tree and traverse it to order the dependency nodes. BFS denotes the breadth-first search procedure. (Pseudocode omitted.)
Algorithm 2: A recursive procedure of inserting a new node child into a binary tree bt, using a binary classifier clf. (Pseudocode omitted.)

For simplicity, we decided to use a multi-layer perceptron as the classifier (Figure 4).
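Since the pseudocode of Algorithms 1 and 2 is only partially reproduced above, the sketch below gives our reading of the procedure. The classifier is abstracted as a callable on a (head, child) node pair, and the hypothetical lookup table stands in for the trained model:

```python
# Sketch of Algorithms 1 and 2 (our reading): insert each child to
# the left or right of its head in a binary tree, guided by a
# classifier; an in-order traversal then yields the token order.

class BTNode:
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

def insert(bt, child, clf):
    """Algorithm 2: recursively insert `child`, asking the classifier
    at every visited node whether to descend left or right."""
    if clf(bt.value, child) == "left":
        if bt.left is None:
            bt.left = BTNode(child)
        else:
            insert(bt.left, child, clf)
    else:
        if bt.right is None:
            bt.right = BTNode(child)
        else:
            insert(bt.right, child, clf)

def linearize(head, children, clf):
    """Algorithm 1, restricted to a single (head, children) group."""
    bt = BTNode(head)
    for child in children:
        insert(bt, child, clf)
    order = []
    def inorder(node):          # in-order traversal of the binary tree
        if node is not None:
            inorder(node.left)
            order.append(node.value)
            inorder(node.right)
    inorder(bt)
    return order

# Toy stand-in for the trained classifier (hypothetical decisions).
DECISIONS = {("chased", "the"): "left", ("chased", "dog"): "left",
             ("the", "dog"): "right", ("chased", "cat"): "right"}
toy_clf = lambda head, child: DECISIONS[(head, child)]
order = linearize("chased", ["the", "dog", "cat"], toy_clf)
```

Note that the result depends on the order in which children are inserted into the tree, which is exactly the input-order sensitivity discussed later in the paper.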
Given a pair of nodes (n_i, n_j), we first extract their features. We consider the node itself, its head and one (any) child in the dependency tree as the neighborhood elements, and extract the corresponding lemmas, POS tags (both XPOS and UPOS), and dependency edge labels. Thus, the feature set for one node in the node pair consists of N = 3 (neighborhood elements) × 4 (features) = 12 components. Each component is represented as a d-dimensional embedding vector. The embedding matrix which contains all such vectors is denoted as E ∈ R^{d×|V|}, where V is the vocabulary of unique lemmas, XPOS, UPOS and dependency edge labels observed in the training data.
The embedding vectors for the two nodes under consideration are (1) concatenated to form the input to the classifier, (2) projected onto a lower-dimensional space via a linear transformation, and (3) squeezed further via another linear transformation followed by the Leaky ReLU function (Maas et al., 2013). The last layer of the network consists of one node followed by the sigmoid function. The decision of whether to insert node n_j to the right or to the left of node n_i is made by thresholding the sigmoid output at 0.5, with one side of the threshold corresponding to insertion on the right and the other to insertion on the left. The neural network components were implemented using PyTorch (Paszke et al., 2017). No pretrained embedding vectors or other external resources were used for the experiments.
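As a minimal sketch, the forward computation can be written as follows; the layer sizes, random weights and the 0.01 leak slope are illustrative assumptions (NumPy is used instead of PyTorch for brevity):

```python
import numpy as np

# Sketch of the classifier's forward pass: concatenated node-pair
# features -> linear projection -> linear projection + Leaky ReLU
# -> single sigmoid unit. All dimensions here are illustrative.
rng = np.random.default_rng(0)
d, n_feats = 8, 12              # embedding dim, features per node
in_dim = 2 * n_feats * d        # both nodes of the pair, concatenated

W1 = rng.normal(0, 0.1, (in_dim, 64))   # first projection
W2 = rng.normal(0, 0.1, (64, 16))       # second projection
w_out = rng.normal(0, 0.1, (16,))       # single output unit

def leaky_relu(x, slope=0.01):          # slope value is an assumption
    return np.where(x > 0, x, slope * x)

def predict_side(x):
    """x: concatenated embedding vector of the node pair (n_i, n_j)."""
    h = leaky_relu(x @ W1 @ W2)
    p = 1.0 / (1.0 + np.exp(-(h @ w_out)))   # sigmoid over one unit
    return "right" if p > 0.5 else "left"

side = predict_side(rng.normal(0, 1, (in_dim,)))
```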

Morphological Component
To create a sentence from an ordered sequence of lemmas, we need to predict the correct morphological form for each of them. This is the purpose of the second component of our system. While we focused mostly on the syntactic realization component, as part of the system development we experimented with the following three morphological inflection models:
• a simple multi-layer perceptron similar to the one employed for the syntactic component (MORPHMLP);
• an encoder-decoder architecture with the attention mechanism of Bahdanau et al. (2014) (MORPHRNNSOFT);
• an encoder-decoder model with hard monotonic attention (Aharoni and Goldberg, 2017) (MORPHRNNHARD).
OOV lemmas and characters were copied without any changes during decoding.

Table 2: Evaluation of the morphological inflection system component on the original UD development set, using the percentage of exact string matches as the metric. For the neural architectures, we report both case-sensitive and case-insensitive mean scores and standard deviation (averaged across ten random seed values).

Experimental Setup and Evaluation
Training data was filtered to exclude outliers according to the results of the data analysis (Section 3). The system components were trained separately ten times with different random seeds. In this section, we report mean scores and standard deviation for each model evaluated on the development data and averaged across the random seed values. The evaluation of the proposed approach was done both independently for each of the single components and as a whole in the pipeline mode. All the results are computed on the tokenized data instances.
Morphological component. We start with the evaluation of the morphological inflection generator, and report the exact string match accuracy for each of the tested approaches (Table 2). Two simple baselines were developed for the experiment: given a lemma, LEMMA copies the lemma itself as the prediction of the surface form; MAJOR outputs the most frequent surface form if the lemma is not an OOV item, and the lemma itself otherwise. Lemma-form frequencies were computed on the training data. For the baselines, we report case-insensitive scores only; the results can be easily extrapolated to the case-sensitive scenario.
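The two baselines can be sketched as follows, with hypothetical lemma-form counts standing in for the training data statistics:

```python
from collections import Counter, defaultdict

# Hypothetical (lemma, form) training pairs for illustration.
train = [("be", "is"), ("be", "was"), ("be", "is"),
         ("dog", "dogs"), ("dog", "dog")]

freq = defaultdict(Counter)
for lemma, form in train:
    freq[lemma][form] += 1

def lemma_baseline(lemma):
    return lemma                 # LEMMA: copy the lemma verbatim

def major_baseline(lemma):
    # MAJOR: most frequent observed form, or the lemma for OOV items.
    if lemma in freq:
        return freq[lemma].most_common(1)[0][0]
    return lemma

preds = [major_baseline(l) for l in ["be", "dog", "cat"]]
```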
As expected, the baselines are outperformed by all data-driven methods examined. Strong performance of the majority baseline for English and Dutch data can be attributed to the simpler morphology of the languages.
The best results are achieved by the model of Aharoni and Goldberg (2017) (MORPHRNNHARD), which outperforms all other methods across all languages. Despite the fact that the approach has a bias towards languages with concatenative morphology (due to the assumption of a monotonic alignment between the input and output character sequences), it also performs well on Arabic. This model was chosen for our further pipeline experiments.
The poor sample complexity of the soft attention model (MORPHRNNSOFT) explains its inferior performance compared to the hard attention model. The MORPHRNNSOFT model also seems to be highly sensitive to different hyperparameter values; its performance has the highest standard deviation among all models, which is most likely due to the same sample complexity issue. Interestingly enough, on the English, French, Italian and Dutch data the multi-layer perceptron architecture (MORPHMLP) achieves better results. The latter has a considerably simpler, but less flexible structure, which prohibits the use of such networks for languages with rich morphology: the number of parameters needed to account for the various forms and morphological features grows rapidly until the model can no longer fit into memory. This also highlights the importance of cross-lingual evaluation of morphological analyzers and generators.
In order to better understand the most common errors made by each of the approaches (excluding the baselines), we examined the predictions of the models on the English development set. We filtered out incorrect predictions of the capitalization of the first letter of the word, because these cases are ignored by the official evaluation protocol. After the filtering, we randomly sampled one hundred erroneous predictions and manually examined them; the results are shown in Table 3. Unlike the character-based models, MORPHMLP treats each surface form as an atomic unit and is therefore prone to errors caused by data sparsity, failing to predict correct forms for unseen lemmas or unseen grammar patterns (wrong lemma error type). If the model correctly identifies the base form and still makes a mistake, in half of the cases it is an incorrect prediction of verb tenses, singular/plural noun forms or indefinite English articles (wrong form). The latter cases are caused by the fact that our model does not use any information about the next token when predicting the form of the current lemma. This limitation is inherent to the pipeline architecture we employed and can be accounted for in a joint morphology/syntax modeling scenario. Finally, there are also cases where a model predicts an alternative surface form which does not match the ground truth, but is grammatically correct (alt. form: "not" vs. "n't", "are" vs. "'re", "have" vs. "'ve"). Strictly speaking, the latter cases are not errors, but for simplicity we treat them as such in this section.
MORPHRNNSOFT model predicts fewer wrong morphological variants, but suffers from another problem -hallucinating non-existing surface forms: "singed" instead of "sung", "dened" instead of "denied", "siaseme" vs. "siamese". This is not surprising, given the sequential nature of the model; usually this happens in cases with flat probability distributions over a number of possible characters following the already predicted character sequence. A large portion of such errors includes incorrect spellings of proper nouns (proper noun err): "Jersualm" vs. "Jerusalem", "Mconal" instead of "Mc-Donal". Finally, one prominent group of errors is that of incorrect digit sequences. MORPHMLP does not make these mistakes, because it uses a heuristic: OOV lemmas are copied verbatim as predictions of the surface forms.
The majority of erroneous cases for MOR-PHRNNHARD model constitute the group of alternative forms. Compared to other models, there are considerably fewer cases of predicting non-existent forms ("allergys", "goining"). The wrong form error type is mainly represented by incorrect predictions of verb forms: "sing" instead of "sung", "got" instead of "gotten", "are" instead of "'m", etc.
The results of the error analysis suggest that there is still a large room for improvement of the morphological inflection generation component. A principled approach to handling unseen tokens and a way to constrain the predictions to well-formed outputs would be interesting directions to investigate further.
Syntactic component. The syntactic component has been evaluated by computing system-level BLEU, NIST and edit distance scores (Table 4). Following the official evaluation protocol, output texts were normalized prior to computing metrics by lower-casing all tokens.
To the best of our knowledge, surface realization systems have not previously been evaluated on all the data used in the shared task. A simple baseline (RAND) which outputs a random permutation of the sentence tokens performs poorly across all languages. In comparison, our method orders 74.88% of the development data sentences correctly, which indicates good performance.
To get an idea of where our approach breaks, we sampled a few erroneous predictions and examined them manually. Generally speaking, the syntactic ordering procedure works well on the deeper tree levels, but as we move up, it gets harder to account for the many descendants a node has. An example of this error mode is given in Figure 5.
We tried to improve the prediction capabilities of the system by incorporating feature representations of the leftmost and rightmost descendant nodes and conditioning the model on the previous predictions, but this did not yield any improvements. Further investigation of this issue is reserved for future work.

Table 4: Evaluation of the syntactic ordering component on the original UD development set. We report mean scores and standard deviation for the SYNMLP model; the scores were averaged over ten models trained with different random seeds. RAND is the random baseline. The scores are case-insensitive.

Figure 5: A common error our syntactic ordering component makes. The node in the rectangle is the current head; the node in the oval indicates the child for which the position prediction was incorrect. The upper sentence is the gold ordering; the one below is predicted by our system.
Gold: Every move Google makes brings this particular future closer.
Predicted: Every move Google makes closer brings this particular future.
Full pipeline. Table 5 shows the metric evaluation results of the pipeline on the development and test data provided by the organizers (Dev-SR and Test-SR), as well as the development data from the original UD dataset, which was used in our preliminary experiments (Dev-UD).
Given the large gap between the system performance on Dev-SR and Dev-UD, we manually inspected the predictions and observed that the Dev-SR outputs were less grammatical than those made for the Dev-UD data. We investigated the issue and discovered that the morphological component worked as expected, but the syntactic ordering module was flawed: the proposed method's performance varies depending on the order of children nodes returned by the BFS procedure (line 4 of Algorithm 1). Figure 6 shows an example where our system fails.

Figure 6: An example sentence which poses a challenge to our system: "That 's right , folks ."
It is easier to determine the order of a node's children starting with content words and then inserting punctuation signs; the other way round, ordering the tokens becomes harder. As mentioned in Section 3, we used the original UD training and development data, which contains token information in the natural order of token occurrence in the sentences. However, in the shared task data the word order information was removed by randomly scrambling the tokens, which made it harder for the syntactic linearizer to make predictions on Dev-SR and Test-SR. Unfortunately, we did not anticipate that this would have such a great influence on the prediction capabilities of the proposed approach. We plan to investigate ways of improving it in future work.

Discussion and Future Work
This section summarizes our findings and outlines perspectives for future work. The syntactic ordering component which we propose is capable of performing accurate tree linearization, but its performance varies depending on the order in which nodes are inserted into the binary tree. Permuting the tokens randomly and training the syntactic component on scrambled token sequences seems to be the easiest way of solving the issue. However, this heuristic method does not guarantee that the model will not encounter an unseen input sequence order, in which case it could fail. A more principled approach would be to define an adaptive model which encodes some notion of processing preference: given a set of tokens, the system should first make the predictions it is most confident about, similar to the easy-first dependency parsing algorithm (Goldberg and Elhadad, 2010) or imitation learning methods (Lampouras and Vlachos, 2016).
Another limitation of the proposed method is its inability to handle non-projective dependencies. This is a simplification decision we made when designing the algorithm: at each point we assume that the perfect token order can be retrieved by recursively ordering head-children subtrees, which excludes long-range crossing dependencies from consideration. By doing so we aggressively prune the search space and simplify the inference procedure, but also rule out a smaller class of more complex constructions. This might not be a problem for the English UD data, which has a small number of non-projective dependencies. However, according to the empirical study of Nivre (2006), almost 25% of the sentences in the Prague Dependency Treebank of Czech (Böhmová et al., 2003) and more than 15% in the Danish Dependency Treebank (Kromann, 2003) contain non-projective dependencies. This implies that for multi-lingual surface realization such an assumption could be too strong.
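A standard projectivity check of the kind our assumption rules out can be sketched as follows (toy head lists, 1-based CoNLL-U indices with 0 as root):

```python
# Sketch: a tree is projective iff no two dependency arcs cross.
def is_projective(heads):
    # Each arc is the (left, right) span between a token and its head.
    arcs = [(min(h, i), max(h, i))
            for i, h in enumerate(heads, start=1) if h != 0]
    for a, b in arcs:
        for c, d in arcs:
            # Arcs cross when one starts strictly inside the other
            # and ends strictly outside it.
            if a < c < b < d:
                return False
    return True

proj = is_projective([2, 3, 0])        # "the dog barked": projective
nonproj = is_projective([3, 4, 0, 3])  # hypothetical crossing arcs
```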
Finally, another simplification which could be addressed is the decomposition of the prediction process into two separate stages of syntactic ordering and word inflection. The benefits of joint morphological inflection and syntactic ordering have been previously explored, but we found no easy way of doing so for the proposed approach. Nevertheless, it seems like a promising direction to pursue, and we plan to investigate it further.

Conclusion
In this paper, we have presented the results of our participation in the Surface Realization Shared Task 2018. We developed a promising method of syntactic ordering; evaluation results on the development data indicate that once the problem of order-sensitivity is solved, it can be successfully applied as a component in the syntactic realization pipeline.