Head-First Linearization with Tree-Structured Representation

We present a dependency tree linearization model with two novel components: (1) a tree-structured encoder based on a bidirectional Tree-LSTM that propagates information first bottom-up and then top-down, which allows each token to access information from the entire tree; and (2) a linguistically motivated head-first decoder that emphasizes the central role of the head and linearizes each subtree by incrementally attaching dependents on both sides of the head. With the new encoder and decoder, we reach state-of-the-art performance on the Surface Realization Shared Task 2018 dataset, outperforming not only the shared task participants, but also previous state-of-the-art systems (Bohnet et al., 2011; Puduppully et al., 2016). Furthermore, we analyze the power of the tree-structured encoder with a probing task and show that it is able to recognize the topological relation between any pair of tokens in a tree.


Introduction
Surface realization is a natural language generation task that searches for the natural linear order of words given an unordered syntax tree. Often, the task is accompanied by predicting word inflection, as in the two previous surface realization shared tasks (Belz et al., 2011, 2018). As morphological inflection prediction is in itself a separate task (Cotterell et al., 2016), we mainly focus on the linearization in this paper.
Syntactic linearization has been extensively studied in the literature. Earlier work mostly focuses on grammar-based approaches using different syntactic formalisms (Elhadad and Robin, 1992; Lavoie and Rambow, 1997; Carroll et al., 1999). Recently, with the increasing availability of annotated treebanks, statistical methods have gained popularity (Langkilde and Knight, 1998; Bangalore and Rambow, 2000; Filippova and Strube, 2009).
Among the most successful statistical linearization systems, Bohnet et al. (2010) employ a divide-and-conquer strategy and use beam search to incrementally find the best linearization for each subtree; Liu et al. (2015) propose a transition system akin to dependency parsing that produces a sentence respecting the given tree constraints, which is later improved by Puduppully et al. (2016) with look-ahead features. Both approaches rely on rich feature templates to capture the structural information from the input and score the (partial) output sequence, and use the perceptron to learn the parameters. Both linearizers achieve state-of-the-art performance on the Surface Realization Shared Task 2011 data (Belz et al., 2011) as part of a pipeline or joint system for the full task including deep semantic generation and word inflection (Bohnet et al., 2011; Puduppully et al., 2017). However, to the best of our knowledge, the two linearizers alone have never been directly compared. Also, they have not been tested on the data from the recent shared task (Belz et al., 2018), where they could have served as very strong baselines to put recent developments into context. Song et al. (2018) are the first to use a neural model for syntactic linearization; they adapt the neural dependency parsing model by Chen and Manning (2014) to predict transitions for linearization, which essentially replaces the perceptron with an MLP in the transition system of Liu et al. (2015). However, their adoption of neural models only takes advantage of token-level representations such as word embeddings, while the structural information is still not well modeled.
Recently, many neural models have been proposed to represent graph structures; cf. Zhou et al. (2018) for an overview. Among them, the Tree-LSTM, in particular the Child-Sum variant (Tai et al., 2015), has been proposed to model (unordered) dependency trees. It differs from the sequential LSTM (Hochreiter and Schmidhuber, 1997) in that it aggregates the hidden states of multiple dependents by summation. It is in turn improved by adding an attention mechanism over the hidden states (Zhou et al., 2016), so that each dependent influences the head representation to a different degree. Miwa and Bansal (2016) propose a bidirectional extension that traverses the tree both bottom-up and top-down to allow tokens to access information from their descendants as well as their ancestors. We adopt and combine their proposed models to represent the tree structure in our task, while improving the bidirectional extension by using the output of the bottom-up pass as the input for the top-down pass, so that each token can access information from all other tokens.
In most linearization models, the incremental generation algorithm follows the left-to-right sequential order. However, in linguistic studies, the head position often plays a central role in describing the constraints and optimization of word orders (Gibson, 1998; Liu, 2010; Futrell et al., 2015). In linearization models that employ left-to-right generation, such word order properties are only implicitly reflected in the features, if at all. Inspired by the above-mentioned studies on head-oriented word order constraints, we adopt an improved linearization algorithm, in which we generate the sequence starting from the head and expanding in both directions. The head-first generation order can easily capture the constraints, since it naturally separates the decision into two aspects: (1) on which side of the head to append the dependent and (2) which dependent to attach closer to the head, which exactly correspond to the two aspects of the word order constraints, namely (1) the direction of the dependent and (2) the distance of the dependent to the head. The algorithm is somewhat similar to He et al. (2009), who also emphasize the central role of the head by first predicting for each dependent on which side of the head it is placed. However, they exhaustively score all permutations, which can be intractable for subtrees with many dependents, while we use incremental beam search to guarantee efficiency.
In this context, our contribution in this work is threefold: (1) we incorporate the tree-based representation into the linearization model; (2) we improve the linearization algorithm with plausible linguistic intuition; and (3) we conduct a comprehensive comparison with several strong baselines on the recent multilingual linearization shared task data, and achieve state-of-the-art performance.

Figure 1: Overview of the pipeline, consisting of (1) linearization, (2) inflection, and (3) detokenization, with an example of the process from an unordered dependency tree to the final sentence.

Model
We use a pipeline system for the surface realization task, consisting of three steps: linearization (§2.1), inflection (§2.2), and detokenization (§2.3). Figure 1 gives an overview of the pipeline along with an example from the input tree to the output text. The input is an unordered dependency tree. We first linearize the tree to obtain an ordered sequence of tokens; then inflect each lemma into the corresponding word form given the morphological information; and finally contract some words into one token and remove the empty space around some punctuation marks, obtaining the output Portuguese text "está cheia destes tesouros." (it is full of these treasures).
To encode the tokens with tree-structured information, we use a bidirectional attentive Tree-LSTM model improved upon previous work (§2.1.1). We use a head-first decoding algorithm with beam search to order each subtree (§2.1.2), trained with latent generation order and an augmented loss (§2.1.3). For the full surface realization task, we then use a hybrid rule-based and seq2seq model to inflect the word forms (§2.2). Finally, we construct an automaton to contract the tokens and use an off-the-shelf detokenizer to remove extra space in the text (§2.3).
In this paper, we mainly focus on the tree-based representation and the linearization algorithm; the inflection and detokenization models are rather simple, but also reasonably good.

Tree-Structured Encoder
We first encode each individual token in the tree by concatenating the embeddings of the lemma, universal part-of-speech (UPOS) tag, and dependency label, denoted v•. We then encode the tree-level information so that each token is aware of the other tokens in the tree.
To propagate information bottom-up from the dependents to their heads, we use a Child-Sum Tree-LSTM (Tai et al., 2015) that sums up the hidden states of the dependents and passes them to the head. To differentiate the importance of each dependent, we apply attention over the hidden states following Zhou et al. (2016). The output of the LSTM is the bottom-up vector for each token, denoted v↑.
Following Miwa and Bansal (2016), we apply a top-down pass to propagate information from the head to the dependents. Since each dependent has only one head, unlike in the bottom-up pass, we use a standard sequential LSTM to encode the paths from the root to each leaf node. For each node, we feed its bottom-up vector v↑ into the hidden state of its head to obtain the hidden state of the current node, and the output is the top-down vector v↓. Miwa and Bansal (2016) perform the two passes independently, i.e., both LSTMs take v• as input and produce v↑ and v↓ as outputs, similar to the standard sequential bidirectional LSTM (Graves and Schmidhuber, 2005). However, two independent passes cannot propagate information from every token to every other token in the tree: each token only receives information from its ancestors and descendants, and is thus not aware of its siblings, which is crucial for linearization.
Therefore, our model performs the bottom-up pass first, and uses its output v↑ as the input for the top-down pass to obtain v↓. In this way, every token in the tree can be accessed by every other token, since any two tokens have a common ancestor, and the information of one token can be first passed up to the common ancestor and then down to the other token. Figure 2 illustrates this information flow.
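To make the two passes concrete, the following Python sketch implements a simplified version of the encoder on a toy tree. For brevity it collapses the per-child forget gates of the Child-Sum Tree-LSTM into a single gate over the summed cell state; all weight names and sizes are illustrative, not our actual configuration.

```python
# A minimal numpy sketch of the encoder.  The tree is given as a head
# array (head[i] is the parent of token i, -1 for the root) and v_tok
# holds the concatenated lemma/UPOS/label embeddings per token.
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # toy hidden size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_params():
    # one weight matrix per gate, acting on [input; state]
    return {g: rng.normal(0, 0.1, (D, 2 * D)) for g in "ifou"}

def lstm_step(p, x, h, c):
    z = np.concatenate([x, h])
    i, f, o = (sigmoid(p[g] @ z) for g in "ifo")
    u = np.tanh(p["u"] @ z)
    c_new = i * u + f * c               # simplified: one forget gate
    return np.tanh(c_new) * o, c_new

def encode(head, v_tok):
    n = len(head)
    children = [[j for j in range(n) if head[j] == i] for i in range(n)]
    up, cell, down = [None] * n, [None] * n, [None] * n
    w_att = rng.normal(0, 0.1, D)       # attention scoring vector
    p_up, p_down = lstm_params(), lstm_params()

    def bottom_up(i):                   # post-order traversal
        for j in children[i]:
            bottom_up(j)
        if children[i]:
            hs = np.stack([up[j] for j in children[i]])
            a = np.exp(hs @ w_att); a /= a.sum()   # attentive Child-Sum
            h_sum = (a[:, None] * hs).sum(0)
            c_sum = sum(cell[j] for j in children[i])
        else:
            h_sum = c_sum = np.zeros(D)
        up[i], cell[i] = lstm_step(p_up, v_tok[i], h_sum, c_sum)

    def top_down(i, h, c):              # pre-order traversal; the input
        h, c = lstm_step(p_down, up[i], h, c)  # is the bottom-up vector
        down[i] = h
        for j in children[i]:
            top_down(j, h, c)

    root = head.index(-1)
    bottom_up(root)
    top_down(root, np.zeros(D), np.zeros(D))
    return up, down                     # v-up and v-down for every token

# toy tree: token 0 is the root, 1 and 2 depend on it, 3 depends on 2
v_up, v_down = encode([-1, 0, 0, 2], rng.normal(0, 0.1, (4, D)))
```

Because the top-down pass consumes the bottom-up outputs, the final v↓ of a token already contains information about its siblings, which two independent passes would miss.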

Head-First Decoder
We adopt the general divide-and-conquer strategy of Bohnet et al. (2010), first linearizing each subtree and then combining the ordered subtrees into a full sentence. Instead of generating the sequences from left to right as in Bohnet et al. (2010), we generate the sequence from the inside out, i.e., we initialize the sequence with the head, and expand outwards by appending dependents to the left or the right end of the sequence. This new generation order is motivated by the linguistic research on word order constraints, which largely focuses on the relative direction and distance of the dependent to the head (Gibson, 1998; Liu, 2010; Gulordava, 2018).
Following Bohnet et al. (2010), we use beam search to find the best sequence for each subtree incrementally; see the pseudocode in Algorithm 1.
We initialize the agenda with a sequence that contains only the head (lines 3-4). A sequence is represented by two LSTMs, both initialized with the head representation, which correspond to the left and right expansions of the sequence.
At each step, for each sequence in the agenda, we use a pointer network (Vinyals et al., 2015) to calculate the unnormalized attention score between the left LSTM state and all the remaining tokens as the scores of attaching each token to the left (ATTEND_l in line 10, where v_t is the vector representation of the token t), and we do the same for the right (line 14). We then create a new sequence for each possible attachment (lines 9 and 13, where ⊕ denotes concatenation), and the score of each new sequence is incremented by the attachment score (lines 10 and 14). We also update the corresponding LSTM state of that sequence by adding the representation of the attached dependent as input (lines 11 and 15). The new sequences are then added to the agenda for the next step (lines 12 and 16).
If the number of new sequences in the new agenda is larger than the beam size, we sort the sequences and keep only the highest scoring ones for further expansion (lines 19-20), and we take the highest scoring full sequence as the linearization of the subtree (line 22). Finally, when each subtree is linearized, we combine them into a full sentence as the output (line 24). We can easily modify the algorithm for left-to-right and right-to-left generation orders. Since the generation then only goes in one direction, we only use one LSTM state to score the expansion, and the initial sequence is an empty sequence.
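The following Python sketch illustrates the head-first expansion with beam search. It replaces the two LSTM states with simple running vector sums and uses a dot-product stand-in for ATTEND; it is a minimal illustration of the search procedure, not our actual implementation.

```python
# A simplified sketch of head-first beam search over one subtree.
import numpy as np

def attend(state_vec, token_vec):
    # unnormalized pointer-network score (dot-product stand-in)
    return float(state_vec @ token_vec)

def head_first_linearize(head, deps, vec, beam_size=4):
    """Order one subtree: `head` plus its dependents `deps`,
    with `vec[t]` the encoder representation of token t."""
    # a beam item: (score, sequence, left state, right state, remaining)
    agenda = [(0.0, [head], vec[head], vec[head], frozenset(deps))]
    while any(item[4] for item in agenda):
        new_agenda = []
        for score, seq, ls, rs, rest in agenda:
            for t in rest:
                # attach t on the left end of the sequence ...
                new_agenda.append((score + attend(ls, vec[t]),
                                   [t] + seq, ls + vec[t], rs,
                                   rest - {t}))
                # ... or on the right end
                new_agenda.append((score + attend(rs, vec[t]),
                                   seq + [t], ls, rs + vec[t],
                                   rest - {t}))
        # prune the agenda down to the beam size
        agenda = sorted(new_agenda, key=lambda x: -x[0])[:beam_size]
    return agenda[0][1]                 # best full sequence

rng = np.random.default_rng(1)
vec = {t: rng.normal(size=8) for t in range(5)}
print(head_first_linearize(head=0, deps=[1, 2, 3, 4], vec=vec))
```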
The three decoders tend to make different mistakes, since they have different starting points and could prune off the correct path in different ways. Therefore it is beneficial to combine the three decoders to vote for the best sequence. Concretely, we first shift the scores of the sequences in each beam so that the minimum score is 0 (implying that absent sequences have negative scores), then combine the sequences from the three beams by summing up the scores of identical sequences, and finally choose the highest scoring sequence in the combined beam as output.
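A minimal sketch of this voting step, where each beam is assumed to be given as a dictionary from tuple-encoded sequences to scores:

```python
# Combine the final beams of the three decoders by shifted-score voting.
def vote(beams):
    combined = {}
    for beam in beams:
        lo = min(beam.values())          # shift so the minimum is 0;
        for seq, score in beam.items():  # absent sequences contribute 0
            combined[seq] = combined.get(seq, 0.0) + (score - lo)
    return max(combined, key=combined.get)

# toy example with three decoders
b_l2r  = {(1, 2, 3): 5.0, (2, 1, 3): 4.0}
b_r2l  = {(1, 2, 3): 3.5, (3, 2, 1): 3.0}
b_h2lr = {(2, 1, 3): 6.0, (1, 2, 3): 5.5}
print(vote([b_l2r, b_r2l, b_h2lr]))      # -> (1, 2, 3)
```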

Training with Beam Search
The head-first linearization introduces spurious ambiguity, since there may be two correct attachments (on the left or the right end of the sequence) at each step. Enforcing a canonical sequence of attachments would yield suboptimal performance. We therefore view the order of attaching the dependents (i.e., whether to attach the left or right dependent first) as a latent variable, and the created sequence as the real target. We adopt the training method of Björkelund and Kuhn (2014): at every step after pruning the beam, we check whether there is still at least one gold partial sequence in the beam. If not, we calculate the hinge loss between the highest scoring gold sequence and all incorrect sequences in the beam.
We also follow the delayed LaSO strategy of Björkelund and Kuhn (2014): after all gold partial sequences fall out of the beam and a loss is incurred, we continue training by putting the gold sequence back into the beam, until reaching the full sequence. This has been shown to be more sample-efficient than the early-update strategy (Collins and Roark, 2004), since it allows the model to train on the full sequence even if the gold path falls out of the beam early.
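Schematically, the training loop looks as follows, where `expand` and `is_gold` are assumed stand-ins for beam expansion and the gold check:

```python
# A schematic sketch of training with beam search and delayed LaSO.
# `expand` returns all scored candidate partial sequences for the next
# step and `is_gold` checks whether a partial sequence can still be
# completed to a gold order; items are (score, sequence) pairs.
def train_subtree(init_item, n_steps, expand, is_gold, beam_size,
                  margin=1.0):
    beam, losses = [init_item], []
    for _ in range(n_steps):
        cands = sorted(expand(beam), reverse=True)
        beam = cands[:beam_size]
        if not any(is_gold(seq) for _, seq in beam):
            gold = max(c for c in cands if is_gold(c[1]))
            # hinge between the best gold item and every (incorrect)
            # item left in the beam; with the augmented loss below,
            # `margin` would be the inversion-based value instead
            losses += [max(0.0, margin + s - gold[0]) for s, _ in beam]
            beam = [gold] + beam[:-1]    # delayed LaSO: restore gold
    return losses
```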
The standard hinge loss updates the gold sequence against the incorrect ones by enforcing a margin (typically 1), which punishes all incorrect sequences equally. However, not all incorrect sequences are equally bad in terms of the BLEU score; maintaining a larger margin for worse sequences could therefore improve performance.
We cannot directly use the BLEU score as the margin, since it is calculated on the sentence level, while we train on the subtree level, and the sequences during training are often incomplete due to early stopping. Therefore, we use the inversion number as a surrogate loss for the BLEU score.
For a partial sequence, we first append the rest of the tokens to both ends of the sequence in the optimal way, then calculate the number of swaps a bubble sort needs to reach the gold sequence, and take the square root as the loss. For example, if we have a predicted partial sequence (1, 2, 4, 6, 7) and the remaining tokens are {3, 5}, we first obtain the best available full sequence (3, 1, 2, 4, 6, 7, 5), then calculate the number of bubble-sort swaps, which is 4, and the loss value is thus 2.
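The following sketch reproduces this computation; completing the partial sequence greedily, one token at a time to the better end, is our simplification of "the optimal way":

```python
# Inversion-number surrogate loss for a partial sequence.
import math

def count_inversions(seq, gold):
    rank = {t: i for i, t in enumerate(gold)}
    r = [rank[t] for t in seq]
    # number of bubble-sort swaps = number of inverted pairs
    return sum(r[i] > r[j]
               for i in range(len(r)) for j in range(i + 1, len(r)))

def optimal_completion(partial, remaining, gold):
    seq = list(partial)
    for t in remaining:                  # greedily pick the better end
        seq = min([t] + seq, seq + [t],
                  key=lambda s: count_inversions(s, gold))
    return seq

def surrogate_loss(partial, remaining, gold):
    full = optimal_completion(partial, remaining, gold)
    return math.sqrt(count_inversions(full, gold))

gold = [1, 2, 3, 4, 5, 6, 7]
print(surrogate_loss([1, 2, 4, 6, 7], [3, 5], gold))   # -> 2.0
```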

Inflection
We use a simple hybrid approach for the inflection task. We first extract all inflection patterns from the training data: for each combination of lemma, UPOS, and morphological features, if a word form appears more than once and accounts for over 99% of the occurrences, we keep it as a rule.
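A minimal sketch of this rule extraction, assuming the training data is given as (lemma, UPOS, features, form) tuples:

```python
# Extract deterministic inflection rules from the training data.
from collections import Counter, defaultdict

def extract_rules(train, min_count=2, min_ratio=0.99):
    by_key = defaultdict(Counter)
    for lemma, upos, feats, form in train:
        by_key[(lemma, upos, feats)][form] += 1
    rules = {}
    for key, forms in by_key.items():
        form, count = forms.most_common(1)[0]
        if count >= min_count and count / sum(forms.values()) > min_ratio:
            rules[key] = form            # deterministic lookup rule
    return rules
```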
For the tokens not covered by the rules, we use a seq2seq model similar to Kann and Schütze (2016) to predict the inflection, but with a major difference. Instead of the inflected word, we predict as the output sequence the edit script that modifies the lemma into the word. The alphabet of the edit script includes all the characters in the treebank and three special symbols: one to copy one input character, one to delete one input character, and $ to finish generation. For example, to inflect the Portuguese verb "falar" to "falando", the output sequence copies the first four characters, deletes the final "r", and appends "ndo$". The advantage of predicting edit scripts instead of words is that the copy action avoids the mistake of generating incorrect but similar characters.
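The following sketch derives and applies such a script for the example above. We render the copy and delete symbols as "=" and "-" for display, and assume a script that first walks the longest common prefix, which is one simple way to generate target scripts; the actual symbols and alignment are a design choice.

```python
# Derive an edit script from a (lemma, form) pair and apply it back.
def edit_script(lemma, form):
    p = 0                                # longest common prefix length
    while p < min(len(lemma), len(form)) and lemma[p] == form[p]:
        p += 1
    return "=" * p + "-" * (len(lemma) - p) + form[p:] + "$"

def apply_script(lemma, script):
    out, i = [], 0
    for op in script:
        if op == "=":   out.append(lemma[i]); i += 1   # copy one char
        elif op == "-": i += 1                          # delete one char
        elif op == "$": break                           # finish
        else:           out.append(op)                  # insert char
    return "".join(out)

print(edit_script("falar", "falando"))       # -> "====-ndo$"
print(apply_script("falar", "====-ndo$"))    # -> "falando"
```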
The edit script is in a way also similar to Bohnet et al. (2010), but they predict the full edit script as a single tag, which results in a very large tag set for many morphologically rich languages and makes it difficult to learn and generalize.
We use a bidirectional LSTM to encode the characters of the lemma and the morphological tags as a sequence. Each character embedding is concatenated with a binary feature that indicates whether the corresponding input character is currently the target of the edit operation. Initially, the indicator for the first character is set to 1 and the rest are 0. A decoder LSTM with attention is then used to predict the edit operations. The input to the decoder LSTM state is the concatenation of the tree-structured token representation as in §2.1.1, the attended input vector, and the embedding of the last produced character. If the predicted action is a copy or a delete, the indicator on the input is advanced by one step; if the prediction is adding a character, the indicator does not move.

Detokenization
Since the surface realization shared task is evaluated on the generated text instead of tokens, a detokenization step is needed to compare with the other participating systems. We first contract sequences of tokens into single words; for example, in Portuguese, the preposition "de" and the determiner "estes" are contracted into one token "destes" in the text.
We extract the contraction cases from the training set and construct an automaton to contract the tokens. Concretely, we read the tokens one by one; when a token matches the initial state of the automaton, we store the token in a buffer and advance the automaton. If the automaton reaches an end state, we replace the tokens in the buffer with the corresponding contracted token; otherwise, we add the tokens in the buffer to the output sequence.
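The following sketch gives an equivalent greedy longest-match formulation of this automaton, with the contraction rules assumed to be stored as a dictionary over token sequences:

```python
# Contract token sequences into single words by greedy longest match.
def contract(tokens, contractions):
    out, i, max_n = [], 0, max(map(len, contractions))
    while i < len(tokens):
        # try the longest matching token sequence starting at position i
        for n in range(max_n, 0, -1):
            key = tuple(tokens[i:i + n])
            if key in contractions:
                out.append(contractions[key]); i += n
                break
        else:                            # no rule fired: copy the token
            out.append(tokens[i]); i += 1
    return out

rules = {("de", "estes"): "destes"}      # extracted from the training set
print(contract(["cheia", "de", "estes", "tesouros"], rules))
# -> ['cheia', 'destes', 'tesouros']
```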
The final step is to remove the spaces surrounding certain punctuation marks, e.g., the period at the end of the sentence. We use a rule-based off-the-shelf tool, the MosesDetokenizer, which yields satisfactory results compared to other similar alternatives.

Data and Baselines
We conduct the experiments on the datasets from the Surface Realization Shared Task 2018 (SR18) (Belz et al., 2018), which include 10 (mostly European) languages from the Universal Dependencies (Nivre et al., 2016). We compare our system with two state-of-the-art linearization systems, Bohnet et al. (2010) and Puduppully et al. (2016), referred to as B10 and P16. We run their linearization systems as is, using lemma, UPOS, and dependency labels as features. We also use the same features in our system for comparison. To evaluate the linearization step alone, we calculate the BLEU score based on lemmata.
We also compare to the best performing systems in SR18, where the final BLEU score is reported on the detokenized text. We execute our full pipeline of linearization, inflection, and detokenization and evaluate with the official evaluation script. We also apply our inflection and detokenization steps to the predicted linearizations of B10 and P16, so that they can also be compared to the other systems.

Implementation Details
Our model is implemented with the DyNet library (Neubig et al., 2017) and is available at the first author's website. We use embedding sizes of 64, 32, and 32 for lemma, UPOS, and dependency labels, respectively, and the dimension of the token representation is 128. The hidden states of both the bottom-up and top-down encoder LSTMs, as well as the decoder LSTMs, have a dimension of 128. The decoder beam size is 32. The linearizer is quite efficient among neural models: training on a medium-sized treebank takes about 2 hours on a single CPU core.

Linearization
We first compare each step in our pipeline to the available baselines. For linearization, we test our models with the same tree encoding and different decoding orders (left-to-right (L2R), right-to-left (R2L), and head-first (H2LR)), as well as voting among the three (Vote). The results are shown in Table 1.
Of the two baseline systems, B10 performs more than 4 BLEU points higher than P16. We believe the reason is that the subtree-level beam search in B10 allows it to explore almost all possible permutations for most of the subtrees, while P16 directly orders the full sentence, which can only explore a fraction of the full search space even with a very large beam size. Understandably, the P16 model is designed to linearize words with partial or even no syntactic information, so knowledge of the subtrees cannot be assumed. However, in the scenario with full syntactic information available, B10 is clearly the better model.
We then compare our model to B10. The L2R linearizer generates the hypothesis in the same way as B10, but uses a much smaller beam size of 32. It scores 1 BLEU point higher than B10, which demonstrates the advantage of the more expressive tree-based representation.
The H2LR order performs better than L2R and R2L, which can be explained in multiple ways. One explanation is our motivation that generating from the head better reflects word order constraints. Another explanation is that training with latent generation order allows the model to make the easier decisions first, similar to the easy-first parser of Goldberg and Elhadad (2010).
Finally, combining the three decoders by voting achieves 2 BLEU points higher than B10. There are two main reasons for this improvement: (1) multitask-style training helps regularize the parameters, and (2) the different generation directions tend to prune the correct sequences at different locations, so a mistake in one direction might be saved by the other two.

Inflection
Table 2 shows the inflection performance of different models: the first model predicts the edit script as a tag (EditTag); the second model predicts the character sequence of the inflected word (CharSeq); the third model predicts the edit script as a sequence of actions (EditSeq); and the last one uses the same model as the third, but first applies the extracted rules if available (+rule). The results are compared to the inflection accuracy reported in Puzikov and Gurevych (2018) (P18), which is adapted from Aharoni and Goldberg (2017). Among our first three models, EditTag performs the worst, mainly because of the very large tag sets in many languages (the sizes vary from around 300 for English to over 10000 for Finnish and Russian), which prevents effective learning and generalization. The CharSeq model performs much better than EditTag, especially on the languages with very large edit script tag sets. The EditSeq model performs better than the character seq2seq model, mainly because the copy mechanism avoids many noisy generation errors. Some typical mistakes by CharSeq that we find in English are "traveling" → "braveling" and "children" → "thildren", where some characters in the input lemma are confused with similar ones. Some typical mistakes by EditSeq are "kidding" → "kiding" and "clashes" → "clashs", where some necessary characters in the output are omitted.
Finally, the combination of the EditSeq model and the extracted rules performs the best. On the development sets, the token coverage of the rules ranges from about 60% to 90% for the different languages, with over 99% accuracy, which means that the majority of the inflections can be produced reliably and efficiently by simply looking them up in a dictionary. Although the seq2seq model alone is slightly weaker than the very strong baseline, the hybrid approach outperforms it; we believe this simple trick could also benefit other inflection models.

Detokenization
As the final step, we evaluate the performance of the detokenization, which includes contracting words and attaching punctuation.We use gold linearization and inflection as the input.
We separate the evaluation into two parts: for contraction, we evaluate the token-based BLEU score against the gold contraction on the UD development set; for punctuation attachment, we use the gold contracted words and evaluate with the official text-based BLEU score. We also evaluate the combined results where both contraction and detokenization are predicted.
Table 3 shows the results of these four scenarios. The first column contains the results of simply separating all tokens with spaces. The BLEU score is only around 55 even when the linearization and inflection are all correct, which shows the disproportionately large impact of the detokenization step on the shared task evaluation.
Our detokenizer works reasonably well for most of the languages, except for Arabic, where both the contraction and detokenization results are rather poor. We will investigate this issue in future work; it could potentially be addressed with an edit seq2seq model similar to the one for the inflection task, but on the sentence level.

Final Results
We choose the best variant of each step in the pipeline for the full experiment, where we compare with the results of the other participants in the shared task, as well as with the linearizers of Bohnet et al. (2010) and Puduppully et al. (2016) combined with our inflection and detokenization models as additional baselines.
Table 4 shows the performance of the full pipeline on the test sets. B10 and P16 are the linearizers of Bohnet et al. (2010) and Puduppully et al. (2016) combined with our inflection and detokenization models; ST18 denotes the best result for each language in the shared task (King and White, 2018; Puzikov and Gurevych, 2018; Ferreira et al., 2018; Elder and Hokamp, 2018). The last column contains the results of our system.
It is apparent that both B10 and P16 outperform the other participating systems by a large margin. The advantage of our linearizer also carries over to the full pipeline: it scores 2 BLEU points higher than the best baseline.

Relation Awareness
Our tree-based representation is theoretically able to propagate information between all tokens in the tree. We now test whether it can really make use of such information.
We design a probing task to test whether the model can tell the relation between two tokens. Concretely, we pick two random tokens (t1, t2) in a tree; their relation can be described as a tuple (d1, d2), the distances from t1 and t2 to their lowest common ancestor. For example, in Figure 2, the relation between tokens 4 and 8 is (1, 2).
We build a simple MLP on the concatenation of the representations of the two tokens to predict the relation as a classification task. To avoid data sparsity, we only predict d1 and d2 up to 3; any relation beyond this distance is classified as "too far". There are in total 4 × 4 = 16 classes.
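The labels can be computed as in the following sketch; the exact bucketing of large distances (here, folding everything at or beyond the cap into the top bucket, which yields 4 × 4 classes) is our assumption:

```python
# Compute the probing label (d1, d2) for a pair of tokens, assuming the
# tree is given as a head array (head[i] is the parent of i, -1 = root).
def relation(head, t1, t2, cap=3):
    def ancestors(t):                    # path from t up to the root
        path = [t]
        while head[path[-1]] != -1:
            path.append(head[path[-1]])
        return path
    a1, a2 = ancestors(t1), ancestors(t2)
    lca = next(n for n in a1 if n in a2)  # lowest common ancestor
    d1, d2 = a1.index(lca), a2.index(lca)
    # distances beyond the cap collapse into the "too far" bucket
    return (min(d1, cap), min(d2, cap))
```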
We test the token representations with and without tree encoding in two scenarios: (1) training all parameters, which tests whether the encoder architecture is able to learn the relations, and (2) training only the MLP, which tests whether the encoder trained for linearization actually captures such relations.
Table 5 shows the accuracy of the probing task. Clearly, the representation without tree encoding cannot correctly classify the relation; its accuracy is higher than chance level only because the lexical information allows it to guess to some extent. The tree-structured encoder has much higher accuracy than this guessing baseline. Training all parameters achieves higher accuracy than training only the MLP, which suggests that the encoder architecture is able to memorize the relations of many tokens, but the linearization task does not actually require that much information.

Synergy between Encoder and Decoder
Our model uses the bidirectional Tree-LSTM to pass information both bottom-up and top-down. However, it is not yet clear whether having both directions is necessary, and how much each direction influences the performance of the different decoders.
Table 6 shows the average performance of the four decoders (H2LR, L2R, R2L, and combining all three) combined with four possible encoders: both directions (Both), only bottom-up (BU), only top-down (TD), and only the token representation without tree information (None).
When no bottom-up pass is performed (TD and None), the performance drops by a large margin, which means that information about the dependents is crucial for linearization. In contrast, skipping the top-down pass has a much smaller influence on H2LR, while L2R and R2L also only suffer a moderate performance drop.
Interestingly, the drop from TD to None is much larger for L2R and R2L. The reason could be that the L2R and R2L decoders treat every token equally and have no indication of the head if no structural information is used, while the H2LR decoder starts with the head and builds the sequence around it based on the head-oriented word order constraints. Therefore, even when there is no structural information, the prior in the H2LR decoder can still lead to better decisions. This also supports our intuition on the pivotal role of the head in the generation process.
Since skipping one of the passes hurts the performance of the L2R and R2L decoders, and thus also the Vote decoder, we use both passes in our final model, although the bottom-up pass alone suffices for the H2LR decoder.

Conclusion
We present a dependency tree linearization model with a tree-structured encoder and a head-first decoder, which outperforms the previous state-of-the-art linearizers. Combined with our morphological inflection and detokenization models, it achieves the best performance on the Surface Realization Shared Task 2018 by a substantial margin. We also show that the previous work by Bohnet et al. (2010), on which our decoding algorithm is based, is still a very strong baseline.
As future work, we plan to extend the head-first linearization algorithm to (jointly) generate absent function words from the deep semantic representation. This corresponds to the deep track of the surface realization shared tasks, which is also a more realistic setting for natural language generation.

Figure 2: An illustration of the information flow in the encoder, where the red dotted arrows represent the bottom-up pass and the blue dashed arrows represent the top-down pass. The solid arrows illustrate the information flow from node 8 to node 4.

Table 3: Detokenization on the development set, where the contraction and punctuation steps are gold, predicted, or not used.

Table 4: Final results on the test set, where we compare our model to two baselines (B10 and P16) and the best system in the shared task for each language (ST18).

Table 5: Relation classification accuracy of the encoders with only token information (Token) vs. with tree information (Tree).

Table 6: Performance of combinations of linearization orders and representations on the development set, averaged over 10 treebanks.