Transition-Based Deep Input Linearization

Traditional methods for deep NLG adopt pipeline approaches comprising stages such as constructing syntactic input, predicting function words, linearizing the syntactic input and generating surface forms. Though easy to conceptualize, pipeline approaches suffer from error propagation. In addition, information available to one module cannot be leveraged by the others. We construct a transition-based model that jointly performs linearization, function word prediction and morphological generation, and that considerably improves accuracy over a pipelined baseline system. On a standard deep input linearization shared task, our system achieves the best results reported so far.


Introduction
Natural language generation (NLG) (Reiter and Dale, 1997; White, 2004) aims to synthesize natural language text from input syntactic, semantic or logical representations. It has been shown useful in various NLP tasks, including machine translation (Chang and Toutanova, 2007), abstractive summarization (Barzilay and McKeown, 2005) and grammatical error correction (Lee and Seneff, 2006).
A line of traditional methods treats the problem as a pipeline of several independent steps (Bohnet et al., 2010; Wan et al., 2009; Bangalore et al., 2000; Oh and Rudnicky, 2000; Langkilde and Knight, 1998). For example, as shown in Figure 1b, a pipeline based on meaning-text theory (MTT) (Melčuk, 1988) splits NLG into three independent steps: 1. syntactic generation: generating an unordered, lemma-formed syntactic tree from a semantic graph, introducing function words; 2. syntactic linearization: linearizing the unordered syntactic tree; 3. morphological generation: generating the inflection for each lemma in the string.

(* Part of the work was done when the author was a visiting student at Singapore University of Technology and Design.)

(Figure caption: Note that words are replaced by their lemmas. The function word to and the comma are absent in the graph.)
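The three-stage pipeline above can be sketched as function composition. This is a deliberately toy illustration, not the actual MTT implementation; all function names and the tiny rules inside them are invented for this sketch:

```python
# Schematic of the three-stage MTT-style pipeline (all names illustrative).

def syntactic_generation(semantic_graph):
    """Stage 1: unordered, lemma-formed syntactic tree + function words."""
    tree = dict(semantic_graph)              # copy the input structure
    tree["nodes"] = tree["nodes"] + ["to"]   # e.g. introduce a function word
    return tree

def syntactic_linearization(tree):
    """Stage 2: choose a linear order for the unordered tree."""
    return sorted(tree["nodes"])             # stand-in for a learned ordering model

def morphological_generation(lemmas):
    """Stage 3: inflect each lemma in the ordered string (toy rule)."""
    return ["goes" if l == "go" else l for l in lemmas]

def pipeline(semantic_graph):
    # Errors made at any stage propagate to all later stages,
    # and later stages cannot inform earlier ones.
    return morphological_generation(
        syntactic_linearization(syntactic_generation(semantic_graph)))

print(pipeline({"nodes": ["he", "go", "home"]}))
```

The point of the sketch is structural: each stage sees only the previous stage's output, which is exactly the information bottleneck the joint model removes.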
In this paper we focus on deep graphs as input. As exemplified in Figure 2, the deep input type is intended to be an abstract representation of the meaning of a sentence. Unlike semantic input, where the nodes are semantic representations, deep input is more surface-centric, with lemmas for each word connected by semantic labels (Banarescu et al., 2013; Melčuk, 2015). In contrast to shallow syntactic trees, function words in surface forms are not included in deep graphs (Belz et al., 2011). Deep inputs commonly occur as input to NLG systems where entities and content words are available, and one has to generate a grammatical sentence from them, with provision only for inflecting words and introducing function words. Such use cases include summarization, dialog generation, etc.
A pipeline for deep input linearization is shown in Figure 1a. Generation involves predicting the correct word order, deciding inflections and filling in function words at the appropriate positions. The worst-case complexity is n! for permuting n words, 2^n for function word prediction (assuming that a function word can be inserted after each content word), and 2^n for inflection generation (assuming two morphological forms per lemma). On the dataset from the First Surface Realisation Shared Task, Bohnet et al. (2011) achieved the best reported results on linearizing deep input representations, following the pipeline of Figure 1b (with a deep graph instead of a semantic graph as input). They construct a syntactic tree from the deep input graph, followed by function word prediction, linearization and morphological generation. A rich set of features is used at each stage of the pipeline, and for each adjacent pair of stages an SVM decoder is defined.
Pipelined systems suffer from error propagation. In addition, because the steps are independent of each other, information available at a later stage cannot be used at earlier stages. We introduce a transition-based (Nivre, 2008) method for joint deep input surface realisation, integrating linearization, function word prediction and morphological generation. The model is shown in Figure 1c, as compared with the pipelined baseline in Figure 1a. For a directly comparable baseline, we construct a pipeline system of function word prediction, linearization and morphological generation similar to the pipeline of Bohnet et al. (2011), but with the following difference: our baseline pipeline makes function word predictions on the deep input graph, whereas Bohnet et al. (2011) have a preprocessing step that constructs a syntactic tree from the deep input graph, which is then given as input to the function word prediction module. Our pipeline is thus directly comparable to the joint system with regard to the use of information.
Standard evaluations show that: 1. our joint model for deep input surface realisation achieves significantly better scores than its pipeline counterpart; 2. we achieve the best results reported on the task, scoring 1 BLEU point higher than Bohnet et al. (2011) without using any external resources. We make the source code available at https://github.com/SUTDNLP/ZGen/releases/tag/v0.3.

Related Work
Related work can be broadly summarized into three areas: abstract word ordering, applications of meaning-text theory, and joint modelling of NLP tasks. In abstract word ordering (Wan et al., 2009; Zhang, 2013; Zhang and Clark, 2015), De Gispert et al. (2014) compose phrases over individual words and permute the phrases to achieve linearization. Schmaltz et al. (2016) show that strong surface-level language models are more effective than models trained with syntactic information for the task of linearization. Transition-based techniques have also been explored (Liu et al., 2015; Puduppully et al., 2016). To our knowledge, we are the first to use transition-based techniques for deep input linearization.
There has been work on sentence linearization using meaning-text theory (Melčuk, 1988). Belz et al. (2011) organized a shared task on both shallow and deep linearization according to meaning-text theory, which provides a standard benchmark for system comparison. Bohnet et al. (2011) achieved the best results for the task of shallow-syntactic linearization; using SVM models with rich features, they also achieved state-of-the-art results on the task of deep realization. While they built a pipeline system, we show that joint models can be used to overcome the limitations of the pipeline approach, giving the best results. Joint models for NLP have shown effectiveness in recent years. Though they have to tackle an increased search space, they overcome issues with error propagation in pipelined models. Joint models have been explored for grammar-based approaches to surface realisation using HPSG and CCG (Carroll and Oepen, 2005; Velldal and Oepen, 2006; Espinosa et al., 2008; White and Rajkumar, 2009; White, 2006; Carroll et al., 1999). Joint models have also been proposed for word segmentation and POS-tagging (Zhang and Clark, 2010), POS-tagging and syntactic chunking (Sutton et al., 2007), segmentation and normalization (Qian et al., 2015), syntactic linearization and morphologization (Song et al., 2014), parsing and NER (Finkel and Manning, 2009), entity and relation extraction (Li and Ji, 2014), and so on. We propose a first joint model for deep realization, integrating linearization, function word prediction and morphological generation.

Baseline
We build a baseline following the pipeline in Figure 1a. Three stages are involved: 1. prediction of function words, inserting the predicted function words in the deep graph, resulting in a shallow graph; 2. linearizing the shallow graph; 3. generating the inflection for each lemma in the string.

Function Word Prediction
In the First Surface Realisation Shared Task dataset (Belz et al., 2011), there are three classes of function words to predict: the to infinitive, the that complementizer and the comma. We implement classifiers that predict these classes of function words locally at the respective positions in the deep graph, resulting in a shallow graph (Figure 3). At each location the input is a node and the output is a class indicating whether to or that needs to be inserted under the node, or the count of commas to be introduced under the node. Table 1 shows the feature templates for the classification of to infinitives and that complementizers, and Table 2 shows the feature templates for predicting the count of comma child nodes for each non-leaf node in the graph. These feature templates are a subset of the features used in the joint model (Section 4), the exceptions being word order features, which are not available to the pipeline system since earlier stages cannot leverage features from subsequent outcomes. We use an averaged perceptron classifier (Collins, 2002) to predict function words, which is consistent with the joint model.

Table 1: Feature templates for the prediction of the function words to infinitive and that complementizer: WORD(n); POS(n); WORD(c). Indices on the surface string: n - word index; c - child of n. Functions: WORD - word at index; POS - part-of-speech at index.

Table 2: Feature templates for the comma prediction system: WORD(n); POS(n); BAG(WORD-MOD(n)); BAG(LABEL-MOD(n)). Indices on the surface string: n - word index. Functions: WORD - word at index; POS - part-of-speech at index; WORD-MOD - modifiers of index; LABEL-MOD - dependency labels of modifiers; BAG - set.
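The averaged perceptron classifier (Collins, 2002) used here can be sketched in a few lines. The class below and the toy features are illustrative only (the real system uses the templates of Tables 1 and 2), but the averaging trick is the standard one:

```python
from collections import defaultdict

class AveragedPerceptron:
    """Minimal averaged perceptron classifier (Collins, 2002).
    A sketch of the kind of classifier described above, not the
    authors' implementation."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.w = defaultdict(float)   # current weights, keyed by (feature, class)
        self.u = defaultdict(float)   # accumulated updates for averaging
        self.t = 0                    # example counter

    def score(self, feats, c):
        return sum(self.w[(f, c)] for f in feats)

    def predict(self, feats):
        return max(self.classes, key=lambda c: self.score(feats, c))

    def train_example(self, feats, gold):
        self.t += 1
        pred = self.predict(feats)
        if pred != gold:
            for f in feats:
                for c, d in ((gold, 1.0), (pred, -1.0)):
                    self.w[(f, c)] += d
                    self.u[(f, c)] += self.t * d

    def average(self):
        # Standard averaging trick: w_avg = w - u / t.
        for k in self.w:
            self.w[k] -= self.u[k] / max(self.t, 1)

# Toy instances with Table 1-style features: WORD(n), POS(n), WORD(c).
data = [({"WORD=want", "POS=VB", "CWORD=go"},    "to"),
        ({"WORD=say",  "POS=VB", "CWORD=leave"}, "that"),
        ({"WORD=eat",  "POS=VB", "CWORD=apple"}, "none")]

clf = AveragedPerceptron(["none", "to", "that"])
for _ in range(3):
    for feats, gold in data:
        clf.train_example(feats, gold)
clf.average()
print([clf.predict(f) for f, _ in data])
```

Averaging the weights over all updates reduces overfitting to late training examples, which is why the same learner is reused in the joint model.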

Linearization
The next step is linearizing the graph, which we solve using a novel transition-based algorithm.

Transition-Based Tree Linearization
Liu et al. (2015) introduce a transition-based model for tree linearization. The approach extends transition-based parsers (Nivre and Scholz, 2004; Chen and Manning, 2014), where the state consists of a stack to hold partially built outputs and a queue to hold the input sequence of words. In the case of linearization, the input is a set of words, so Liu et al. use a set to hold the input instead of a queue. The state is represented by a tuple (σ, ρ, A), where σ is a stack storing partial derivations, ρ is the set of input words and A is the set of dependency relations built so far. There are three transition actions:
• SHIFT-Word-POS - shifts Word from ρ, assigns POS to it and pushes it onto the stack as S0;
• LEFTARC-Label - builds an arc with label Label from S0 to S1 and pops S1;
• RIGHTARC-Label - builds an arc with label Label from S1 to S0 and pops S0.

Table 3: Transition action sequence for linearizing the graph in Figure 3. SH - SHIFT, RA - RIGHTARC, LA - LEFTARC. POS is not shown in SHIFT actions.
The sequence of actions to linearize the set {he, goes, home} is SHIFT-he, SHIFT-goes, SHIFT-home, RIGHTARC-OBJ, LEFTARC-SBJ. The full set of feature templates is shown in Table 2 of Liu et al. (2015), and partly in Table 4. The features include the word (w), POS (p) and dependency label (l) of elements on the stack and their descendants S0, S1, S0,l, S0,r, etc. For example, the word on top of the stack is S0w and the word on the first left child of S0 is S0,lw. These are called configuration features. They are combined with all possible actions to score each action. Puduppully et al. (2016) extend Liu et al. (2015) by redefining features to address feature sparsity and by introducing lookahead features, thereby achieving the highest accuracies on the task of abstract word ordering.
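The action sequence for {he, goes, home} can be traced with a minimal state implementation. This is a sketch assuming arc-standard semantics for LEFTARC/RIGHTARC (consistent with the example arcs goes→home OBJ and goes→he SBJ); it is not the authors' code:

```python
# Trace of SHIFT/LEFTARC/RIGHTARC on the input set {he, goes, home}.
# A state is (stack, remaining word set, arcs); the surface order is the
# order in which words are shifted.

def step(state, action, arg=None):
    stack, words, arcs = state
    if action == "SHIFT":                 # move a word onto the stack
        return stack + [arg], words - {arg}, arcs
    if action == "RIGHTARC":              # arc S1 -> S0, pop S0
        head, dep = stack[-2], stack[-1]
        return stack[:-1], words, arcs | {(head, dep, arg)}
    if action == "LEFTARC":               # arc S0 -> S1, pop S1
        head, dep = stack[-1], stack[-2]
        return stack[:-2] + [stack[-1]], words, arcs | {(head, dep, arg)}
    raise ValueError(action)

state = ([], {"he", "goes", "home"}, set())
for act, arg in [("SHIFT", "he"), ("SHIFT", "goes"), ("SHIFT", "home"),
                 ("RIGHTARC", "OBJ"), ("LEFTARC", "SBJ")]:
    state = step(state, act, arg)

print(state)  # stack holds the root "goes"; arcs give the linearized tree
```

After the five actions, the stack contains only the root, the word set is empty, and the arc set encodes the dependency tree over the surface order "he goes home".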

Shallow Graph Linearization
Our transition-based graph linearization system extends Puduppully et al. (2016). In our case, the input is a shallow graph instead of a syntactic tree, and hence the search space is larger. On the other hand, the same set of actions can still be applied, with additional constraints on valid actions for each configuration (Section 3.2.3). Table 3 shows the sequence of transition actions to linearize the shallow graph in Figure 3.

Obtaining Possible Transition Actions Given a Configuration
The purpose of the GETPOSSIBLEACTIONS function is to find the set of transition actions that can lead to a valid output given a certain state, since not all sequences of actions correspond to a well-formed output. Essentially, given a state s = ([σ|j i], ρ, A) and an input graph C, the decoder extracts a syntactic tree from the graph (cf. Algorithm 1). In particular, if node i has direct child nodes in C, the descendants of i are shifted (Algorithm 3, SHIFTSUBTREE). Here direct child nodes (Algorithm 2) include those child nodes of i for which i is the only parent, or, if there is more than one parent, for which every other parent has been shifted onto the stack without the possibility of reducing the child node. If no direct child node is in the buffer, then all graph descendants of i are shifted. Now, there are three configurations possible between i and j: 1. i and j are directly connected in C; this results in a RIGHTARC or LEFTARC action. 2. i is a descendant of j; in this case the parents of i (such that they are descendants of j) and the siblings of i through such parents are shifted. 3. i is a sibling of j; in this case, the parents of i and their descendants are shifted such that A remains consistent. Because the input is a graph, more than one of the above configurations can occur simultaneously. A more detailed discussion of GETPOSSIBLEACTIONS is given in Appendix A.

Table 4: Baseline linearization feature templates (a subset; for the full set see Table 2 of Liu et al. (2015)). Unigrams: S0w; S0p; S0,lw; S0,lp; S0,ll; S0,rw; S0,rp; S0,rl. Bigrams: S0wS0,lw; S0wS0,lp; S0wS0,ll; S0pS0,lw. Linearization: w0; p0; w-1w0; p-1p0; w-2w-1w0; p-2p-1p0.
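The core idea behind the validity filter can be sketched as follows. This is a much-simplified stand-in that only implements configuration 1 (arc actions are proposed only when the corresponding edge exists in C); the full Algorithm 1 additionally handles the descendant and sibling shifting described above:

```python
def get_possible_actions(state, graph):
    """Simplified validity filter for transition actions: arc actions are
    proposed only when the corresponding edge exists in the input graph C.
    A sketch of the idea behind GETPOSSIBLEACTIONS, not the full algorithm."""
    stack, words, arcs = state
    # Any remaining word may be shifted (the real algorithm restricts this too).
    actions = [("SHIFT", w) for w in sorted(words)]
    if len(stack) >= 2:
        s1, s0 = stack[-2], stack[-1]
        if (s1, s0) in graph:             # edge S1 -> S0 exists in C
            actions.append(("RIGHTARC", graph[(s1, s0)]))
        if (s0, s1) in graph:             # edge S0 -> S1 exists in C
            actions.append(("LEFTARC", graph[(s0, s1)]))
    return actions

# Input graph edges as {(head, dependent): label}.
g = {("goes", "he"): "SBJ", ("goes", "home"): "OBJ"}
acts = get_possible_actions((["he", "goes"], {"home"}, set()), g)
print(acts)
```

With he and goes on the stack, only SHIFT-home and LEFTARC-SBJ are valid; a RIGHTARC is ruled out because no edge he→goes exists in the graph.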

Feature Templates
There are three sets of features. The first is the set of baseline linearization feature templates from Table 2 of Liu et al. (2015), partly shown in Table 4. The second is a set of lookahead features similar to those of Puduppully et al. (2016), shown in Table 5. The parent lookahead feature in Puduppully et al. (2016) is defined for the single parent; for graph linearization, however, the parent lookahead feature needs to be defined for the set of parents. The third set of features, in Table 6, is newly introduced for graph linearization. Arc_left is a binary feature indicating whether there is a left arc between S0 and S1, whereas Arc_right indicates whether there is a right arc. L_is_descendant is a binary feature indicating whether L is a descendant of S0, and L_is_parent_or_sibling indicates whether it is a parent or sibling of S0. S0_descendants_shifted is a binary feature indicating whether all the descendants of S0 have been shifted.
Since the input dataset does not contain POS tags, we compute the POS feature templates using the most frequent POS of each lemma in the gold training data. For features with dependency labels, we use the input graph labels.
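The most-frequent-POS lookup can be sketched in a few lines (an illustrative implementation of the heuristic described above, with toy data):

```python
from collections import Counter, defaultdict

def most_frequent_pos(training_pairs):
    """Map each lemma to its most frequent POS in the gold training data,
    used as a stand-in for POS input."""
    counts = defaultdict(Counter)
    for lemma, pos in training_pairs:
        counts[lemma][pos] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

pairs = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("house", "NN")]
print(most_frequent_pos(pairs))  # {'run': 'VB', 'house': 'NN'}
```

At decoding time the table is a constant-time lookup, so the heuristic adds no cost to the transition system.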

Search and Learning
We follow Puduppully et al. (2016) and Liu et al. (2015), applying the learning and search framework of Zhang and Clark (2011). Pseudocode is shown in Algorithm 4. It performs beam search, holding the k best states in an agenda at each incremental step. At the start of decoding, the agenda holds the initial state. At each step, for each state in the agenda, each of the transition actions in GETPOSSIBLEACTIONS is applied, and the top-k resulting states are kept in the agenda for the next step. The process repeats for 2n steps, as each word needs to be shifted onto the stack once and reduced once. After 2n steps, the highest-scoring state in the agenda is taken as the output. The complexity of the algorithm is O(n^2), as it takes 2n steps to complete and during each step the number of transition actions is proportional to |ρ|. Given a configuration C, the score of a possible action a is calculated as score(C, a) = θ · Φ(C, a), where θ is the model parameter vector and Φ(C, a) denotes a feature vector consisting of configuration and action components.

Table 5: Lookahead linearization feature templates for the word L to shift (a subset; for the full feature set, refer to Table 2 of Puduppully et al. (2016); an identical set of feature templates is defined for S0). Set of label and POS of child nodes of L: Lcls; Lclns; Lcps; Lcpns; S0wLcls; S0pLcls; S1wLcls; S1pLcls. Set of label and POS of first-level siblings of L: Lsls; Lslns; Lsps; Lspns; S0wLsls; S0pLsls; S1wLsls; S1pLsls. Set of label and POS of parents of L: Lpls; Lplns; Lpps; Lppns; S0wLpls; S0pLpls; S1wLpls; S1pLpls.

Table 6: Graph linearization feature templates. Arc features between S0 and S1: Arc_left; Arc_right. Lookahead features for L: L_is_descendant; L_is_parent_or_sibling. Are all descendants of S0 shifted: S0_descendants_shifted. Feature combinations: S0_descendants_shifted Arc_left; S0_descendants_shifted Arc_right; S0_descendants_shifted L_is_descendant; S0_descendants_shifted L_is_parent_or_sibling.
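The beam search loop can be sketched generically. This is an illustrative rendering of the search described above (not the actual Algorithm 4); the get_actions/apply_action/score parameters are assumed interfaces, and score corresponds to θ · Φ(C, a):

```python
def beam_search(init_state, n_words, k, get_actions, apply_action, score):
    """Generic beam search over 2n transition steps (a sketch).
    Keeps the k highest-scoring states in the agenda at every step."""
    agenda = [(0.0, init_state)]
    for _ in range(2 * n_words):
        candidates = []
        for total, state in agenda:
            for action in get_actions(state):
                candidates.append((total + score(state, action),
                                   apply_action(state, action)))
        agenda = sorted(candidates, key=lambda x: -x[0])[:k]  # keep top-k
    return max(agenda)[1]  # highest-scoring completed state

# Toy check: two actions per step, and action "a" always scores higher.
best = beam_search("", n_words=2, k=3,
                   get_actions=lambda s: ["a", "b"],
                   apply_action=lambda s, a: s + a,
                   score=lambda s, a: 1.0 if a == "a" else 0.0)
print(best)
```

Because each of the 2n steps expands at most k states and the number of actions per state is proportional to the remaining words, the overall cost is O(n^2) per the analysis above.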
Given a set of labeled training examples, the averaged perceptron with early update (Collins and Roark, 2004) is used.
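Early update can be sketched with a toy structured perceptron. Everything here is a simplified illustration of Collins and Roark (2004), not the paper's trainer: a "state" is just an action prefix, and features are (previous action, action) bigrams:

```python
from collections import defaultdict

def train_early_update(examples, actions, k=2, epochs=3):
    """Structured perceptron with early update (Collins and Roark, 2004):
    beam-decode against the gold action sequence; as soon as the gold
    prefix falls out of the beam, update on the partial sequences and
    move to the next example.  A toy sketch."""
    w = defaultdict(float)

    def feats(prefix, a):
        prev = prefix[-1] if prefix else "<s>"
        return [(prev, a)]

    def seq_feats(seq):
        return [f for i, a in enumerate(seq) for f in feats(seq[:i], a)]

    for _ in range(epochs):
        for gold in examples:
            beam = [((), 0.0)]
            for i in range(len(gold)):
                cands = [(p + (a,), s + sum(w[f] for f in feats(p, a)))
                         for p, s in beam for a in actions]
                beam = sorted(cands, key=lambda x: -x[1])[:k]
                if gold[:i + 1] not in [p for p, _ in beam]:  # gold fell out
                    best = beam[0][0]
                    for f in seq_feats(gold[:i + 1]): w[f] += 1.0
                    for f in seq_feats(best):         w[f] -= 1.0
                    break   # early update: stop decoding this example
            else:
                best = beam[0][0]
                if best != gold:                       # full update at the end
                    for f in seq_feats(gold): w[f] += 1.0
                    for f in seq_feats(best): w[f] -= 1.0
    return w

w = train_early_update([("a", "b", "a")], ["a", "b"])
print(dict(w))
```

Updating on the prefix at the point of divergence, rather than on the full (already hopeless) sequence, is what makes beam-search perceptron training stable.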

Morphological Generation
The last step is to inflect the lemmas in the sentence. There are three POS categories (nouns, verbs and articles) for which we need to generate morphological forms. We use Wiktionary (https://en.wiktionary.org/) as a basis and write a small set of rules, listed in Table 7, to generate a candidate set of inflections. An averaged perceptron classifier (Collins, 2002) is trained for each lemma. For distinguishing between singular and plural candidate verb forms, the feature templates in Table 8 are used.
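Candidate generation can be sketched with a couple of rules. The rules below are illustrative stand-ins for the actual Table 7 rules (which are backed by Wiktionary), not a reproduction of them:

```python
def inflection_candidates(lemma, pos):
    """Generate candidate surface forms for a lemma via a small rule set,
    in the spirit of the Wiktionary-backed rules in Table 7.  The rules
    here are illustrative, not the actual table."""
    if pos.startswith("NN"):                      # nouns: singular / plural
        plural = lemma + ("es" if lemma.endswith(("s", "x", "ch", "sh"))
                          else "s")
        return [lemma, plural]
    if pos.startswith("VB"):                      # verbs: base / 3sg present
        third = lemma + ("es" if lemma.endswith(("s", "x", "ch", "sh", "o"))
                         else "s")
        return [lemma, third]
    return [lemma]                                # articles etc.: unchanged

print(inflection_candidates("box", "NN"))   # ['box', 'boxes']
print(inflection_candidates("go", "VB"))    # ['go', 'goes']
```

The per-lemma classifier then only has to choose among this small candidate set, rather than generate forms from scratch.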

Joint Method
We design a joint method for function word prediction (Section 3.1), linearization (Section 3.2) and morphological generation (Section 3.3) by further extending the transition-based system of Section 3.2, integrating actions for function word prediction and morphological generation.

Transition Actions
In addition to SHIFT, LEFTARC and RIGHTARC from Section 3.2.1, we use the following new transition actions for inserting function words:
• INSERT, which inserts a comma at the present position;
• SPLITARC-Word, which splits an arc in the input graph C, inserting a function word between the words connected by the arc; here Word specifies the function word being inserted (Figure 5).
We generate a candidate set of inflections for each lemma following the approach in Section 3.3. For each candidate inflection, we generate a corresponding SHIFT transition action. The rules in Table 7 are used to prune impossible inflections. Table 9 shows the transition actions to linearize the graph in Figure 2. These newly introduced transition actions result in variability in the number of transition actions: with function word prediction, the number of transition actions for a bag of n words is not necessarily 2n-1. For example, considering an INSERT, SPLITARC-to or SPLITARC-that action after each SHIFT action, the maximum number of possible actions is 5n-1. This variance in the number of actions can impact the linear separability of state items. Following Zhu et al. (2013), we use IDLE actions as a padding method, so that completed state items are further expanded up to 5n-1 steps. The joint model uses the same perceptron training algorithm and similar features compared to the baseline model.
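The IDLE padding arithmetic can be made concrete with a small sketch (illustrative code, not the authors'; it only shows the length bookkeeping, not the scoring):

```python
def pad_with_idle(actions, n_words):
    """Pad a completed action sequence with IDLE actions up to the maximum
    length 5n - 1, following Zhu et al. (2013), so that all state items are
    scored over the same number of steps.  A sketch."""
    max_len = 5 * n_words - 1
    assert len(actions) <= max_len
    return actions + ["IDLE"] * (max_len - len(actions))

# {he, goes, home}: 3 SHIFTs + 2 arc actions = 5 actions, padded to 5*3-1 = 14.
seq = ["SHIFT-he", "SHIFT-goes", "SHIFT-home", "RIGHTARC-OBJ", "LEFTARC-SBJ"]
padded = pad_with_idle(seq, 3)
print(len(padded))  # 14
```

Padding every completed derivation to the same length keeps the perceptron scores of competing state items comparable, which is the linear-separability concern mentioned above.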

Obtaining Possible Transition Actions Given a Configuration
Given a state s = ([σ|j i], ρ, A) and an input graph C, the possible transition actions include as a subset the transition actions in Algorithm 1 for shallow graph linearization. In addition, for each lemma being shifted, we enumerate its inflections and create a SHIFT transition action for each inflection. Further, we predict SPLITARC, INSERT and IDLE actions to handle function words. If node i has a child node in C which has not been shifted, we predict SPLITARC and INSERT. If i is a sibling of j, we predict INSERT. If both the stack and the buffer are empty, we predict IDLE. Pseudocode for GETPOSSIBLEACTIONS for the joint method is shown in Algorithm 5.

Dataset
We work on the deep dataset from the First Surface Realisation Shared Task (Belz et al., 2011). Sentences are represented as sets of unordered nodes with labeled semantic edges between them. The semantic representation is obtained by merging NomBank (Meyers et al., 2004), PropBank (Palmer et al., 2005) and syntactic dependencies. The that complementizer, to infinitive and commas are omitted from the input. There are two punctuation features carrying information about brackets and quotes. Table 10 shows a sample training instance.
Out of 39k total training instances, 2.8k are non-projective, which we discard. We exclude instances that result in non-projective dependencies because our transition system, derived from the arc-standard system (Nivre, 2008), predicts only projective dependencies. There are 1.8k training instances with a mismatch between the edges in the input deep graph and the gold output tree, where the gold output tree is the corresponding shallow tree from the shared task. We approach the task of linearization as extracting a linearized tree from the input semantic graph, so we exclude those instances whose gold trees contain edges with no counterpart in the input graph. After excluding these instances, we have 34.3k training instances. We also exclude 800 training instances where the function words to and that have more than one child, and around 100 training instances where a function word's parent and child nodes are not connected by an arc in the deep graph. The above cases are deemed annotation mistakes. We thus train on a final subset of 33.4k training instances. The development set comprises 1034 instances and the test set comprises 2398 instances. Evaluation is done using the BLEU metric (Papineni et al., 2002).

Table 10: A sample training instance. Sem - semantic label; ID - unique ID of a node within the graph; PID - the ID of the parent; Attr - attributes such as partic (participle), tense or number; Lexeme - lexeme, resolved using Wiktionary and the rules in Table 7.
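The projectivity filter used to discard those 2.8k instances can be sketched as a crossing-arcs check (an illustrative implementation of the standard test, not the paper's preprocessing code):

```python
def is_projective(heads):
    """Check projectivity of a dependency tree given as a head array
    (heads[i] is the head index of word i; -1 marks the root).  A tree is
    non-projective iff two arcs cross, i.e. their spans overlap without
    one nesting inside the other."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h >= 0]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:   # spans overlap but neither contains the other
                return False
    return True

print(is_projective([1, -1, 1]))      # "he <- goes -> home": projective
print(is_projective([2, 3, -1, 2]))   # arcs (0,2) and (1,3) cross
```

Since arc-standard derivations can only produce projective trees, any non-projective gold instance is unreachable by the transition system and is therefore removed from training.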
Development Results

Influence of Beam Size
We study the effect of beam size on the accuracy of the joint model in Figure 6, varying the beam size and comparing accuracies on the development dataset over training iterations. Beam sizes of 64 and 128 perform best. However, beam size 128 does not improve performance significantly, yet is twice as slow as beam size 64, so we retain a beam of 64 for further experiments.

Table 11: Average F-measure for function word prediction on the development set.

Pipeline vs Joint Model
We compare the results of the joint model with the pipeline baseline system. Table 11 shows the development results for function word prediction, and Table 12 shows the overall development results. Our joint model of Transition-Based Deep Input Linearization (TBDIL) achieves an improvement of 5 BLEU points over the pipeline using the same feature source and training algorithm. Thanks to the sharing of word order information, the joint model improves function word prediction compared to the pipeline, which forbids such feature integration because function word prediction is its first step, taken before word order becomes available.

Table 13 shows the final results. The best performing system in the Shared Task was STUMABA-D by Bohnet et al. (2011), which leverages a large-scale n-gram language model. The joint model TBDIL significantly outperforms the pipeline system and achieves an improvement of 1 BLEU point over STUMABA-D, obtaining 80.49 BLEU without making use of external resources.

Analysis

Table 14 shows sample outputs from the pipeline system and the corresponding outputs from TBDIL:

ref.: if it does n't yield on these matters and eventually begin talking directly to the anc
Pipeline: if it does not to yield on these matters and eventually begin talking directly to the anc
TBDIL: if it does n't yield on these matters and eventually begin talking directly to the anc

ref.: economists who read september 's low level of factory job growth as a sign of a slowdown
Pipeline: september 's low level of factory job growth who as a sign of a slowdown reads economists
TBDIL: economists who read september 's low level of factory job growth as a sign of a slowdown

In the first instance, the function word to is incorrectly predicted on the arc between the nodes does and yield in the pipeline system. In TBDIL, the n-gram feature helps avoid the incorrect insertion of to, which demonstrates the advantage of integrating information across stages. In the second instance, incorrect linearization causes error propagation to morphological generation in the pipeline system. In particular, economists is linearized into the object part of the sentence and the subject is singular; this, in turn, results in the incorrect prediction of the morphological form of the verb read as its singular variant. In TBDIL, in contrast, the joint modelling of linearization and morphology helps order the sentence correctly.

Conclusion
We showed the usefulness of a joint model for the task of deep input linearization, taking Puduppully et al. (2016) as the baseline and extending it to perform joint graph linearization, function word prediction and morphological generation. To our knowledge, this is the first work to use a transition-based method for joint NLG from semantic structures. Our system gave the highest scores reported for the deep track of the 2011 Surface Realisation Shared Task (Belz et al., 2011).

A Obtaining possible transition actions given a configuration for Shallow Graph
During shallow linearization, a state is represented by s = ([σ|j i], ρ, A), and C is the input graph. Given C, the decoder outputs actions that extract a syntactic tree from the graph; thus the decoder outputs RIGHTARC or LEFTARC only if the corresponding arc exists in C. Detailed pseudocode is given in Algorithm 1. If i has direct child nodes in C, the descendants of i are shifted (lines 6-7; see Algorithm 3). Here, direct child nodes (see Algorithm 2) include those child nodes of i for which i is the only parent, or, if there is more than one parent, for which every other parent has been shifted onto the stack without the possibility of reducing the child node. If no direct child node is in the buffer, then the descendants of i are shifted (lines 9-10). Now, there are three configurations possible between i and j: 1. i and j are connected by an arc in C; this results in a RIGHTARC or LEFTARC action. 2. i is a descendant of j; in this case the parents of i (such that they are descendants of j) and the siblings of i through such parents are shifted. 3. i is a sibling of j; in this case, the parents of i and their descendants are shifted such that A remains consistent. Additionally, because the input is a graph structure, more than one of the above configurations can occur simultaneously. We analyse the three configurations in detail below.
Since the direct child nodes of i are shifted, {j ← i} results in a LEFTARC action (line 18). Also, because the input is a graph, i can be a sibling node of j. In this case, the valid parents and siblings of i are shifted. We iterate through the other elements in the stack to identify the valid parents and siblings. These conditions are encapsulated in PROCESSSIBLING (line 20). Conditions for RIGHTARC are similar to those for LEFTARC, with the following differences. We ensure that there is no left arc relationship for j in A (line 11); if there is, it means that, in an arc-standard setting, the RIGHTARC actions for j have already been made. If i is a descendant of j, the valid parents and siblings of i are shifted. We iterate through the parents of i: those parents which are in turn descendants of j and have not been shifted onto the stack are valid parents, and we shift each such parent and the subtree under it. These conditions are denoted by PROCESSDESCENDANT (line 14). Consider an example to see the working of PROCESSSIBLING in detail. In PROCESSSIBLING, we need to ensure that i is in the stack because of a sibling relation with j, and we need to shift the valid parent nodes of i and their descendants. We call these valid nodes inflection points. Consider the stack entries [D, A, B, C] with C as the stack top, and assume the input graph of Figure 7. C is a sibling of B through B's parents X11, X12 and X13. Of these, only X11 and X12 are valid parents. X13 is a sibling of A through A's parent X23, but X23 is in turn neither a descendant of D nor a sibling of D; thus X13 is not a valid inflection point for C. Now, X12 is a sibling of A through A's parent X22, and X22 is in turn a sibling of D through X32; thus there is a path to the stack bottom through a chain of siblings/descendants. In the case of X11, X11 is a descendant of the stack element A and is thus valid. X11 and X12 are called valid inflection points.
If an inflection point is a common parent of both S0 and S1, then the inflection point and its descendants are shifted. If, instead, an inflection point is an ancestor of S0, then the parents of S0 (say P0) which are descendants of the inflection point are shifted, along with the descendants of P0.