Neural Transition-based Syntactic Linearization

The task of linearization is to find a grammatical order for a given set of words. Traditional models use statistical methods. Syntactic linearization systems, which generate a sentence along with its syntactic tree, have shown state-of-the-art performance. Recent work shows that a multi-layer LSTM language model outperforms competitive statistical syntactic linearization systems without using syntax. In this paper, we study neural syntactic linearization, building a transition-based syntactic linearizer that leverages a feed-forward neural network, and observe significantly better results compared to LSTM language models on this task.


Introduction
Linearization is the task of finding the grammatical order for a given set of words. Syntactic linearization systems generate output sentences along with their syntactic trees. Depending on how much syntactic information is available during decoding, recent work on syntactic linearization can be classified into abstract word ordering (Wan et al., 2009; Zhang et al., 2012; de Gispert et al., 2014), where no syntactic information is available during decoding, full tree linearization (He et al., 2009; Bohnet et al., 2010), where full tree information is available, and partial tree linearization (Zhang, 2013), where partial syntactic information is given as input. Linearization has been adapted to tasks such as machine translation, and is potentially helpful for many NLG applications, such as cooking recipe generation (Kiddon et al., 2016), dialogue response generation (Wen et al., 2015), and question generation (Serban et al., 2016).
Previous work (Wan et al., 2009) has shown that jointly predicting the syntactic tree and the surface string gives better results by allowing syntactic information to guide statistical linearization. On the other hand, most such methods employ statistical models with discriminative features. Recently, Schmaltz et al. (2016) report new state-of-the-art results by leveraging a neural language model without using syntactic information. In their experiments, the neural language model, which is less sparse and captures long-range dependencies, outperforms previous discrete syntactic systems.
A research question that naturally arises from this result is whether syntactic information is helpful for a neural linearization system. We empirically answer this question by comparing a neural transition-based syntactic linearizer with the neural language model of Schmaltz et al. (2016). Following prior transition-based linearization work, our linearizer works incrementally given a set of words, using a stack to store partially built dependency trees and a set to maintain unordered incoming words. At each step, it either shifts a word onto the stack or reduces the top two partial trees on the stack. We leverage a feed-forward neural network, which takes stack features as input and predicts the next action (such as SHIFT, LEFTARC, and RIGHTARC). Hence our method can be regarded as an extension of the parser of Chen and Manning (2014), adding word ordering functionalities.
In addition, we investigate two methods for integrating neural language models: interpolating the log probabilities of both models and integrating the neural language model as a feature. On standard benchmarks, our syntactic linearizer gives results that are higher than the LSTM language model of Schmaltz et al. (2016) by 7 BLEU points (Papineni et al., 2002) using greedy search, and the gap can go up to 11 BLEU points by integrating the LSTM language model as features. The integrated system also outperforms the LSTM language model by 1 BLEU point using beam search, which shows that syntactic information is useful for a neural linearization system.

Related work
Previous work (White, 2005; White and Rajkumar, 2009; Zhang and Clark, 2011; Zhang, 2013) on syntactic linearization uses best-first search, which adopts a priority queue to store partial hypotheses and a chart to store input words. At each step, it pops the highest-scored hypothesis from the priority queue, expanding it by combination with the words in the chart, before finally putting all new hypotheses back into the priority queue. As the search space is huge, a timeout threshold is set, beyond which the search terminates and the current best hypothesis is taken as the result. Subsequent work adapts the transition-based dependency parsing algorithm to the linearization task by allowing the transition-based system to shift any word in the given set, rather than the first word in the buffer as in dependency parsing. Their results show much lower search times and higher performance compared to Zhang (2013). Following this line, later work further improves performance by incorporating an n-gram language model. Our work takes the transition-based framework, but is different in two main aspects: first, we train a feed-forward neural network for making decisions, while prior systems all use perceptron-like models. Second, we investigate a light version of the system, which only uses word features, while previous systems all rely on POS tags and arc labels, limiting their usability on low-resource domains and languages. Schmaltz et al. (2016) are the first to adopt neural networks for this task, while only using surface features. To our knowledge, we are the first to leverage both neural networks and syntactic features. The contrast between our method and the method of Chen and Manning (2014) is reminiscent of the contrast between earlier transition-based linearizers and the dependency parser of Zhang and Nivre (2011). Compared with dependency parsing, which assumes that POS tags are available as input, the search space of syntactic linearization is much larger.
Recent work (Zhang, 2013) on syntactic linearization uses dependency grammar. We follow this line of work. On the other hand, linearization with other syntactic grammars, such as context free grammar (de Gispert et al., 2014) and combinatory categorial grammar (White and Rajkumar, 2009; Zhang and Clark, 2011), has also been studied.

Task
Given an input bag-of-words x = {x_1, x_2, ..., x_n}, the goal is to output the correct permutation y, which recovers the original sentence, from the set of all possible permutations Y. A linearizer can be seen as a scoring function f over Y, which is trained so that its highest-scoring permutation ŷ = argmax_{y' ∈ Y} f(x, y') is as close as possible to the correct permutation y.
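The scoring-function view above can be illustrated with a toy brute-force linearizer. The scoring function below is a made-up stand-in, and a real system never enumerates all n! permutations; this only shows the argmax-over-permutations formulation:

```python
from itertools import permutations

def linearize_brute_force(words, score):
    # Enumerate all n! permutations and return the highest-scoring one.
    # Only feasible for tiny n; real linearizers search incrementally.
    return max(permutations(words), key=score)

# Hypothetical score: count positions agreeing with "I love NLP".
def toy_score(perm):
    target = ("I", "love", "NLP")
    return sum(1 for a, b in zip(perm, target) if a == b)

best = linearize_brute_force({"NLP", "love", "I"}, toy_score)
```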

Baseline: an LSTM language model
The LSTM language model of Schmaltz et al. (2016) is similar to the medium LSTM setup of Zaremba et al. (2014). It contains two LSTM layers, each of which has 650 hidden units and is followed by a dropout layer during training. The multi-layer LSTM language model can be represented as:

$$h_{t,i}, c_{t,i} = \mathrm{LSTM}(h_{t,i-1}, h_{t-1,i}, c_{t-1,i}), \quad i = 1, \ldots, I$$
$$p(w_{t,j} \mid w_1, \ldots, w_{t-1}) = \frac{\exp(v_j^\top h_{t,I})}{\sum_{j'} \exp(v_{j'}^\top h_{t,I})}$$

where h_{t,i} and c_{t,i} are the output and cell memory of the i-th layer at step t, respectively, h_{t,0} = x_t is the input of the network at step t, I is the number of layers, w_{t,j} represents outputting w_j at step t, v_j is the embedding of w_j, and the LSTM function is defined as:

$$\begin{bmatrix} i \\ f \\ o \\ g \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W_{4n,2n} \begin{bmatrix} h_{t,i-1} \\ h_{t-1,i} \end{bmatrix} \right)$$
$$c_{t,i} = f \odot c_{t-1,i} + i \odot g, \qquad h_{t,i} = o \odot \tanh(c_{t,i})$$

where σ is the sigmoid function, W_{4n,2n} is the weight matrix of the LSTM cells, and ⊙ is the element-wise product operator. Figure 1 shows the linearization procedure of the baseline system when taking the bag-of-words {"NLP", "love", "I"} as input. At each step, it takes the output word from the previous step as input and predicts the current word, which is chosen from the remaining input bag-of-words rather than from the entire vocabulary. Therefore it takes n steps to linearize an input consisting of n words.
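The baseline's greedy decoding loop can be sketched as follows, substituting a toy bigram table for the actual two-layer LSTM. The probabilities in `BIGRAMS` are invented for illustration; only the loop structure (scoring only the words remaining in the bag) reflects the procedure described above:

```python
def greedy_lm_linearize(bag, next_word_prob):
    # Greedy decoding: at each step, score only the words remaining in
    # the bag (not the whole vocabulary) and emit the most probable one.
    remaining = list(bag)
    output = []
    while remaining:
        w = max(remaining, key=lambda c: next_word_prob(output, c))
        output.append(w)
        remaining.remove(w)
    return output

# Stand-in for the LSTM: a hypothetical bigram probability table.
BIGRAMS = {("<s>", "I"): 0.9, ("<s>", "love"): 0.05, ("<s>", "NLP"): 0.05,
           ("I", "love"): 0.9, ("I", "NLP"): 0.1,
           ("love", "NLP"): 0.9, ("love", "I"): 0.1}

def bigram_prob(prefix, cand):
    prev = prefix[-1] if prefix else "<s>"
    return BIGRAMS.get((prev, cand), 0.01)

sent = greedy_lm_linearize({"NLP", "love", "I"}, bigram_prob)
```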

Neural transition-based syntactic linearization
Transition-based syntactic linearization can be considered as an extension of transition-based dependency parsing, with the main difference being that the word order is not given in the input, so that any word can be shifted at each step. This leads to a much larger search space. In addition, under our setting, no dependency relations or POS tags on the input words are available. The output building process is modeled as a state-transition process. As shown in Figure 2, each state s is defined as (σ, ρ, A), where σ is a stack that maintains a partial derivation, ρ is an unordered set of incoming input words, and A is the set of dependency relations that have been built. Initially, the stack σ is empty, the set ρ contains all the input words, and the set of dependency relations A is empty. At the end, the set ρ is empty, while A contains all dependency relations of the predicted dependency tree. At a given state, a SHIFT action chooses one word from the set ρ and pushes it onto the stack σ, a LEFTARC action makes a new arc {j ← i} from the stack's top two items (i and j), and a RIGHTARC action makes a new arc {j → i} from i and j. Using these actions, the unordered word set {"NLP", "love", "I"} is linearized as shown in Table 1, and the result is "I ← love → NLP".¹

¹ For a clearer introduction to our state-transition process, we omit the POS-p actions, which are introduced in Section 4.2. In our implementation, each SHIFT-w is followed by exactly one POS-p action.

Table 1: Transition-based syntactic linearization for ordering {"NLP", "love", "I"}, where RArc and LArc are abbreviations for RIGHTARC and LEFTARC, respectively. More details on the actions are given in Section 4.2.
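The state-transition process can be sketched with plain (head, dependent) pairs, replaying Table 1's derivation for {"NLP", "love", "I"}. The class and function names are ours, not from the paper's implementation, and POS actions are omitted as in the footnote:

```python
class State:
    """State (sigma, rho, A): stack, unordered word set, arc set."""
    def __init__(self, words):
        self.stack = []            # sigma: partially built trees
        self.words = set(words)    # rho: unordered incoming words
        self.arcs = set()          # A: (head, dependent) pairs

def shift(state, word):
    # SHIFT-w: move any word from the set onto the stack.
    assert word in state.words
    state.words.remove(word)
    state.stack.append(word)

def left_arc(state):
    # LEFTARC on top two items (i on top, j below): arc {j <- i},
    # i.e. i heads j; the combined tree (rooted at i) stays on the stack.
    i = state.stack.pop()
    j = state.stack.pop()
    state.arcs.add((i, j))
    state.stack.append(i)

def right_arc(state):
    # RIGHTARC: arc {j -> i}, i.e. j heads i; j stays on the stack.
    i = state.stack.pop()
    j = state.stack.pop()
    state.arcs.add((j, i))
    state.stack.append(j)

# Derivation producing "I <- love -> NLP":
s = State({"NLP", "love", "I"})
shift(s, "I"); shift(s, "love"); left_arc(s)
shift(s, "NLP"); right_arc(s)
```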

Model
To predict the next transition action for a given state, our linearizer makes use of a feed-forward neural network to score the actions, as shown in Figure 3. The network takes a set of word, POS tag, and arc label features from the stack as input and outputs a probability distribution over the next actions. In particular, we represent each word as a d-dimensional vector e^w_i ∈ R^d using a word embedding matrix E^w ∈ R^{d×N_w}, where N_w is the vocabulary size. Similarly, each POS tag and arc label is also mapped to a d-dimensional vector, where e^t_j, e^l_k ∈ R^d are the representations of the j-th POS tag and the k-th arc label, respectively. The embedding matrices of POS tags and arc labels are E^t ∈ R^{d×N_t} and E^l ∈ R^{d×N_l}, where N_t and N_l are the numbers of POS tags and arc labels, respectively. We choose a set of feature words, POS tags, and arc labels from the stack context, using their embeddings as input to our neural network. Next, we map the input layer to the hidden layer via:

$$h = g(W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)$$

where x^w, x^t, and x^l are the concatenated feature word embeddings, POS tag embeddings, and arc label embeddings, respectively, W_1^w, W_1^t, and W_1^l are the corresponding weight matrices, b_1 is the bias term, and g(·) is the activation function of the hidden layer (Figure 3: Neural syntactic linearization model). The word, POS tag, and arc label features are described in Section 4.3.
Finally, the hidden vector h is mapped to an output layer, which uses a softmax activation function to model multi-class action probabilities:

$$p(a \mid s; \theta) = \mathrm{softmax}(W_2 h)$$

where p(a|s; θ) represents the probability distribution over next actions. There is no bias term in this layer, and the model parameter W_2 can also be seen as the embedding matrix of all actions.
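The two layers above can be sketched end-to-end with NumPy. All dimensions, feature counts, and the random initialization below are toy values, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_w, n_t, n_l = 8, 100, 20, 10       # toy embedding / vocab sizes
n_feat_w, n_feat_t, n_feat_l = 3, 3, 2  # toy feature counts per type
h_dim, n_actions = 16, 30

# Embedding matrices E^w, E^t, E^l and weights W1^w, W1^t, W1^l, b1, W2.
E_w = rng.normal(0, 0.01, (d, n_w))
E_t = rng.normal(0, 0.01, (d, n_t))
E_l = rng.normal(0, 0.01, (d, n_l))
W1_w = rng.normal(0, 0.01, (h_dim, d * n_feat_w))
W1_t = rng.normal(0, 0.01, (h_dim, d * n_feat_t))
W1_l = rng.normal(0, 0.01, (h_dim, d * n_feat_l))
b1 = np.zeros(h_dim)
W2 = rng.normal(0, 0.01, (n_actions, h_dim))

def action_distribution(word_ids, tag_ids, label_ids):
    # Concatenate the looked-up embeddings, apply the tanh hidden
    # layer, then a softmax output layer (no output bias).
    x_w = E_w[:, word_ids].T.reshape(-1)
    x_t = E_t[:, tag_ids].T.reshape(-1)
    x_l = E_l[:, label_ids].T.reshape(-1)
    h = np.tanh(W1_w @ x_w + W1_t @ x_t + W1_l @ x_l + b1)
    logits = W2 @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = action_distribution([1, 2, 3], [0, 1, 2], [0, 1])
```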

Actions
We use 5 types of actions:
• SHIFT-w pushes the word w onto the stack.
• POS-p assigns a POS tag p to the newly shifted word.
• LEFTARC-l pops the top two items i and j off the stack and pushes the combined tree {j ←_l i} back onto the stack.
• RIGHTARC-l pops the top two items i and j off the stack and pushes the combined tree {j →_l i} back onto the stack.
• END ends the decoding procedure.
Given a set of n words as input, the linearizer takes 3n steps to synthesize the sentence. The number of actions is large, making it computationally inefficient to perform a softmax over all actions. For each input set of words we therefore only consider the actions that can possibly linearize the set, which constrains SHIFT-w_i to the words in the set.
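The restriction of the softmax to feasible actions can be sketched as a masking step. The action inventory and logits below are hypothetical:

```python
import numpy as np

def feasible_softmax(logits, feasible_ids):
    # Softmax computed only over the feasible action subset;
    # infeasible actions receive probability zero.
    probs = np.zeros_like(logits, dtype=float)
    sub = logits[feasible_ids]
    e = np.exp(sub - sub.max())
    probs[feasible_ids] = e / e.sum()
    return probs

# Hypothetical action inventory for the 2-word input {"love", "I"}:
ACTIONS = ["SHIFT-love", "SHIFT-I", "SHIFT-cat",   # "cat" not in the set
           "LEFTARC", "RIGHTARC", "END"]
logits = np.array([2.0, 1.0, 5.0, 0.5, 0.5, 0.0])
feasible = [0, 1, 3, 4, 5]   # exclude the infeasible SHIFT-cat
p = feasible_softmax(logits, feasible)
```

Note that even though SHIFT-cat has the highest raw logit, it is masked out, so the model can never shift a word that is absent from the input set.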

Features
The feature templates our model uses are shown in Table 2. We pick (1) the words and POS tags of the top 3 items on the stack, (2) the words, POS tags, and arc labels of the first and second leftmost / rightmost children of the top 2 items on the stack, and (3) the words, POS tags, and arc labels of the leftmost-of-leftmost / rightmost-of-rightmost children of the top two items on the stack. Under certain states, some features may not exist, and we use special tokens NULL_w, NULL_t, and NULL_l to represent non-existent word, POS tag, and arc label features, respectively. Our feature templates are similar to those of Chen and Manning (2014), except that we do not leverage features from the set, because the words inside the set are unordered.
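Padding missing features with NULL tokens can be sketched for the word features of the top 3 stack items; the token spellings below are our own, not the paper's:

```python
NULL_W = "<null-w>"  # hypothetical spelling of the NULL word token

def stack_word_features(stack):
    # Words of the top 3 stack items, padded with the NULL word token
    # when the stack is shallower. The unordered set contributes no
    # features, so only the stack is inspected.
    return [stack[-k] if len(stack) >= k else NULL_W
            for k in range(1, 4)]

feats = stack_word_features(["I", "love"])   # stack with 2 items
```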

The light version
We also consider a light version of our linearizer that only leverages words and unlabeled dependency relations. Similar to Section 4.1, the system also uses a feed-forward neural network with one hidden layer, but only takes word features as input. It uses 4 types of actions: SHIFT-w, LEFTARC, RIGHTARC, and END. All actions are the same as described in Section 4.2, except that LEFTARC and RIGHTARC are not associated with arc labels. Given a set of n words as input, the system takes 2n steps to synthesize the sentence, which is faster and less vulnerable to error propagation.

Integrating an LSTM language model
Our model can be integrated with the baseline multi-layer LSTM language model. Existing work (Zhang et al., 2012) has shown that a syntactic linearizer can benefit from a surface language model by taking its scores as features. Here we investigate two methods for the integration: (1) joint decoding by interpolating the log probabilities of both models, and (2) feature-level integration by taking the output vector (h_I) of the LSTM language model as a feature of the linearizer.

Joint decoding
To perform joint decoding, the conditional action probability distributions of both models given the current state are interpolated, and the best action under the interpolated distribution is chosen, before both systems advance to a new state using that action. The interpolated conditional (log) probability is:

$$\log p(a \mid s_i, h_i; \theta_1, \theta_2) = \alpha \log p(a \mid s_i; \theta_1) + (1 - \alpha) \log p_{lm}(a \mid h_i; \theta_2)$$

where s_i and θ_1 are the state and parameters of the linearizer, h_i and θ_2 are the state and parameters of the LSTM language model, and α is the interpolation hyperparameter. The action spaces of the two systems are different, because the actions of the LSTM language model correspond only to the shift actions of the linearizer. To match the probability distributions, we expand the distribution of the LSTM language model as shown in Equation 9, where w_a is the word associated with a shift action a. The probabilities of non-shift actions are set to 1.0, and those of shift actions come from the LSTM language model with respect to w_a:

$$p_{lm}(a \mid h_i; \theta_2) = \begin{cases} p(w_a \mid h_i; \theta_2) & \text{if } a \text{ is a shift action} \\ 1.0 & \text{otherwise} \end{cases}$$

We do not normalize the interpolated probability distribution, because our experiments show that normalization only gives around 0.3 BLEU points, while significantly decreasing the speed. When a shift action is chosen, both systems advance to a new state; otherwise only the linearizer advances to a new state.
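Assuming the log-linear reading of the interpolation (Section 5 says the log probabilities are interpolated; the exact equation is a sketch under that assumption), scoring one action can be written as:

```python
import math

def expand_lm_prob(action, lm_word_probs):
    # Expansion of the LM distribution onto the linearizer's action
    # space: shift actions take the LM probability of their word; all
    # other actions take probability 1.0.
    if action.startswith("SHIFT-"):
        return lm_word_probs.get(action[len("SHIFT-"):], 1e-12)
    return 1.0

def joint_log_prob(action, lin_probs, lm_word_probs, alpha=0.4):
    # Unnormalized log-linear interpolation:
    #   alpha * log p_lin + (1 - alpha) * log p_lm
    return (alpha * math.log(lin_probs[action])
            + (1 - alpha) * math.log(expand_lm_prob(action, lm_word_probs)))

# Hypothetical distributions over three feasible actions:
lin = {"SHIFT-I": 0.6, "SHIFT-love": 0.3, "LEFTARC": 0.1}
lm = {"I": 0.8, "love": 0.2}
best = max(lin, key=lambda a: joint_log_prob(a, lin, lm, alpha=0.4))
```

Because non-shift actions contribute log 1.0 = 0 on the LM side, they are scored by the linearizer alone, which is one reason the unnormalized combination is workable.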

Feature level integration
To take the output of an LSTM language model as a feature in our model, we first train the LSTM language model independently. During the training of our model, we take h_I, the output of the top LSTM layer after consuming all words on the stack, as a feature in the input layer of Figure 3, before finally advancing both the linearizer and the LSTM language model using the predicted action. This is analogous to adding a separately trained n-gram language model as a feature to a discriminative linearizer. Compared with joint decoding (Section 5.1), p(a|s_i, h_i; θ_1, θ_2) is calculated by one model, and thus there is no need to tune the hyper-parameter α. The state update remains the same: the language model advances to a new state only when a shift action is taken.

Training
Following Chen and Manning (2014), we set the training objective as maximizing the log-likelihood of each successive action conditioned on the dependency tree, which can be gold or automatically parsed. To train our linearizer, we first generate training examples {(s_i, t_i)}_{i=1}^m from the training sentences and their gold parse trees, where s_i is a state and t_i ∈ T is the corresponding oracle transition. We use the "arc standard" oracle (Nivre, 2008), which always prefers SHIFT over LEFTARC. The final training objective is to minimize the cross-entropy loss plus an L2-regularization term:

$$L(\theta) = -\sum_{i=1}^{m} \log p(t_i \mid s_i; \theta) + \frac{\lambda}{2} \|\theta\|^2$$

where θ represents all the trainable parameters: E^w, E^t, E^l, W_1^w, W_1^t, W_1^l, b_1, and W_2. A slight variation is that, in practice, the softmax probabilities are computed only among the feasible transitions. As described in Section 4.2, for an input set of words, the feasible transitions are: SHIFT-w, where w is a word in the set, POS-p for all POS tags, LEFTARC-l and RIGHTARC-l for all arc labels, and END.
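The objective, restricted to feasible transitions, can be sketched with toy inputs. A real implementation would batch this and rely on an autodiff framework rather than computing the loss by hand:

```python
import numpy as np

def training_loss(logit_batches, gold_ids, feasible_masks, params, lam=1e-8):
    # Cross-entropy over feasible transitions only, plus L2:
    #   L(theta) = -sum_i log p(t_i | s_i; theta) + (lam/2) * ||theta||^2
    loss = 0.0
    for logits, gold, mask in zip(logit_batches, gold_ids, feasible_masks):
        masked = np.where(mask, logits, -np.inf)   # infeasible -> prob 0
        z = masked - masked.max()                   # stable log-softmax
        log_probs = z - np.log(np.exp(z).sum())
        loss -= log_probs[gold]
    for p in params:
        loss += lam / 2 * np.sum(p ** 2)
    return loss

# One toy example: 3 actions, the third infeasible, gold action index 0.
loss = training_loss([np.array([2.0, 1.0, 0.0])], [0],
                     [np.array([True, True, False])], [np.zeros(3)])
```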
To train a linearizer that takes an LSTM language model as features, we first train the LSTM language model on the same training data, then train the linearizer with the parameters of the LSTM language model fixed.

We use ten-fold jackknifing to construct WSJ training data with different parsing accuracies. More specifically, the data is first randomly split into ten equal-size subsets, and each subset is automatically parsed with a constituent parser trained on the other subsets, before the results are finally converted to dependency trees using Penn2Malt. In order to obtain datasets with different parsing accuracies, we randomly sample a small number of sentences from each training subset and choose different training iterations, as shown in Table 4.
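The ten-fold jackknifing splits can be sketched as follows; the parser-training and parsing steps are elided, and only the data partitioning is shown:

```python
import random

def jackknife_splits(sentences, k=10, seed=0):
    # Randomly split the data into k equal-size subsets; each subset is
    # then parsed by a parser trained on the other k-1 subsets.
    idx = list(range(len(sentences)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, fold in enumerate(folds):
        train = [sentences[j] for f in folds[:i] + folds[i + 1:] for j in f]
        held_out = [sentences[j] for j in fold]
        yield train, held_out

splits = list(jackknife_splits([f"sent{i}" for i in range(100)], k=10))
```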
In our experiments, we use ZPar (Zhu et al., 2013) for automatic constituent parsing. Our syntactic linearizer is implemented with Keras. We randomly initialize E^w, E^t, E^l, W_1, and W_2 within (−0.01, 0.01), and use default settings for the other parameters. The hyper-parameters and parameters that achieve the best performance on the development set are chosen for final evaluation. Our vocabulary comes from SENNA, which has 130,000 words. The activation functions tanh and softmax are applied on top of the hidden and output layers, respectively. We use Adagrad (Duchi et al., 2011) with an initial learning rate of 0.01, regularization parameter λ = 10^−8, and dropout rate 0.3 for training. The interpolation coefficient α for joint decoding is set to 0.4. During decoding, simple pruning methods are applied, such as the constraint that a POS-p action always follows a SHIFT-w action.
We evaluate our linearizer (SYN) and its variants, where the subscript "l" denotes the light version, "+LSTM" denotes joint decoding with an LSTM language model, and "×LSTM" denotes taking an LSTM language model as features in our model. We compare results with the current state of the art: the LSTM language model (LSTM) of Schmaltz et al. (2016), which is similar in size and architecture to the medium LSTM setup of Zaremba et al. (2014). None of the systems use the future-cost heuristic. All experiments are conducted on a Tesla K20Xm GPU.

Tuning
We show some development results in this section. First, the cube activation function (Chen and Manning, 2014) does not yield good performance on our task. We tried other activations, including linear, tanh, and ReLU (Nair and Hinton, 2010), and tanh gives the best results. In addition, we tried pretrained embeddings from SENNA, which do not yield better results compared with random initialization. Further, dropout rates from 0.3 to 0.8 give good training results. Finally, we tried different values for the interpolation coefficient α, finding that values between 0.3 and 0.7 give the best performance, while values larger than 1.5 yield poor performance.

Main results
The main results on the test set are shown in Table 3. Compared with previous work, our linearizers achieve the best results under all beam sizes, especially under the greedy search scenario (BEAMSIZE=1), where SYN and SYN×LSTM outperform the LSTM baseline by 7 and 11 BLEU points, respectively. This demonstrates that syntactic information is extremely important when the beam size is small. In addition, our syntactic systems are still better than the baseline under very large beam sizes (such as BEAMSIZE=512), which are slow and less useful practically. On the other hand, the baseline (LSTM) benefits more from beam size increases. The results are consistent with Ma et al. (2014) in that both increasing the beam size and using richer features are remedies for error propagation.

Table 5 (excerpt): Example outputs of LSTM and SYN_l×LSTM with beam size 512, with reference sentences (REF).

LSTM-512: shearson lehman hutton inc. said , however , that it is " going to set back with the customers , " because of friday 's plunge , president of jeffrey b. lane concern " reinforces volatility relations .
SYN_l×LSTM-512: however , jeffrey b. lane , president of shearson lehman hutton inc. , said that friday 's plunge is " going to set back with customers because it reinforces the volatility of " concern , " relations .
REF: however , jeffrey b. lane , president of shearson lehman hutton inc. , said that friday 's plunge is " going to set back " relations with customers , " because it reinforces the concern of volatility .

LSTM-512: the debate between the stock and futures markets is prepared for wall street will cause another situation about whether de-linkage crash undoubtedly properly renewed friday .
SYN_l×LSTM-512: the wall street futures markets undoubtedly will cause renewed debate about whether the stock situation is properly prepared for another crash between friday and de-linkage .
REF: the de-linkage between the stock and futures markets friday will undoubtedly cause renewed debate about whether wall street is properly prepared for another crash situation .
SYN×LSTM is better than SYN+LSTM. In fact, SYN×LSTM can be regarded as interpolation with α calculated automatically under different states. Finally, SYN_l×LSTM is better than SYN×LSTM except under greedy search, showing that word-to-word dependency features may be sufficient for this task.
As for decoding times, SYN_l×LSTM shows moderate time growth as the beam size increases, and is roughly 1.5 times slower than LSTM. In addition, SYN+LSTM and SYN×LSTM are the slowest for each beam size (roughly 3 times slower than LSTM), because of the large number of features they use and the large number of decoding steps they take. SYN is roughly 2 times slower than LSTM.
Previous work, such as Schmaltz et al. (2016), adopts a future-cost heuristic and base noun phrase (BNP) information, showing further improvements in performance. However, these are highly task-specific. Future cost is based on the assumption that all words are available at the beginning, which does not hold for other tasks. Our model does not rely on this assumption and is thus more applicable to other tasks. BNPs are the phrases that correspond to leaf NP nodes in constituent trees; assuming that BNPs are available is not practical either.

Influence of sentence length
We show the performance on different sentence lengths in Figure 4. The results are from LSTM and SYN_l×LSTM using beam sizes 1 and 512. Sentences in the same length range (such as 1-10 or 11-15) are grouped together, and corpus-level BLEU is calculated for each group. First of all, SYN_l×LSTM-1 is significantly better than LSTM-1 on all sentence lengths, demonstrating the usefulness of syntactic features. In addition, SYN_l×LSTM-512 is notably better than LSTM-512 on sentences longer than 25 words, and the difference is even larger on sentences with more than 35 words. This is evidence that SYN_l×LSTM is better at modeling long-distance dependencies. On the other hand, LSTM-512 is better than SYN_l×LSTM-512 on short sentences (length ≤ 10). The reason may be that LSTM is good at modeling relatively short dependencies without syntactic guidance, while SYN_l×LSTM, which takes more steps to synthesize the same sentence, suffers from error propagation. Overall, this figure can be regarded as empirical evidence that syntactic systems are better choices for generating long sentences (Wan et al., 2009; Zhang and Clark, 2011), while surface systems may be better choices for generating short sentences. Table 5 shows some linearization results for long sentences from LSTM and SYN_l×LSTM using beam size 512. The outputs of SYN_l×LSTM are notably more grammatical than those of LSTM. For example, in the last group, the output of SYN_l×LSTM means "the market will cause another debate about whether the situation now is prepared for another crash", while the output of LSTM is obviously less fluent, especially the parts "... markets is prepared for wall street will cause ..." and "... crash undoubtedly properly renewed ...".
In addition, LSTM makes locally grammatical outputs while making more mistakes at the global level. Taking the second group as an example, LSTM generates grammatical phrases, such as "going to set back with the customers" and "because of friday 's plunge", while misplacing "president of", which should appear near the front of the sentence. On the other hand, SYN_l×LSTM can capture patterns such as "president of some inc." and "someone, president of someplace, said" to make the right choices. Finally, SYN_l×LSTM can make grammatical sentences with different meanings. For example, in the first group, the result of SYN_l×LSTM means "the bush administration will extend the steel agreement", while the true meaning is "the bush administration will extend the steel quotas". For syntactic linearization, such semantic variation is tolerable.

Results with auto-parsed data
There is no syntactically annotated data in many domains. As a result, performing syntactic linearization in these domains requires automatically parsed training data, which may affect the performance of our syntactic linearizer. We study this effect by training both SYN×LSTM and SYN_l×LSTM on automatically parsed training data of different parsing accuracies, and show the results, generated with beam size 64 on the devset, in Table 6. Generally, a higher parsing accuracy leads to a better linearization result for both systems. This conforms to the intuition that syntactic quality affects the fluency of surface texts. On the other hand, the influence is not large: the BLEU scores of SYN_l×LSTM and SYN×LSTM drop by only 1.5 and 2.8 points, respectively, as the parsing accuracy decreases from gold to 54%. Both observations are consistent with previous findings for discrete syntactic linearization. Finally, SYN_l×LSTM shows a smaller BLEU decrease than SYN×LSTM. The reason is that SYN_l×LSTM takes only word features, and is thus less vulnerable to decreases in parsing accuracy.

Table 7: Example SHIFT-w actions and their top similar actions.
S-wednesday: S-tuesday, S-friday, S-thursday, S-monday
S-huge: S-strong, S-serious, S-good, S-large
S-taxes: S-bills, S-expenses, S-loans, S-payments
S-secretary: S-department, S-officials, S-director
S-largely: S-partly, S-primarily, S-mostly, S-entirely

Embedding similarity
One main advantage of neural systems is that they use vectorized features, which are less sparse than discrete features. Taking W_2 as the embedding matrix of actions, we calculate the top similar actions for the SHIFT-w actions by cosine similarity and show examples in Table 7. In addition, Figure 5 presents the t-SNE visualization (Maaten and Hinton, 2008) of the embeddings of the POS-p actions. Generally, the embeddings of similar actions are closer than those of other actions. From both results, we can see that our model learns reasonable embeddings from the Penn Treebank, a small-scale corpus, which shows the effectiveness of our system from another perspective.
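Ranking actions by cosine similarity over the rows of W_2 can be sketched as follows; the toy embeddings are invented to illustrate the weekday-shift clustering, not taken from the trained model:

```python
import numpy as np

def top_similar_actions(W2, action_names, query, k=4):
    # Treat the rows of W2 as action embeddings and rank the other
    # actions by cosine similarity to the query action's embedding.
    q = W2[action_names.index(query)]
    sims = W2 @ q / (np.linalg.norm(W2, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)
    return [action_names[i] for i in order if action_names[i] != query][:k]

# Hypothetical 2-d embeddings: weekday shifts cluster together.
names = ["S-wednesday", "S-tuesday", "S-friday", "S-huge"]
W2 = np.array([[1.0, 0.1], [0.9, 0.2], [0.95, 0.05], [-0.8, 1.0]])
near = top_similar_actions(W2, names, "S-wednesday", k=2)
```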

Conclusion
We studied neural transition-based syntactic linearization, which combines the advantages of neural networks and syntactic information. In addition, we compared two ways of integrating a neural language model into our system. Experimental results show that our system achieves improved results compared with a state-of-the-art multi-layer LSTM language model. To our knowledge, we are the first to investigate neural syntactic linearization.
In future work, we will investigate LSTMs for this task. In particular, an LSTM decoder, taking features from the already-built subtrees as part of its input, can be used to model the sequence of shift-reduce actions. Another possible direction is creating complete graphs whose nodes are the input words, before encoding them with self-attention networks (Vaswani et al., 2017) or graph neural networks (Kipf and Welling, 2016; Beck et al., 2018). This approach may be better at capturing word-to-word dependencies than simply summing up word embeddings.