An Empirical Comparison Between N-gram and Syntactic Language Models for Word Ordering

Syntactic language models and N-gram language models have both been used in word ordering. In this paper, we give an empirical comparison between N-gram and syntactic language models on the word ordering task. Our results show that the quality of automatically-parsed training data has a relatively small impact on syntactic models. Both syntactic and N-gram models can benefit from large-scale raw text. Compared with N-gram models, syntactic models give overall better performance, but they require much more training time. In addition, the two models lead to different error distributions in word ordering. A combination of the two models integrates the advantages of each model, achieving the best result in a standard benchmark.


Introduction
N-gram language models have been used in a wide range of generation tasks, such as machine translation (Koehn et al., 2003; Chiang, 2007; Galley et al., 2004), text summarization (Barzilay and McKeown, 2005) and realization (Guo et al., 2011). Such models are trained from large-scale raw text, capturing distributions of local word N-grams, which can be used to improve the fluency of synthesized text.
More recently, syntactic language models have been used as a complement or alternative to N-gram language models for machine translation (Charniak et al., 2003; Shen et al., 2008; Schwartz et al., 2011), syntactic analysis (Chen et al., 2012) and tree linearization. Compared with N-gram models, syntactic models capture rich structural information, and can be more effective in improving the fluency of large constituents, long-range dependencies and overall sentential grammaticality. However, syntactic models require annotated syntactic structures for training, which are expensive to obtain manually. In addition, they can be slower than N-gram models.
In this paper, we make an empirical comparison between syntactic and N-gram language models on the task of word ordering (Wan et al., 2009; Zhang and Clark, 2011a; De Gispert et al., 2014), which is to order a set of input words into a grammatical and fluent sentence. The task can be regarded as an abstract language modeling problem, although methods have been explored that extend it to tree linearization (Zhang, 2013), broader text generation and machine translation.
We choose the model of Liu et al. (2015) as the syntactic language model. There have been two main types of syntactic language models in the literature. The first is relatively more oriented to syntactic structure, without an explicit emphasis on word order (Shen et al., 2008; Chen et al., 2012). As a result, this type of syntactic language model is typically used jointly with an N-gram model for text-to-text tasks. The second type models syntactic structures incrementally, and can therefore be used to score surface orders directly (Schwartz et al., 2011; Liu et al., 2015). We choose the discriminative model of Liu et al. (2015), which gives state-of-the-art results for word ordering.
We try to answer the following research questions by comparing the syntactic model and the N-gram model using the same search algorithm.
• What is the influence of automatically-parsed training data on the performance of syntactic models?
Because manual syntactic annotations are relatively limited and highly expensive, it is necessary to use large-scale automatically-parsed sentences for training syntactic language models. As a result, the syntactic structures that a word ordering system learns can be inaccurate. However, this might not affect the quality of the synthesized output, which is a string only. We quantitatively study the influence of the parsing accuracy of syntactic training data on word ordering output.
• What is the influence of data scale on performance? N-gram language models can be trained efficiently over large numbers of raw sentences. In contrast, syntactic language models can be much slower to train due to rich features. We compare the output quality of the two models on different scales of training data, and also on different amounts of training time.
• What are the error characteristics of each model? Syntactic language models can potentially be better at capturing larger constituents and overall sentence structures. However, little work has been done to quantify the difference between syntactic and N-gram models. We characterise the outputs using a set of different measures, and show empirically the relative strengths and weaknesses of each model.
• What is the effect of model combination? Finally, because the two models make different types of errors, they can be combined to give better outputs. We develop a combined model by discretizing probabilities from the N-gram model and using them as features in the syntactic model. The combined model gives the best results in a standard benchmark.

Syntactic word ordering
Syntactic word ordering algorithms take a multiset of input words and construct an output sentence and its syntactic derivation simultaneously. Transition-based syntactic word ordering can be modelled as an extension to transition-based parsing (Liu et al., 2015), with the main difference being that the order of words is not given in the input, which leads to a much larger search space. We take the system of Liu et al. (2015), which gives state-of-the-art performance and efficiency on the standard word ordering benchmark. It maintains outputs in a stack σ, and keeps the unprocessed incoming words in a set ρ. Given an input bag of words, ρ is initialized to the input and σ is initialized as empty. The system repeatedly applies transition actions to consume words from ρ and construct output on σ. Figure 1 shows the deduction system, where ρ is unordered and any word in ρ can be shifted onto the stack σ. The actions are SHIFT, L-ARC and R-ARC. The SHIFT action adds a word to the stack. For the L-ARC and R-ARC actions, new arcs {j ← i} and {j → i} are constructed, respectively. Under these actions, the unordered word set {"potatoes_0", "Tom_1", "likes_2"} is ordered as shown in Figure 2, and the result is "Tom_1 ← likes_2 → potatoes_0".
Figure 2: Transition-based process for ordering {"potatoes_0", "Tom_1", "likes_2"}.
We apply the learning and search framework of Zhang and Clark (2011a). Pseudocode of the search algorithm is shown in Algorithm 1. [] refers to an empty stack, set(1...n) represents the full set of input words W, and n is the number of distinct words. candidates stores possible states, and agenda stores temporary states reached by applying possible actions. GETACTIONS generates the set of possible actions for the current state s. APPLY generates a new state by applying an action to the current state s. N-BEST produces the top k candidates in agenda. Finally, the algorithm returns the highest-scoring state, best, in the agenda.
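The search procedure described above can be illustrated with a small Python sketch. It is not the authors' implementation: the names `State`, `get_actions`, `apply_action`, `beam_search` and `toy_score`, the arc-standard reading of L-ARC and R-ARC, the fixed number of 2n − 1 steps, and the toy weights are all assumptions made for illustration. A real system would score actions with the trained linear model rather than a hand-written table.

```python
from typing import Callable, FrozenSet, List, NamedTuple, Sequence, Tuple

class State(NamedTuple):
    stack: Tuple[int, ...]             # sigma: roots of partial trees (word indices)
    remaining: FrozenSet[int]          # rho: unordered, unprocessed word indices
    order: Tuple[int, ...]             # word indices in the order they were shifted
    arcs: Tuple[Tuple[int, int], ...]  # (head, dependent) pairs built so far
    score: float

Action = Tuple[str, int]  # ("SHIFT", i), ("L-ARC", -1) or ("R-ARC", -1)

def get_actions(s: State) -> List[Action]:
    """Any remaining word may be shifted; arcs need two items on the stack."""
    actions: List[Action] = [("SHIFT", i) for i in sorted(s.remaining)]
    if len(s.stack) >= 2:
        actions += [("L-ARC", -1), ("R-ARC", -1)]
    return actions

def apply_action(s: State, a: Action, delta: float) -> State:
    kind, i = a
    if kind == "SHIFT":
        return State(s.stack + (i,), s.remaining - {i}, s.order + (i,),
                     s.arcs, s.score + delta)
    s1, s0 = s.stack[-2], s.stack[-1]
    if kind == "L-ARC":   # s1 <- s0: the stack top becomes the head
        return State(s.stack[:-2] + (s0,), s.remaining, s.order,
                     s.arcs + ((s0, s1),), s.score + delta)
    # R-ARC, s1 -> s0: the second item becomes the head
    return State(s.stack[:-2] + (s1,), s.remaining, s.order,
                 s.arcs + ((s1, s0),), s.score + delta)

def beam_search(words: Sequence[str],
                score_action: Callable[[Sequence[str], State, Action], float],
                beam_size: int = 4) -> State:
    n = len(words)
    candidates = [State((), frozenset(range(n)), (), (), 0.0)]
    for _ in range(2 * n - 1):  # n SHIFTs plus n - 1 arc actions
        agenda = [apply_action(s, a, score_action(words, s, a))
                  for s in candidates for a in get_actions(s)]
        candidates = sorted(agenda, key=lambda t: t.score, reverse=True)[:beam_size]
    return candidates[0]

# Hypothetical hand-set weights standing in for a trained model.
BIGRAM = {("<s>", "Tom"): 1.0, ("Tom", "likes"): 1.0, ("likes", "potatoes"): 1.0}
ARC = {("likes", "Tom"): 1.0, ("likes", "potatoes"): 1.0}

def toy_score(words: Sequence[str], s: State, a: Action) -> float:
    kind, i = a
    if kind == "SHIFT":
        prev = words[s.order[-1]] if s.order else "<s>"
        return BIGRAM.get((prev, words[i]), 0.0)
    s1, s0 = words[s.stack[-2]], words[s.stack[-1]]
    head, dep = (s0, s1) if kind == "L-ARC" else (s1, s0)
    return ARC.get((head, dep), -1.0)

best = beam_search(["potatoes", "Tom", "likes"], toy_score)
print(" ".join(["potatoes", "Tom", "likes"][i] for i in best.order))  # Tom likes potatoes
```

Words are tracked by index rather than by string so that repeated words in the input bag remain distinct items.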
A global linear model is used to score search hypotheses. Given a hypothesis h, its score is calculated by:
$$\mathrm{Score}(h) = \Phi(h) \cdot \theta$$
where Φ(h) is the feature vector of h, extracted using the same feature templates as Liu et al. (2015), which are shown in Table 1, and θ is the parameter vector of the model. The feature templates essentially represent a syntactic language model. As shown in Figure 2, from the hypotheses produced in steps 2 and 4, the features "Tom_1 ← likes_2" and "likes_2 → potatoes_0" are extracted, which correspond to P(Tom_1 | likes_2) and P(potatoes_0 | likes_2), respectively, in the dependency language model of Chen et al. (2012).
Training. We apply the perceptron with early update (Collins and Roark, 2004), and iteratively tune the parameters on a set of development data. After each iteration, we measure the performance on the development data, and choose the best parameters for the final tests.
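As a rough illustration of the linear scoring and perceptron training described above, the sketch below computes a sparse dot product between a feature bag and a weight table and applies one perceptron update. The function names, the string form of the features and the learning rate are assumptions for illustration only; the early-update logic (stopping decoding as soon as the gold hypothesis falls out of the beam and updating on the partial hypotheses) is left out and only the update itself is shown.

```python
from collections import Counter
from typing import Dict, Iterable

def score(features: Iterable[str], theta: Dict[str, float]) -> float:
    """Sparse dot product between the feature counts Phi(h) and theta."""
    phi = Counter(features)
    return sum(theta.get(f, 0.0) * count for f, count in phi.items())

def perceptron_update(theta: Dict[str, float],
                      gold_features: Iterable[str],
                      predicted_features: Iterable[str],
                      lr: float = 1.0) -> None:
    """theta += lr * (Phi(gold) - Phi(predicted)), applied in place."""
    for f, count in Counter(gold_features).items():
        theta[f] = theta.get(f, 0.0) + lr * count
    for f, count in Counter(predicted_features).items():
        theta[f] = theta.get(f, 0.0) - lr * count

# Hypothetical feature strings in the spirit of the arc features in the text.
theta: Dict[str, float] = {}
gold = ["Tom<-likes", "likes->potatoes"]
predicted = ["potatoes<-likes", "likes->potatoes"]
perceptron_update(theta, gold, predicted)
print(score(gold, theta), score(predicted, theta))  # 1.0 -1.0
```

Features shared by the gold and predicted hypotheses cancel out, so only the features on which the two disagree change weight.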

N-gram word ordering
We build an N-gram word ordering system under the same beam-search framework as the syntactic word ordering system. In particular, search is performed incrementally, from left to right, adding one word at each step. The decoding process can be regarded as a simplified version of Algorithm 1, with only SHIFT being returned by GETACTIONS, and with the score of each transition given by a standard N-gram language model. We use the same beam size for both the N-gram and the syntactic word ordering systems. Compared with the syntactic model, the N-gram model has less information for disambiguation, but also has fewer structural ambiguities, and therefore a smaller search space.
Table 1 (excerpt, unigram feature templates): S0w; S0p; S0,lw; S0,lp; S0,rw; S0,rp; S0,l2w; S0,l2p; S0,r2w; S0,r2p; S1w; S1p; S1,lw; S1,lp; S1,rw; S1,rp; S1,l2w; S1,l2p; S1,r2w; S1,r2p;
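The N-gram decoding described in this section, a left-to-right beam search in which the only action is to append one unused word, can be sketched as follows. The function `ngram_order`, the `logprob(context, word)` interface and the toy bigram table are hypothetical stand-ins; a real system would query a trained N-gram language model instead.

```python
import math
from typing import Callable, FrozenSet, List, Sequence, Tuple

Hyp = Tuple[float, Tuple[int, ...], FrozenSet[int]]  # (score, ordered indices, unused indices)

def ngram_order(words: Sequence[str],
                logprob: Callable[[Tuple[str, ...], str], float],
                beam_size: int = 8,
                order: int = 2) -> List[str]:
    """Order a bag of words left to right, scoring each appended word with an N-gram model."""
    beam: List[Hyp] = [(0.0, (), frozenset(range(len(words))))]
    for _ in range(len(words)):
        agenda: List[Hyp] = []
        for sc, seq, rest in beam:
            # The context is the last (order - 1) words already placed.
            context = tuple(words[j] for j in seq[-(order - 1):]) if order > 1 else ()
            for i in sorted(rest):  # the only action type is SHIFT
                agenda.append((sc + logprob(context, words[i]), seq + (i,), rest - {i}))
        agenda.sort(key=lambda h: h[0], reverse=True)
        beam = agenda[:beam_size]
    return [words[i] for i in beam[0][1]]

# Toy bigram probabilities, made up for the example; unseen events get a small floor.
TOY = {((), "Tom"): 0.6, (("Tom",), "likes"): 0.7, (("likes",), "potatoes"): 0.8}

def toy_logprob(context: Tuple[str, ...], word: str) -> float:
    return math.log(TOY.get((context, word), 0.01))

print(ngram_order(["potatoes", "Tom", "likes"], toy_logprob))  # ['Tom', 'likes', 'potatoes']
```

Because every hypothesis at a given step has placed the same number of words, hypotheses in the beam are directly comparable without length normalization.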

Table 3: Example sentences from the test domains.
News: But after rising steadily during the quarter-century following World War II , wages have stagnated since the manufacturing sector began to contract .
Blog: The freaky thing here is that these bozos are seriously claiming the moral high ground ?

[...], 1993), and the Agence France-Presse (AFP) and Xinhua News Agency (XIN) subsets of the English Giga Word Fifth Edition (Parker et al., 2011). As development data, we use WSJ section 0 for parameter tuning. For testing, we use data from various domains, consisting of WSJ section 23, the Washington Post/Bloomberg (WPB) subset of the English Giga Word Fifth Edition, and SANCL blog data, as shown in Table 2. Example sentences from the test domains are shown in Table 3.

Evaluation metrics
We follow previous work and use the BLEU metric (Papineni et al., 2002) for evaluation. Since BLEU only scores N-gram precisions, it can favour N-gram language models. We additionally use METEOR (Denkowski and Lavie, 2010) to evaluate system performance. The BLEU metric measures the fluency of generated sentences without considering long-range ordering. The METEOR metric can potentially fix this problem by using a set of mappings between generated sentences and references to evaluate distortion. The following example illustrates the difference between BLEU and METEOR on long-range reordering, where the reference is ( [...] METEOR gives a score of 61.34 out of 100. This is because METEOR is based on explicit word-to-word matches over the whole sentence. For word ordering, word-to-word matches are unique, which facilitates METEOR evaluation between generated sentences and references. As can be seen from the example, long-range distortion can strongly influence METEOR scores, making the METEOR metric more suitable for evaluating word ordering distortions.

Data preparation
For all the experiments, we assume that the input is a bag of words without order, and the output is a fully ordered sentence. Following previous work (Wan et al., 2009; Zhang, 2013; Liu et al., 2015), we treat base noun phrases (i.e. noun phrases that do not contain other noun phrases, such as 'Pierre Vinken' and 'a big cat') as single words. This avoids unnecessary ambiguities in combining their subcomponents. The syntactic model requires that the training sentences have syntactic dependency structures. However, only the WSJ data contains gold-standard annotations. In order to obtain automatically annotated dependency trees, we train a constituent parser using the gold-standard bracketed sentences from WSJ, and automatically parse the Giga Word data. The results are turned into dependency trees using Penn2Malt, after base noun phrases are extracted. In our experiments, we use ZPar (Zhu et al., 2013) for automatic constituent parsing.
In order to study the influence of the parsing accuracy of the training data, we also use ten-fold jackknifing to construct WSJ training data with different accuracies. The data is randomly split into ten equal-size subsets, and each subset is automatically parsed with a parser trained on the other nine subsets. In order to obtain datasets with different parsing accuracies, we randomly sample a small number of sentences from each training subset, as shown in Table 4. The dependency trees of each set are derived from these bracketed sentences using Penn2Malt, after each base noun phrase is extracted as a single word.
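The ten-fold jackknifing step can be sketched in a few lines of Python. This is only an illustration of the general technique: `jackknife` and the `train_parser` callback (which returns a parsing function) are hypothetical names, the fold assignment by striding over a shuffled index list is an assumption, and the further subsampling used to vary parser accuracy is not shown.

```python
import random
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # a sentence, in whatever representation the parser expects
P = TypeVar("P")  # a parse

def jackknife(sentences: Sequence[S],
              train_parser: Callable[[List[S]], Callable[[S], P]],
              k: int = 10,
              seed: int = 0) -> List[P]:
    """Parse every sentence with a parser that never saw it during training."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]  # k folds whose sizes differ by at most one
    parsed: List[P] = [None] * len(sentences)  # type: ignore[list-item]
    for f, held_out in enumerate(folds):
        # Train on the other k - 1 folds, then parse the held-out fold.
        train = [sentences[i] for g, fold in enumerate(folds) if g != f for i in fold]
        parser = train_parser(train)
        for i in held_out:
            parsed[i] = parser(sentences[i])
    return parsed
```

The result is aligned with the input order, so each automatically parsed sentence can be paired with its gold-standard tree when measuring parsing accuracy.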

Influence of parsing accuracy

In-domain word ordering
We train the syntactic models on the parsed WSJ training data with different accuracies. The WSJ development data are used to find the optimal number of training iterations for each experiment, and the WSJ test results are shown in Table 5. Table 5 shows that the parsing accuracy can affect the performance of the syntactic model. A higher parsing accuracy can lead to a better syntactic language model. This conforms to the intuition that syntactic quality affects the fluency of surface texts. On the other hand, the influence is not large: the BLEU scores decrease by 1.0 points as the parsing accuracy decreases from 88.10% to 57.31%.

Cross-domain word ordering
The influence of the parsing accuracy of the training data on cross-domain word ordering is measured by using the same training settings, but testing on the WPB and SANCL test sets. Table 5 shows that, with the syntactic models, the performance on cross-domain word ordering cannot reach that of in-domain word ordering. Compared with the cross-domain experiments, the influence of parsing accuracy becomes smaller. In the WPB test, the fluctuation in performance declines to about 0.9 BLEU points, and in the SANCL test, the fluctuation is about 1.1 BLEU points.
In conclusion, the experiments show that parsing accuracies have a relatively small influence on the syntactic models. This suggests that it is possible to use large automatically-parsed data to train syntactic models. On the other hand, when the training data scale increases, syntactic models can become much slower to train than N-gram models. The influence of data scale, which includes output quality and training time, is further studied in the next section.

Influence of data scale
We use the AFP news data as the training data for the experiments of this section. The syntactic models are trained using automatically-parsed trees derived from ZPar, as described in Section 3.3. The WPB test data is used to measure in-domain performance, and the SANCL blog data is used to measure cross-domain performance.

Influence on BLEU and METEOR
Figures 3 and 4 show that, under both the BLEU and the METEOR metrics, the performance of the syntactic model is better than that of the N-gram model. This suggests that sentences generated by the syntactic model have both better fluency and better ordering. The performance of the syntactic models is not highly weakened in cross-domain tests. The grey dot in each figure shows the performance of the syntactic model trained on the gold WSJ training data, and evaluated on the same WPB and SANCL test data sets. A comparison between the grey dots and the dashed lines shows that the syntactic model trained on the WSJ data performs better than the syntactic model trained on similar amounts of AFP data. This again shows the effect of the syntactic quality of the training data.
On the other hand, as the scale of automatically-parsed AFP data increases, the performance of the syntactic model rapidly increases, surpassing the syntactic model trained on the high-quality WSJ data. This observation is important, showing that large-scale data can be used to alleviate the problem of lower syntactic quality in automatically-parsed data, which can be leveraged to address the scarcity of manually annotated data in both in-domain and cross-domain settings.

Influence on training time
The training time of both syntactic models and N-gram models increases as the size of the training data increases. Figure 5 shows the BLEU of the two systems under different amounts of training time. There is no result reported for the syntactic model beyond 1 million training sentences, because training becomes infeasibly slow. On the other hand, the N-gram model can be trained using all the WSJ, AFP and XIN training sentences, which amount to 53 million sentences, within 10^3.2 seconds. As a result, there is no overlap between the syntactic model and the N-gram model curves.
As can be seen from the figure, the syntactic model is much slower to train. However, it benefits more from the scale of the training data, with the slope of the dashed curve being steeper than that of the solid curve. The N-gram model can be trained with more data thanks to its fast training speed. However, the performance of the N-gram model flattens when the training data size goes beyond 3 million sentences. Projection of the solid curve suggests that the performance of the N-gram model may not surpass that of the syntactic model even if sufficiently large data is available for training the N-gram model in more time.
[...] gold-standard constituent trees, is used to analyze errors in different aspects.

Sentence length
The BLEU and METEOR scores of the two systems on various sentence lengths are shown in Figure 6. The results are measured by binning sentences according to their lengths, so that each bin contains about the same number of sentences. As shown by the figure, the N-gram model performs better on short sentences (fewer than 8 tokens), and the syntactic model performs better on longer sentences. This can be explained by the fact that longer sentences have richer underlying syntactic structures, which can be better captured by the syntactic model. In contrast, for shorter sentences, the syntactic structure is relatively simple, and therefore the N-gram model can give better performance based on string patterns, which form smaller search spaces.
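Binning sentences by length so that each bin holds about the same number of sentences is a form of equal-frequency binning. A minimal sketch follows; the helper name `equal_frequency_bins` and the rank-based assignment are assumptions, and the exact bin boundaries used for Figure 6 are not given in the text.

```python
from typing import List, Sequence

def equal_frequency_bins(lengths: Sequence[int], num_bins: int) -> List[List[int]]:
    """Group sentence indices into num_bins bins of near-equal size, ordered by length."""
    ranked = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bins: List[List[int]] = [[] for _ in range(num_bins)]
    for rank, i in enumerate(ranked):
        # Map the rank proportionally onto a bin index in [0, num_bins).
        bins[rank * num_bins // len(ranked)].append(i)
    return bins

# Sentence lengths in tokens for six hypothetical test sentences.
print(equal_frequency_bins([5, 30, 12, 8, 21, 17], 3))  # [[0, 3], [2, 5], [4, 1]]
```

With this rank-based assignment, sentences of identical length can land in adjacent bins; a per-bin BLEU or METEOR score would then be computed over the sentences each bin contains.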

Distortion range
We measure the average distortion rate of an output word w using the following metric:
$$\mathrm{distortion}(w) = \frac{|i_w - i^{*}_w|}{\mathrm{len}(S_w)}$$
where i_w is the index of word w in the output sentence S_w, i*_w is the index of w in the reference sentence, and len(S_w) is the number of tokens in sentence S_w. Figure 7 shows the distributions of distortion for the syntactic and N-gram models. The N-gram model makes relatively fewer short-range distortions, but more long-range distortions. This can be explained by the local scoring nature of the N-gram model. In contrast, the syntactic model makes fewer long-range distortions, which suggests better sentence structure.
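A per-word distortion rate of this kind can be computed as sketched below. The sketch assumes the metric is the absolute difference between a word's output and reference positions, divided by the sentence length, and it assumes that tokens are unique within a sentence (for example because each input word carries an index). The function name `distortion_rates` is hypothetical.

```python
from typing import Dict, Hashable, Sequence

def distortion_rates(output: Sequence[Hashable],
                     reference: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return |output index - reference index| / sentence length for every token."""
    ref_index = {w: i for i, w in enumerate(reference)}
    if len(ref_index) != len(reference):
        raise ValueError("tokens must be unique, e.g. paired with their input index")
    if len(output) != len(reference) or set(output) != set(ref_index):
        raise ValueError("output must be a permutation of the reference")
    n = len(output)
    return {w: abs(i - ref_index[w]) / n for i, w in enumerate(output)}

rates = distortion_rates(["likes", "Tom", "potatoes"], ["Tom", "likes", "potatoes"])
print(rates["potatoes"])  # 0.0
```

Collecting these rates over a whole test set and plotting their histogram would give a distribution comparable in spirit to the one reported for the two models.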

Constituent span
We further evaluate sentence structure correctness by measuring the recall of discovered constituent spans in the outputs of the two systems. As shown in Figure 8, the syntactic model performs better on most constituent labels. However, the N-gram model performs better on WHPP, SBARQ and WHNP.
In the test data, WHPP, SBARQ and WHNP are much less frequent than PP, NP, VP, ADJP, ADVP and CONJP, on which the syntactic model gives better recall. WHNP spans are small, and most of them consist of a question word (WP$) and one or two nouns (e.g. "whose (WP$) parents (NNS)"). WHPP spans are also small and usually consist of a preposition (IN) and a WHNP span (e.g. "at (IN) what level (WHNP)"). The N-gram model performs better on these small spans. The syntactic model also performs better on S, which covers the whole sentence structure. This supports the hypothesis stated in the introduction that syntactic language models better capture overall sentential grammaticality.

Combining the syntactic and N-gram models
The results above show the respective error characteristics of each model, which are complementary. This suggests that better results can be achieved by model combination.

N-gram language model feature
We integrate the two types of models by using N-gram language model (NLM) probabilities as features in the syntactic model. N-gram language model probabilities range from 0 to 1. Direct use of real-valued probabilities as features does not work well in our experiments, so we use discretized features instead. For the L-ARC and R-ARC actions, because no words are pushed onto the stack, the NLM feature is set to NULL by default. For the SHIFT action, different feature values are extracted depending on the NLM probability. In order to measure the N-gram probabilities on our data, we train a 4-gram language model on the WSJ, AFP and XIN data, and randomly sample 4-gram probabilities from the syntactic model output on the WSJ development data, finding that most of the 4-gram probabilities p are larger than 10^-12.5. Accordingly, if p is lower than 10^-12.5, the NLM feature value is set to LOW. For p larger than 10^-12.5, we extract discrete features by assigning the probabilities to different bins. We bin the 4-gram probabilities at different granularities, with no overlap between features. As shown in Table 6, NLM-20, NLM-10, NLM-5 and NLM-2 respectively use 20, 10, 5 [...] Table 9 [...]
In addition, Table 7 shows that the N-gram model is the fastest among the models due to its small search space. The running time of the combined system is longer than that of the pure syntactic system, because of the N-gram probability computation. Table 8 compares our results with previous methods on word ordering. Our combined model gives the best reported performance on this standard benchmark.
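The discretization of N-gram probabilities into NLM feature values might look like the sketch below. The LOW threshold of 10^-12.5 and the NULL value for arc actions come from the text; the equal-width binning in log10 space, the feature string format and the function name `nlm_feature` are assumptions, since the actual bin boundaries are given in a table that is not reproduced here.

```python
import math

LOG10_FLOOR = -12.5  # probabilities below 10^-12.5 map to LOW, as stated in the text

def nlm_feature(p: float, num_bins: int, action: str = "SHIFT") -> str:
    """Map a 4-gram probability to a discrete feature value for a given action."""
    if action != "SHIFT":
        return "NULL"  # L-ARC and R-ARC push no word, so there is nothing to score
    if p > 1.0:
        raise ValueError("p must be a probability")
    if p <= 0.0 or math.log10(p) < LOG10_FLOOR:
        return "LOW"
    # Assumption: equal-width bins over log10(p) in [-12.5, 0].
    width = -LOG10_FLOOR / num_bins
    bin_id = min(int((math.log10(p) - LOG10_FLOOR) / width), num_bins - 1)
    return f"NLM-{num_bins}={bin_id}"

print(nlm_feature(1e-13, 20))           # LOW
print(nlm_feature(1e-10, 2))            # NLM-2=0
print(nlm_feature(1.0, 2))              # NLM-2=1
print(nlm_feature(0.5, 5, "L-ARC"))     # NULL
```

Calling the function with several values of `num_bins` for the same probability yields one feature per granularity, which is one way to realise features at different granularities without overlap.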

Conclusion
We empirically compared the strengths and error distributions of syntactic and N-gram language models on word ordering, showing that both can benefit from large-scale raw text. Parsing accuracy has a relatively small impact on the syntactic language model trained on automatically-parsed data, which enables scaling up of training data for syntactic language models. However, as the size of training data increases, syntactic language models can become intolerably slow to train, making them benefit less from the scale of training data than N-gram models.
Syntactic models give better performance than N-gram models, despite being trained with less data. On the other hand, the two models lead to different error distributions in word ordering. As a result, we combined the advantages of both systems by integrating a syntactic model trained with relatively small data and an N-gram model trained with relatively large data. The resulting model gives better performance than both single models and achieves the best reported scores in a standard benchmark for word ordering.
We release our code under GPL at https://github.com/SUTDNLP/ZGen. Future work includes application of the system to text-to-text problems such as machine translation.