A Comparison of Neural Models for Word Ordering

We compare several language models for the word-ordering task and propose a new bag-to-sequence neural model based on attention-based sequence-to-sequence models. We evaluate the model on a large German WMT data set where it significantly outperforms existing models. We also describe a novel search strategy for LM-based word ordering and report results on the English Penn Treebank. Our best model setup outperforms prior work both in terms of speed and quality.


Introduction
Finding the best permutation of a multi-set of words is a non-trivial task due to linguistic aspects such as "syntactic structure, selective restrictions, subcategorization, and discourse considerations" (Elman, 1990). This makes the word-ordering task useful for studying and comparing different kinds of models that produce text in tasks such as general natural language generation (Reiter and Dale, 1997), image caption generation (Xu et al., 2015), or machine translation (Bahdanau et al., 2015). Since plausible word order is an essential criterion of output fluency for all of these tasks, progress on the word-ordering problem is likely to have a positive impact on them as well. Word ordering has often been addressed as syntactic linearization, a strategy that uses syntactic structures or part-of-speech and dependency labels (Zhang and Clark, 2011; Zhang et al., 2012; Zhang and Clark, 2015; Liu et al., 2015; Puduppully et al., 2016). It has also been addressed as LM-based linearization, which relies solely on language models and obtains better scores (de Gispert et al., 2014; Schmaltz et al., 2016). Recently, Schmaltz et al. (2016) showed that recurrent neural network language models (Mikolov et al., 2010, RNNLMs) with long short-term memory (Hochreiter and Schmidhuber, 1997, LSTM) cells are very effective for word ordering even without any explicit syntactic information.
We continue this line of work and make the following contributions. We compare several language models on the word-ordering task and propose a bag-to-sequence neural architecture that equips an LSTM decoder with explicit context of the bag-of-words (BOW) to be ordered. This model performs particularly strongly on WMT data and is complementary to an RNNLM: combining both yields large BLEU gains even for small beam sizes. We also propose a novel search strategy which outperforms a previous heuristic. Both techniques together surpass prior work on the Penn Treebank at ∼4x the speed.

arXiv:1708.01809v1 [cs.CL] 5 Aug 2017

Bag-to-Sequence Modeling with Attentional Neural Networks
Given the BOW {at, bottom, heap, now, of, the, the, we, 're, .}, a word-ordering model may generate an output string w = "now we 're at the bottom of the heap .". We can use an RNNLM (Mikolov et al., 2010) to assign a probability $P(w)$ to a sentence $w = w_1^T$ by decomposing into conditionals:

$$P(w) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1}) \qquad (1)$$

Since we have access to the input BOWs, we extend the model representation by providing the network additionally with the BOW to be ordered, thereby allowing it to focus explicitly on all tokens it generates in the output during decoding. Thus, instead of modeling the a priori distribution of sentences $P(w)$ as in Eq. 1, we condition the distribution on BOW(w):

$$P(w \mid \mathrm{BOW}(w)) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1}, \mathrm{BOW}(w)) \qquad (2)$$

This dependency is realized by the neural attention mechanism recently proposed by Bahdanau et al. (2015). The resulting bag-to-sequence model (bag2seq) is inspired by the attentional sequence-to-sequence model RNNSEARCH (seq2seq) proposed by Bahdanau et al. (2015) for neural machine translation between a source sentence $x = x_1^I$ and a target sentence $y = y_1^J$. Fig. 1a illustrates how seq2seq generates the $j$-th target token $y_j$ using the decoder state $s_j$ and the context vector $c_j$. The context vector is the weighted sum of source-side annotations $h_i$, which encode sequence information.
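To make the weighted sum behind the context vector concrete, the following is a minimal sketch of one attention step in plain Python. It is an illustration under stated assumptions rather than the model used in our experiments: the function names, the toy annotation values and the decoder state are hypothetical, and a dot product stands in for the learned alignment network of Bahdanau et al. (2015).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equally long lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def context_vector(state, annotations):
    """Return (weights, context) for one decoder step.

    Assumption: dot-product scoring replaces the learned feed-forward
    alignment model; only the weighted-sum structure is illustrated.
    """
    weights = softmax([dot(state, h) for h in annotations])
    dim = len(annotations[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, annotations))
        for k in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional annotations h_i and decoder state s_j.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [0.5, -0.5]
weights, context = context_vector(state, annotations)
```

The weights sum to one, so the context vector is a convex combination of the annotations. Because each annotation is scored independently of its position, reordering the annotations leaves the context vector unchanged up to floating-point rounding.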
To modify seq2seq for problems with unordered input, we make the encoder architecture order-invariant by replacing the recurrent layer with non-recurrent transformations of the word embeddings, as indicated by the missing arrows between source positions in Fig. 1b. For convenience, we formalize BOW(w) as a sequence $\bar{w}_1, \dots, \bar{w}_T$ in which words are sorted, e.g. alphabetically, so that we can refer to the $t$-th word in the BOW. The model can be trained to recover word order in a sentence by using $\mathrm{BOW}(w) = \bar{w}_1, \dots, \bar{w}_T$ as input and the original sequence $w_1, \dots, w_T$ as target. This network architecture does not prevent words outside the BOW from appearing in the output. Therefore, we explicitly constrain our beam decoder by limiting its available output vocabulary to the remaining tokens in the input bag at each time step, thereby ensuring that all model outputs are valid permutations of the input.
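The vocabulary constraint on the decoder can be sketched with multiset arithmetic. The helper name `allowed_next_tokens` is hypothetical and the snippet only illustrates the bookkeeping, not our decoder implementation.

```python
from collections import Counter


def allowed_next_tokens(bag, prefix):
    """Tokens the decoder may emit next: the bag minus what the prefix used."""
    remaining = Counter(bag)
    remaining.subtract(Counter(prefix))
    if any(count < 0 for count in remaining.values()):
        raise ValueError("prefix uses a token more often than the bag provides")
    # Keep each token type that still has at least one copy left.
    return sorted(tok for tok, count in remaining.items() if count > 0)


bag = ["at", "bottom", "heap", "now", "of", "the", "the", "we", "'re", "."]
prefix = ["now", "we", "'re", "at", "the"]
print(allowed_next_tokens(bag, prefix))
```

Since "the" occurs twice in the bag and only once in the prefix, it remains available. Once the prefix has consumed the whole bag the list is empty, so every complete hypothesis is a permutation of the input.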

Search
Beam search is a popular decoding algorithm for neural sequence models (Sutskever et al., 2014; Bahdanau et al., 2015). However, standard beam search suffers from search errors when applied to word ordering, and Schmaltz et al. (2016) reported that gains often do not saturate even with a large beam of 512. They suggested adding external unigram probabilities of the remaining words in the BOW as future cost estimates to the beam-search scoring function and reported large gains for an n-gram LM and RNNLM. We re-implement this future cost heuristic, $f(\cdot)$, and further propose a new search heuristic, $g(\cdot)$, which collects internal unigram statistics during decoding. We keep hypotheses in the beam if their score is close to a theoretical upper bound, the product of the best word probabilities given any history within the explored search space. For each word $\bar{w} \in \mathrm{BOW}(w)$ we maintain a heuristic score estimate $\hat{P}(\bar{w})$, which we initialize to 0. Each time the search algorithm visits a new context, we update the estimates such that $\hat{P}(\bar{w})$ is the current best score for $\bar{w}$:

$$\hat{P}(\bar{w}) = \max_{c \in C_t} P(\bar{w} \mid c) \qquad (3)$$

where $C_t$ is the set of contexts (i.e. ordered prefixes of the form $w_1^t$) explored by beam search so far. Thus, instead of computing a future cost, we compare the actual score of a partial hypothesis with the product of heuristic estimates of its words. This is especially useful for model combinations since all models are taken into account. We also implement hypothesis recombination to further reduce the number of search errors. More formally, at each time step $t$ our beam search keeps the $n$ best hypotheses according to a scoring function $S(\cdot)$ that sets the partial model score $s(\cdot)$ against the estimates $g(\cdot)$.
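The search procedure can be sketched as follows. This is one possible reading of the description, not a definitive implementation: it assumes log-domain scores, ranks a prefix by its log-probability minus the summed log-estimates of its words, breaks ties by the model score, and omits hypothesis recombination. The interface `cond_prob(word, prefix)` and the toy bigram table are hypothetical, and the model is assumed to return strictly positive probabilities.

```python
import math
from collections import Counter


def beam_search(bag, cond_prob, beam_size=5):
    """Order `bag` by beam search, ranking prefixes by s(.) minus g(.)."""
    best = {w: 0.0 for w in bag}  # heuristic estimates, initialised to 0
    beam = [((), 0.0)]            # pairs of (prefix, log-probability s)
    for _ in range(len(bag)):
        candidates = []
        for prefix, s in beam:
            remaining = Counter(bag) - Counter(prefix)
            for w in remaining:
                p = cond_prob(w, prefix)
                # Keep the best probability seen for w in any explored context.
                best[w] = max(best[w], p)
                candidates.append((prefix + (w,), s + math.log(p)))

        def rank(item):
            prefix, s = item
            g = sum(math.log(best[w]) for w in prefix)
            # Assumption: ties in s - g are broken by the model score s.
            return (s - g, s)

        beam = sorted(candidates, key=rank, reverse=True)[:beam_size]
    # Complete hypotheses contain the same words, so g is identical for all.
    return max(beam, key=lambda item: item[1])


# Hypothetical toy bigram model with a constant back-off probability.
BIGRAMS = {("<s>", "we"): 0.6, ("we", "are"): 0.7, ("are", "here"): 0.8}


def toy_bigram(word, prefix):
    prev = prefix[-1] if prefix else "<s>"
    return BIGRAMS.get((prev, word), 0.05)


order, logp = beam_search(["here", "are", "we"], toy_bigram, beam_size=2)
```

At the first time step only one context has been explored, so every candidate matches its own estimate exactly and the ranking falls back on the tie-break. The estimates only discriminate between hypotheses once several contexts have been visited.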

Experimental Setup
We evaluate using data from the English-German news translation task (Bojar et al., 2015, WMT) and using the English Penn Treebank data (Marcus et al., 1993, PTB). Since additional knowledge sources are often available in practice, such as access to the source sentence in a translation scenario, we also report on bilingual experiments for the WMT task.

Data and evaluation
The WMT parallel training data includes Europarl v7, Common Crawl, and News Commentary v10.
We use news-test2013 for tuning model combinations and news-test2015 for testing. All monolingual models for the WMT task were trained on the German news2015 corpus (∼51.3M sentences). For PTB, we use preprocessed data by Schmaltz et al. (2016) for a fair comparison (∼40k sentences for training). We evaluate using the multi-bleu.perl script for WMT and mteval-v13.pl for PTB.

Model settings
For WMT, the bag2seq parameter settings follow the recent NMT systems trained on WMT data. We use a 50k vocabulary, 620-dimensional word embeddings and 1000 hidden units in the decoder LSTM cells. On the encoder side, the input tokens are embedded to form annotations of the same size as the hidden units in the decoder. The RNNLM is based on the "large" setup of Zaremba et al. (2014) [...] uses 300-dimensional word embeddings and 500 hidden units in the decoder LSTM. We also compare to GYRO (de Gispert et al., 2014), which explicitly targets the word-ordering problem. We extracted 1-gram to 5-gram phrase rules from the PTB training data and used an n-gram LM for decoding. For [...]

Word Ordering on WMT data
The top of Tab. 1 shows that bag2seq outperforms all other language models by up to 4.2 BLEU when ordering German (bold numbers highlight its improvements). This suggests that explicitly presenting all available tokens to the decoder during search enables it to make better word order choices. A combination of RNNLM, NPLM and n-gram LM yields a higher score than the individual models, but further adding bag2seq yields a large gain of 4.5 BLEU, confirming its suitability for the word-ordering task.
In the bilingual setting in the bottom of Tab. 1, the seq2seq model is given English input text and the beam decoder is constrained to generate permutations of German BOWs. This is effectively a translation task with knowledge of the target BOWs, and seq2seq provides a strong baseline since it uses source sequence information. Even so, adding bag2seq yields a 2.9 BLEU gain, and adding it to the combination of all other models improves the score by 1.8 BLEU. This suggests that it could also help for machine translation rescoring by selecting hypotheses that constitute good word orderings.

Word Ordering on the Penn Treebank
Tab. 2 shows the performance of different models and search heuristics on the Penn Treebank: using no heuristic (none) vs. $f(\cdot)$ and $g(\cdot)$ described in Section 3. Numbers in bold mark the best result for a given model. We compare against the LM-based method of de Gispert et al. (2014) and the n-gram and RNNLM (LSTM) models of Schmaltz et al. (2016), of which the latter achieves the best BLEU score of 42.7. We can reproduce or surpass prior work for n-gram and RNNLM and show that $g(\cdot)$ outperforms $f(\cdot)$ for these models. This also holds when adding a 900k sample from the English Gigaword corpus as proposed by Schmaltz et al. (2016). However, bag2seq underperforms RNNLM at this large beam size.
Since decoding is slow for large beam sizes, we compare bag2seq to the n-gram and RNNLM using a small beam of size 5 in Tab. 3. The first three rows show that decoding without heuristics is much easier with bag2seq, which outperforms n-gram and RNNLM by a large margin with 33.4 BLEU. The RNNLM needs heuristic $f(\cdot)$ to match this performance. For bag2seq, using heuristic estimates is worse than just using its partial scores for search. We suspect that its partial model scores are obfuscated by the heuristic estimates and that the amount of their contribution should probably be tuned on a held-out set. Using the same beam size, ensembles yield better results, but the best results are achieved by combining RNNLM and bag2seq (37.9 BLEU). This confirms our findings on WMT data that these models are highly complementary for word ordering. The results for beam=64 follow this pattern and identify an interaction between heuristics and beam size. While we get the best results for beam=5 using $f(\cdot)$, heuristic $g(\cdot)$ seems to perform better for larger beams, perhaps because the internal unigram statistics become more reliable. Finally, RNNLM+bag2seq with $g(\cdot)$ and beam=64 outperforms LSTM-512 by 0.8 BLEU. This is significant because decoding in this configuration is also ∼4x faster than decoding with a single RNNLM and beam=512, as shown in Fig. 2.

Conclusion
We have compared various models for the word-ordering task and proposed a new model architecture inspired by attention-based sequence-to-sequence models that helps performance for both German and English tasks. We have also proposed a novel search heuristic and found that using a model combination together with this heuristic and a modest beam size provides a good trade-off between speed and quality and outperforms prior work on the PTB task.