genCNN: A Convolutional Architecture for Word Sequence Prediction

We propose a novel convolutional architecture, named $gen$CNN, for word sequence prediction. Unlike previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed-length vector. Instead, we use a convolutional neural network to predict the next word from a history of words of variable length. Also different from existing feedforward networks for language modeling, our model can effectively fuse local and global correlations in the word sequence, with a convolution-gating strategy specifically designed for the task. We argue that our model gives an adequate representation of the history, and can therefore naturally exploit both short- and long-range dependencies. Our model is fast, easy to train, and readily parallelized. Our extensive experiments on text generation and $n$-best re-ranking in machine translation show that $gen$CNN outperforms the state of the art by large margins.


Introduction
Both language modeling (Wu and Khudanpur, 2003; Mikolov et al., 2010; Bengio et al., 2003) and text generation (Axelrod et al., 2011) boil down to modeling the conditional probability of a word given the preceding words. Previously, this was mostly done through purely memory-based approaches, such as $n$-grams, which cannot deal with long sequences and have to resort to heuristics (called smoothing) for rare ones. Another family of methods is based on distributed representations of words, usually tied to a neural-network (NN) architecture for estimating the conditional probabilities of words.
Two categories of neural networks have been used for language modeling: 1) recurrent neural networks (RNN), and 2) feedforward networks (FFN):

• The RNN-based models, including variants like LSTM, enjoy more popularity, mainly due to their flexible structures for processing word sequences of arbitrary length, and their recent empirical success (Sutskever et al., 2014; Graves, 2013). We argue, however, that RNNs, with their power built on the recursive use of relatively simple computation units, are forced to make a greedy summarization of the history, and are consequently not efficient at modeling word sequences, which clearly have a bottom-up structure.

• The FFN-based models, on the other hand, avoid this difficulty by directly taking the history as input. However, FFNs are fully-connected networks, rendering them inefficient at capturing the local structures of language. Moreover, their "rigid" architectures make it futile to handle the great variety of patterns in the long-range correlations of words.

We propose a novel convolutional neural network architecture, named genCNN, for efficiently combining local and long-range structures of language for the purpose of modeling conditional probabilities. genCNN can be directly used in generating a word sequence (i.e., text generation) or evaluating the likelihood of a word sequence (i.e., language modeling). We also show the empirical superiority of genCNN on both tasks over traditional $n$-grams and its RNN and FFN counterparts.

Overview
As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types: • αCNN as the "front-end", dealing with the history that is closest to the prediction; • βCNNs (which can repeat), in charge of more "ancient" history.
Together, genCNN takes a history e_{1:t} of arbitrary length to predict the next word e_{t+1} with probability p(e_{t+1} | e_{1:t}; Θ), based on a representation φ(e_{1:t}; Θ) produced by the CNN and a |V|-class soft-max:

p(e_{t+1} | e_{1:t}; Θ) ∝ e^{µ_{e_{t+1}}^⊤ φ(e_{1:t}) + b_{e_{t+1}}}.   (2)

genCNN is fully tailored for modeling the sequential structure of natural language, notably different from conventional CNN (Lawrence et al., 1997; Hu et al., 2014) in 1) its specifically designed weight-sharing strategy (in αCNN), 2) its gating design, and 3) certainly its recursive architecture. Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., αCNN and βCNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure of the word sequence with their convolution-gating architecture.
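To make Equation (2) concrete, the following is a minimal sketch of the soft-max prediction step: given a history representation φ(e_{1:t}) (here a hypothetical fixed vector `phi`), the next-word distribution comes from per-word weight vectors µ and biases b. The function name and toy dimensions are illustrative, not from the paper.

```python
import numpy as np

def next_word_probs(phi, mu, b):
    """Equation (2): p(e_{t+1} | e_{1:t}) ∝ exp(mu[w] . phi + b[w]).

    phi : (d,)   history representation from the CNN
    mu  : (V, d) per-word soft-max weight vectors
    b   : (V,)   per-word biases
    """
    scores = mu @ phi + b
    scores -= scores.max()     # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()         # normalize over the |V|-class soft-max

# toy example with vocabulary size 5 and representation dimension 8
rng = np.random.default_rng(0)
phi = rng.standard_normal(8)
mu, b = rng.standard_normal((5, 8)), rng.standard_normal(5)
p = next_word_probs(phi, mu, b)
```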

genCNN: Architecture
We start by discussing the convolutional architecture of αCNN as a stand-alone sentence model, and then proceed to the recursive structure. After that we give a comparative analysis of the mechanism of genCNN. αCNN, just like a normal CNN, has a fixed architecture with a predefined maximum number of words (denoted L_α). History shorter than L_α will be filled with zero paddings, and history longer than that will be folded and fed to the βCNN after it, as elaborated in Section 3.3. Similar to most other CNNs, αCNN alternates between convolution layers and pooling layers, with finally a fully connected layer to reach the representation before the soft-max, as illustrated in Figure 2. Unlike the toyish example in Figure 2, in practice we use a larger and deeper αCNN with L_α = 30 or 40, and two or three convolution layers (see Section 4.1).
In Section 3.1 we introduce the hybrid design of convolution in genCNN for capturing structures of a different nature in word sequence prediction. In Section 3.2, we discuss the design of the gating mechanism.

αCNN: Convolution
Different from a conventional CNN, the weights of the convolution units in αCNN are only partially shared. More specifically, in the convolution units there are two types of feature-maps: TIME-FLOW and TIME-ARROW, illustrated respectively with the unfilled and filled nodes in Figure 2. The parameters for TIME-FLOW are shared among different convolution units, while for TIME-ARROW the parameters are location-dependent. Intuitively, TIME-FLOW acts more like a conventional CNN (e.g., that in (Hu et al., 2014)), aiming to understand the overall temporal structure of the word sequence; TIME-ARROW, on the other hand, works more like a traditional NN-based language model (Vaswani et al., 2013; Bengio et al., 2003): with its location-dependent parameters, it focuses on the prediction task and on capturing the direction of time. The output of the convolution at location i can be written as

z_i^{(ℓ,f)}(x) = σ( w^{(ℓ,f)} ẑ_i^{(ℓ−1)} + b^{(ℓ,f)} ),

where z_i^{(ℓ,f)}(x) gives the output of the feature-map of type f for location i in Layer-ℓ; σ(·) is the activation function, e.g., Sigmoid or ReLU (Dahl et al., 2013); w_TF^{(ℓ,f)} denotes the location-independent parameters for f ∈ TIME-FLOW on Layer-ℓ, while w_TA^{(ℓ,f,i)} stands for the location-dependent parameters for f ∈ TIME-ARROW at location i on Layer-ℓ; ẑ_i^{(ℓ−1)} denotes the segment of Layer-(ℓ−1) for the convolution at location i, while ẑ_i^{(0)} concatenates the vectors for k_1 words from the sentence input x.
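The hybrid weight-sharing can be sketched as a single convolution layer that concatenates a shared (TIME-FLOW) filter bank with a per-location (TIME-ARROW) filter bank. The function name, the Sigmoid choice, and the toy shapes are illustrative assumptions; the real model stacks several such layers with gating in between.

```python
import numpy as np

def alpha_cnn_conv(z, w_tf, w_ta, b_tf, b_ta, k=2):
    """One convolution layer with αCNN's partial weight-sharing.

    z    : (L, d)              input layer (e.g., word embeddings)
    w_tf : (F_tf, k*d)         TIME-FLOW filters, shared across locations
    w_ta : (L-k+1, F_ta, k*d)  TIME-ARROW filters, one set per location i
    Returns (L-k+1, F_tf + F_ta) feature-maps.
    """
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))   # Sigmoid activation
    out = []
    for i in range(z.shape[0] - k + 1):
        seg = z[i:i + k].ravel()                 # window of k word vectors
        flow = sigma(w_tf @ seg + b_tf)          # location-independent part
        arrow = sigma(w_ta[i] @ seg + b_ta[i])   # location-dependent part
        out.append(np.concatenate([flow, arrow]))
    return np.stack(out)
```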

Gating Network
Previous CNNs, including those for NLP tasks (Hu et al., 2014;Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the "fusion" decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps.This is essentially a soft template matching (Lawrence et al., 1997), which works for tasks like classification, but undesired for maintaining the composition functionality of convolution.In this paper, we propose to use separate gating networks to release the scoring duty from the convolution, and let it focus on composition.Similar idea has been proposed by (Socher et al., 2011) for recursive neural networks on parsing tasks, but never been combined with a convolutional architecture.

Suppose we have feature-maps on Layer-ℓ and gating on Layer-(ℓ+1). For the j-th gating window, we take the corresponding segment of the layer below as the input (denoted z̄_j^{(ℓ)}) to the gating network, as illustrated in Figure 3. We use a separate gate for each feature-map, but follow a different parametrization strategy for TIME-FLOW and TIME-ARROW. With window size 2, the gating is binary, and we use a logistic regressor to determine the weights of the two candidates. For f ∈ TIME-ARROW, with location-dependent w_gate^{(ℓ,f,j)}, the normalized weight for the left side is

g_j^{(ℓ+1,f)} = 1 / (1 + e^{−w_gate^{(ℓ,f,j)} z̄_j^{(ℓ)}}),   (4)

while for f ∈ TIME-FLOW, the parameters of the corresponding gating network, denoted w_gate^{(ℓ,f)}, are shared. The gated feature-map is then a weighted sum of the feature-maps from the two windows:

z_j^{(ℓ+1,f)} = g_j^{(ℓ+1,f)} z_{2j−1}^{(ℓ,f)} + (1 − g_j^{(ℓ+1,f)}) z_{2j}^{(ℓ,f)}.   (5)

We find that this gating strategy works significantly better than direct pooling over feature-maps, and also slightly better than a hard-gate version of Equation (5).
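A minimal sketch of the soft gating over adjacent window pairs follows. As a simplifying assumption, the gate input here is the concatenation of the two candidate feature vectors themselves (rather than a separate segment of the layer below), and the gating parameters are shared across locations as for TIME-FLOW; all names are hypothetical.

```python
import numpy as np

def gate_feature_maps(z, w_gate):
    """Soft gating over adjacent window pairs, in the spirit of Eqs. (4)-(5).

    z      : (2J, F)  convolution feature-maps, grouped into J pairs
    w_gate : (F, 2F)  one logistic regressor per feature-map
    Returns (J, F) gated feature-maps.
    """
    J, F = z.shape[0] // 2, z.shape[1]
    out = np.empty((J, F))
    for j in range(J):
        left, right = z[2 * j], z[2 * j + 1]
        zbar = np.concatenate([left, right])         # gate input
        g = 1.0 / (1.0 + np.exp(-(w_gate @ zbar)))   # weight of the left side
        out[j] = g * left + (1.0 - g) * right        # weighted sum, Eq. (5)
    return out
```

With all-zero gating parameters each gate outputs 0.5, so the layer degenerates to plain averaging; trained gates instead learn which window to keep per feature-map.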

Recursive Architecture
As suggested earlier in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing, named βCNN, to summarize the history out of the scope of αCNN. More specifically, the output of βCNN (with the same dimension as the word-embedding) is put before the first word as input to the αCNN, as illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with weights shared across its convolution units. In a sense, βCNN has only TIME-FLOW feature-maps.
All βCNNs are identical and recursively aligned, enabling genCNN to handle sentences of arbitrary length. We put a special switch after each βCNN to turn it off (replacing its output with a padding vector, shown as "/" in Figure 4) when there is no history assigned to it. As a result, when the history is shorter than L_α, the recursive structure reduces to αCNN. In practice, 90+% of sentences can be modeled by αCNN with L_α = 40, and 99+% of sentences can be contained with one extra βCNN. Our experiments show that this recursive strategy yields better estimates of the conditional density than neglecting the out-of-scope history (Section 6.1.2). In practice, we found that a larger (greater L_α) and deeper αCNN works better than a small αCNN with more recursion of βCNN, which is consistent with our intuition that the bottom-up convolutional architecture is well suited for modeling the sequence.
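The padding/summarization logic above can be sketched as follows, with one level of recursion: if the history fits within L_α it is zero-padded, otherwise the out-of-scope prefix is summarized into a single vector placed before the most recent words. `beta_cnn` is a stand-in callable (the real βCNN is itself a convolutional network); all names are hypothetical.

```python
import numpy as np

def build_alpha_input(history, L_alpha, beta_cnn, d):
    """Assemble the alphaCNN input from a (possibly long) history.

    history  : (T, d) embeddings of the history words
    beta_cnn : callable summarizing older words into one (d,) vector
    Returns an (L_alpha, d) matrix.
    """
    T = history.shape[0]
    if T < L_alpha:
        # betaCNN is switched off: fill with zero padding ("/")
        pad = np.zeros((L_alpha - T, d))
        return np.vstack([pad, history])
    # summarize the "ancient" history and prepend it to the recent words
    summary = beta_cnn(history[: T - (L_alpha - 1)])
    return np.vstack([summary[None, :], history[T - (L_alpha - 1):]])
```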

TIME-FLOW vs. TIME-ARROW
Both conceptually and systemically, genCNN gives two interwoven treatments of the word history. With globally-shared parameters in the convolution units, TIME-FLOW summarizes what has been said. The hierarchical convolution+gating architecture in TIME-FLOW enables it to model the composition in language, yielding representations of segments at different intermediate layers. TIME-FLOW is aware of the sequential direction, inherited from the space-awareness of CNN, but it is not sensitive enough to the prediction task, due to the uniform weights in the convolution.
On the other hand, TIME-ARROW, living in the location-dependent parameters of the convolution units, acts like an arrow pinpointing the prediction task. TIME-ARROW has predictive power all by itself, but it concentrates on capturing the direction of time and consequently falls short on modeling long-range dependencies.
TIME-FLOW and TIME-ARROW have to work together for optimal performance in predicting what is going to be said. This intuition has been empirically verified, as our experiments demonstrate that TIME-FLOW or TIME-ARROW alone performs inferiorly. One can imagine that, through the layer-by-layer convolution and gating, TIME-ARROW gradually picks the most relevant parts from the representation of TIME-FLOW for the prediction task, even if those parts are a long distance ahead.

genCNN vs. RNN-LM
Different from RNNs, which recursively apply a relatively simple processing unit, genCNN gains its ability in sequence modeling mostly from its flexible and powerful bottom-up convolution architecture. genCNN takes the "uncompressed" history, and therefore avoids

• the difficulty of finding a representation for the history (i.e., unfinished sentences), especially those ending in the middle of a chunk (e.g., "the cat sat on the");

• the damping effect in RNNs when the history-summarizing hidden states are updated at each time step, which makes long-term memory rather difficult.

Both drawbacks can only be partially ameliorated with a complicated design of gates (Hochreiter and Schmidhuber, 1997) and/or heavier processing units (essentially a fully connected DNN) (Sutskever et al., 2014).

genCNN: Training
The parameters Θ of a genCNN consist of the parameters for the CNN Θ_nn, the word-embedding Θ_embed, and the parameters for the soft-max Θ_softmax. All the parameters are jointly learned by maximizing the likelihood of the observed sentences. Formally, the log-likelihood of a sentence S_n = [e_1^{(n)}, ..., e_{T_n}^{(n)}] is

log p(S_n; Θ) = Σ_{t=1}^{T_n} log p(e_t^{(n)} | e_{1:t−1}^{(n)}; Θ),

which can be trivially split into T_n training instances during the optimization, in contrast to the training of RNN, which requires unfolding through time due to the temporal dependency of the hidden states.
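The factorized log-likelihood means each sentence decomposes into independent (history, next-word) training instances, with no unfolding through time. A minimal illustration:

```python
def training_instances(sentence):
    """Split one sentence into (history, next-word) training instances.

    Since log p(S) = sum_t log p(e_t | e_{1:t-1}), a sentence of length T
    yields T independent instances that can be shuffled into mini-batches.
    """
    return [(sentence[:t], sentence[t]) for t in range(len(sentence))]

pairs = training_instances(["the", "cat", "sat"])
# pairs[0] == ([], "the"); pairs[2] == (["the", "cat"], "sat")
```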

Implementation Details
Architectures: In all of our experiments (Sections 5 and 6) we set the maximum number of words for αCNN to 30 and that for βCNN to 20. αCNN has two convolution layers (both containing TIME-FLOW and TIME-ARROW convolution) and two gating layers, followed by a fully connected layer (400 dimensions) and then a soft-max layer. The numbers of feature-maps for TIME-FLOW are respectively 150 (1st convolution layer) and 100 (2nd convolution layer), and TIME-ARROW has the same numbers of feature-maps. βCNN is relatively simple, with two convolution layers containing only TIME-FLOW with 150 feature-maps, two gating layers, and a fully connected layer. We use ReLU as the activation function for the convolution layers and switch to Sigmoid for the fully connected layers. We use word embeddings of dimension 100.
Soft-max: Calculating the full soft-max is expensive, since it has to enumerate all the words in the vocabulary (in our case 40K words) in the denominator. Here we take a simple hierarchical approximation of it, following (Bahdanau et al., 2014). Basically, we group the words into 200 clusters (indexed by c_m), and factorize (in an approximate sense) the conditional probability of a word into the probability of its cluster and the probability of the word given its cluster:

p(e_t | e_{1:t−1}; Θ) ≈ p(c_m | e_{1:t−1}; Θ) p(e_t | c_m; Θ_softmax).
We found that this simple heuristic can speed up the optimization by 5 times with only a slight loss of accuracy.
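The two-level factorization can be sketched as follows: the model scores the (few) clusters, then normalizes word scores only within the predicted word's cluster. The tiny cluster assignment and logits here are hypothetical; the paper uses 200 clusters over a 40K-word vocabulary.

```python
import numpy as np

def two_level_softmax(scores_cluster, scores_word, cluster_of, word):
    """p(e_t | h) ≈ p(c_m | h) * p(e_t | c_m), with c_m the word's cluster.

    scores_cluster : (C,) logits over clusters
    scores_word    : (V,) word logits, normalized within the cluster only
    cluster_of     : list mapping each word id to its cluster id
    """
    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    c = cluster_of[word]
    p_c = softmax(scores_cluster)[c]                     # cluster probability
    members = [w for w, cw in enumerate(cluster_of) if cw == c]
    in_cluster = softmax(scores_word[members])           # within-cluster only
    return p_c * in_cluster[members.index(word)]
```

The speed-up comes from normalizing over C clusters plus one cluster's members instead of all V words; the factorized probabilities still sum to one over the vocabulary.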
Optimization: We use stochastic gradient descent with mini-batches (size 500) for optimization, aided further by AdaGrad (Duchi et al., 2011). For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings (trained on the same dataset as the main task), and set all the other parameters by randomly sampling from the uniform distribution on [−0.1, 0.1]. The optimization is done mainly on a Tesla K40 GPU, and takes about 2 days for training on a dataset containing 1M sentences.

Experiments: Sentence Generation
In this experiment, we randomly generate sentences by recurrently sampling e_{t+1} ∼ p(e_{t+1} | e_{1:t}; Θ) and putting the newly generated word into the history, until EOS (end-of-sentence) is generated. We consider generating two types of sentences: 1) plain sentences, and 2) sentences with dependency parsing, covered respectively in Sections 5.1 and 5.2.
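The sampling loop is simple enough to sketch directly. Here `next_word_dist` stands in for the trained genCNN followed by the soft-max; the `max_len` cap is an assumption added so the sketch always terminates.

```python
import numpy as np

def generate(next_word_dist, vocab, eos="EOS", max_len=50, seed=0):
    """Sample a sentence by repeatedly drawing e_{t+1} ~ p(. | e_{1:t}).

    next_word_dist : callable mapping a history (list of words) to a
                     probability vector over `vocab`.
    """
    rng = np.random.default_rng(seed)
    history = []
    while len(history) < max_len:
        p = next_word_dist(history)
        word = vocab[rng.choice(len(vocab), p=p)]
        if word == eos:
            break                  # stop once EOS is generated
        history.append(word)       # the new word joins the history
    return history
```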

Natural Sentences
We train genCNN on Wiki data with 112M words for one week, with some representative randomly generated examples given in Table 1 (upper and middle blocks). We try two settings, asking genCNN to 1) finish a sentence started by a human (upper block), or 2) generate a sentence from the beginning (middle block). It is fairly clear that most of the time genCNN can generate sentences that are syntactically grammatical and semantically meaningful. More specifically, most of the sentences can be aligned to a parse tree with reasonable structure. It is also worth noting that quotation marks are always generated in pairs and in the correct order, even across a relatively long distance, as exemplified by the first sentence in the upper block.
'' we are in the building of china 's social development and the businessmen audience , '' he said .
clinton was born in DDDD , and was educated at the university of edinburgh.
bush 's first album , '' the man '' , was released on DD november DDDD .
it is one of the first section of the act in which one is covered in real place that recorded in norway .
this objective is brought to us the welfare of our country russian president putin delivered a speech to the sponsored by the 15th asia pacific economic cooperation ( apec ) meeting in an historical arena on oct .
light and snow came in kuwait and became operational , but was rarely placed in houston .

Table 1: Examples of sentences generated by genCNN. In the upper block (rows 1-4) the underlined words are given by a human; in the middle block (rows 5-8), the sentences are generated without any hint. The bottom block (rows 9-12) shows sentences with dependency tags generated by genCNN trained on parsed examples.

Sentences with Dependency Tags
For training, we first parse (Klein and Manning, 2002) the English sentences and feed sequences with dependency tags as follows to genCNN, where 1) each pair of parentheses contains a subtree, and 2) the symbol " " indicates that the word next to it is the dependency head of the corresponding subtree. Some representative examples generated by genCNN are given in Table 1 (bottom block). As it suggests, genCNN is fairly accurate in respecting the rules of parentheses, and, perhaps more remarkably, it gets the dependency-tree head correct most of the time.

Experiments: Language Modeling
We evaluate our model as a language model in terms of both perplexity (Brown et al., 1992) and its efficacy in re-ranking the $n$-best candidates from state-of-the-art models in statistical machine translation, in both cases comparing against the following competitor language models.
Competitor Models: We compare genCNN to the following competitor models:

• 5-gram: We use the SRI Language Modeling Toolkit (Stolcke and others, 2002) to train a 5-gram language model with modified Kneser-Ney smoothing;

• FFN-LM: The neural language model based on a feedforward network (Vaswani et al., 2013). We vary the input window size from 5 to 20; the performance stops improving after window size 20;

• RNN: We use a public implementation of the RNN-based language model, with hidden size 600 for its optimal performance;

• LSTM: We use the code in GroundHog, but vary the hyper-parameters, including the depth and word-embedding dimension, for the best performance. LSTM (Hochreiter and Schmidhuber, 1997) is widely considered to be the state of the art for sequence modeling.

Perplexity
We test the performance of genCNN on PENN TREEBANK and FBIS, two public datasets of different sizes.

On PENN TREEBANK
Although a relatively small dataset, PENN TREEBANK is widely used as a language modeling benchmark (Graves, 2013; Mikolov et al., 2010). It has 930,000 words in the training set, 74,000 words in the validation set, and 82,000 words in the test set. We use exactly the same settings as in (Mikolov et al., 2010), with a 10,000-word vocabulary (all out-of-vocabulary words replaced with unknown) and an end-of-sentence token (EOS). In addition to the conventional testing strategy, where the models are kept unchanged during testing, Mikolov et al. (2010) propose to also update the parameters in an online fashion when seeing test sentences. This way of testing, named "dynamic evaluation", is also adopted by Graves (2013).
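As a reminder of the evaluation metric, perplexity is the exponentiated average negative log-probability the model assigns to each test word (EOS tokens included in this setup). A minimal computation over a hypothetical list of per-word log-probabilities:

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities.

    log_probs : values log p(e_t | e_{1:t-1}) for every word of the test set.
    A model assigning each word probability 1/2 has perplexity 2.
    """
    return math.exp(-sum(log_probs) / len(log_probs))
```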
From Table 2, genCNN manages to give superior perplexity under both metrics, with about a 25-point reduction over the widely used 5-gram, and over a 10-point reduction from LSTM, the state of the art and the second-best performer. We defer the comparison of genCNN variants to the next experiment on a larger dataset (FBIS), since PENN TREEBANK is too small for evaluating some of the differences between them.

On FBIS

From Table 3 (upper block), genCNN clearly wins again in the comparison to its competitors, with an over 25-point margin over LSTM (in its optimal setting), the second-best performer. Interestingly, genCNN also outperforms its own variants quite significantly (bottom block): 1) with only TIME-ARROW (same number of feature-maps), the performance deteriorates considerably, for losing the ability to reliably capture long-range correlation; 2) with only TIME-FLOW the performance gets even worse, for partially losing the sensitivity to the prediction task. It is quite remarkable that, although αCNN alone (with L_α = 30) can achieve good results, the recursive structure in the full genCNN further decreases the perplexity by over 3 points, indicating that genCNN can benefit from modeling dependencies over a range as long as 30 words.

Re-ranking for Machine Translation
In this experiment, we re-rank the 1000-best English translation candidates for Chinese sentences generated by a statistical machine translation (SMT) system, and compare genCNN with other language models in the same setting.
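The re-ranking step itself amounts to adding the language model's log-probability as an extra feature to each candidate's base score and picking the top-scoring candidate. The weights below are placeholders (in the experiments all feature weights are tuned with MERT), and `lm_score` stands in for the genCNN log-probability.

```python
def rerank(candidates, lm_score, base_weight=1.0, lm_weight=0.5):
    """Re-rank an n-best list with an extra LM feature.

    candidates : list of (translation, base_model_score) pairs
    lm_score   : callable returning a language-model score for a sentence
    Returns the translation with the highest combined score.
    """
    scored = [(base_weight * s + lm_weight * lm_score(t), t)
              for t, s in candidates]
    return max(scored)[1]
```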

SMT setup
The baseline hierarchical phrase-based SMT system (Chinese→English) was built using Moses, a widely accepted state-of-the-art system, with default settings. The bilingual training data is from the NIST MT2012 constrained track, reduced to 1.1M sentence pairs using the selection strategy in (Axelrod et al., 2011). The baseline uses a conventional 5-gram language model (LM), estimated with modified Kneser-Ney smoothing (Chen and Goodman, 1996) on the English side of the 329M-word Xinhua portion of English Gigaword (LDC2011T07). We also try FFN-LM as a much stronger language model in decoding. The weights of all the features are tuned via MERT (Och and Ney, 2002) on NIST MT05, and tested on NIST MT06 and MT08. Case-insensitive NIST BLEU is used in evaluation.
Re-ranking with genCNN significantly improves the quality of the final translation: it increases the BLEU score by over 1.33 points over the Moses baseline on average. This gain barely diminishes on translations decoded with an enhanced language model: the genCNN re-ranker still achieves a 1.29-point improvement on top of Moses with FFN-LM, which is 1.76 points over Moses (default setting). To see the significance of this improvement, note that the state-of-the-art Neural Network Joint Model (Devlin et al., 2014) usually brings less than a one-point increase on this task.

Related Work
In addition to the long thread of work on neural network-based language models (Auli et al., 2013; Mikolov et al., 2010; Graves, 2013; Bengio et al., 2003; Vaswani et al., 2013), our work is also related to the effort on modeling long-range dependencies in word sequence prediction (Wu and Khudanpur, 2003). Different from that work on hand-crafting features to incorporate long-range dependencies, our model can elegantly assimilate relevant information, both long- and short-range, in a unified way, with its bottom-up information flow and convolutional architecture. CNNs have been widely used in computer vision and speech (Lawrence et al., 1997; Krizhevsky et al., 2012; LeCun and Bengio, 1995; Abdel-Hamid et al., 2012), and lately in sentence representation (Kalchbrenner and Blunsom, 2013), matching (Hu et al., 2014), and classification (Kalchbrenner et al., 2014). To the best of our knowledge, this is the first time a CNN is used for word sequence prediction. Model-wise, the previous work closest to genCNN is the convolutional model for predicting moves in the game of Go (Maddison et al., 2014), which, when applied recurrently, essentially generates a sequence. Different from the conventional CNN used in (Maddison et al., 2014), genCNN has architectures designed for modeling the composition in natural language and the temporal structure of word sequences.

Conclusion
We propose a convolutional architecture for natural language generation and modeling. Our extensive experiments on sentence generation, perplexity, and $n$-best re-ranking for machine translation show that our model can significantly improve upon the state of the art.

Figure 1: The overall diagram of a genCNN. Here "/" stands for a zero padding. In this example, each CNN component covers 6 words, while in practice the coverage is 30-40 words.

Notations: We use V to denote the vocabulary, e_t (∈ {1, ..., |V|}) to denote the t-th word in a sequence e_{1:T} = [e_1, ..., e_T], and e_t^{(n)} if the sequence itself is further indexed by n.

Figure 2: Illustration of a 3-layer αCNN. Here the unfilled nodes stand for the TIME-FLOW feature-maps, and the filled nodes for TIME-ARROW.


Table 2: PENN TREEBANK results, where the 3rd column is the perplexity under dynamic evaluation; the numbers for RNN and LSTM are as reported in the papers cited above. Numbers in boldface indicate that the result is significantly better than all competitors in the same setting.

Table 4: Results for re-ranking the 1000-best of Moses. Note that the two bottom rows are on a baseline with the enhanced LM.