What to talk about and how? Selective Generation using LSTMs with Coarse-to-Fine Alignment

We propose an end-to-end, domain-independent neural encoder-aligner-decoder model for selective generation, i.e., the joint task of content selection and surface realization. Our model first encodes a full set of over-determined database event records via an LSTM-based recurrent neural network, then utilizes a novel coarse-to-fine aligner to identify the small subset of salient records to talk about, and finally employs a decoder to generate free-form descriptions of the aligned, selected records. Our model achieves the best selection and generation results reported to-date (with 59% relative improvement in generation) on the benchmark WeatherGov dataset, despite using no specialized features or linguistic resources. Using an improved k-nearest neighbor beam filter helps further. We also perform a series of ablations and visualizations to elucidate the contributions of our key model components. Lastly, we evaluate the generalizability of our model on the RoboCup dataset, and get results that are competitive with or better than the state-of-the-art, despite being severely data-starved.


Introduction
We consider the important task of producing a natural language description of a rich world state represented as an over-determined database of event records. This task, which we refer to as selective generation, is often formulated as two subproblems: content selection, which involves choosing a subset of relevant records to talk about from the exhaustive database, and surface realization, which is concerned with generating natural language descriptions for this subset. Learning to perform these tasks jointly is challenging due to the ambiguity in deciding which records are relevant, the complex dependencies between selected records, and the multiple ways in which these records can be described.
Previous work has made significant progress on this task (Chen and Mooney, 2008; Angeli et al., 2010; Kim and Mooney, 2010; Konstas and Lapata, 2012). However, most approaches solve the two subtasks of content selection and surface realization separately, use manual domain-dependent resources (e.g., semantic parsers) and features, or employ template-based generation. This limits domain adaptability and reduces coherence. We take an alternative, neural encoder-aligner-decoder approach to free-form selective generation that jointly performs content selection and surface realization, without using any specialized features, resources, or generation templates. This enables our approach to generalize to new domains. Further, our memory-based model captures the long-range contextual dependencies among records and descriptions, which are integral to this task (Angeli et al., 2010).
We formulate our model as an encoder-aligner-decoder framework that uses recurrent neural networks with long short-term memory units (LSTM-RNNs) (Hochreiter and Schmidhuber, 1997) together with a coarse-to-fine aligner to select and "translate" the rich world state into a natural language description. Our model first encodes the full set of over-determined event records using a bidirectional LSTM-RNN. A novel coarse-to-fine aligner then reasons over multiple abstractions of the input to decide which of the records to discuss. The model next employs an LSTM decoder to generate natural language descriptions of the selected records.
arXiv:1509.00838v2 [cs.CL] 8 Jan 2016
The use of LSTMs, which have proven effective for similar long-range generation tasks (Sutskever et al., 2014; Vinyals et al., 2015b; Karpathy and Fei-Fei, 2015), allows our model to capture the long-range contextual dependencies that exist in selective generation. Further, the introduction of our proposed variation on alignment-based LSTMs (Bahdanau et al., 2014; Xu et al., 2015) enables our model to learn to perform content selection and surface realization jointly, by aligning each generated word to an event record during decoding. Our novel coarse-to-fine aligner avoids searching over the full set of over-determined records by employing two stages of increasing complexity: a pre-selector and a refiner acting on multiple abstractions (low- and high-level) of the record input. The end-to-end nature of our framework has the advantage that it can be trained directly on corpora of record sets paired with natural language descriptions, without the need for ground-truth content selection.
We evaluate our model on a benchmark weather forecasting dataset (WEATHERGOV) and achieve the best results reported to-date on content selection (12% relative improvement in F-1) and language generation (59% relative improvement in BLEU), despite using no domain-specific resources. We also perform a series of ablations and visualizations to elucidate the contributions of the primary model components, and show improvements with a simple k-nearest neighbor beam filter approach. Finally, we demonstrate the generalizability of our model by directly applying it to a benchmark sportscasting dataset (ROBOCUP), where we get results competitive with or better than state-of-the-art, despite being extremely data-starved.

Related Work
Selective generation is a relatively new research area and more attention has been paid to the individual content selection and surface realization subproblems. With regard to the former, Barzilay and Lee (2004) model the content structure from unannotated documents and apply it to text summarization. Barzilay and Lapata (2005) treat content selection as a collective classification problem and simultaneously optimize the local label assignment and their pairwise relations. Liang et al. (2009) address the related task of aligning a set of records to given textual description clauses. They propose a generative semi-Markov alignment model that jointly segments text sequences into utterances and associates each to the corresponding record.
Surface realization is often treated as a problem of producing text according to a given grammar. Soricut and Marcu (2006) propose a language generation system that uses the WIDL-representation, a formalism used to compactly represent probability distributions over finite sets of strings. Wong and Mooney (2007) and Lu and Ng (2011) use synchronous context-free grammars to generate natural language sentences from formal meaning representations. Similarly, Belz (2008) employs probabilistic context-free grammars to perform surface realization. Other effective approaches include the use of tree conditional random fields (Lu et al., 2009) and template extraction within a log-linear framework (Angeli et al., 2010).
Recent work seeks to solve the full selective generation problem through a single framework. Chen and Mooney (2008) and Chen et al. (2010) learn alignments between comments and their corresponding event records using a translation model for parsing and generation. Kim and Mooney (2010) implement a two-stage framework that decides what to discuss using a combination of the methods of Lu et al. (2008) and Liang et al. (2009), and then produces the text based on the generation system of Wong and Mooney (2007). Angeli et al. (2010) propose a unified concept-to-text model that treats joint content selection and surface realization as a sequence of local decisions represented by a log-linear model. Similar to other work, they train their model using external alignments from Liang et al. (2009). Generation then follows as inference over this model, where they first choose an event record, then the record's fields (i.e., attributes), and finally a set of templates that they then fill in with words for the selected fields. Their ability to model long-range dependencies relies on their choice of features for the log-linear model, while the template-based generation further employs some domain-specific features for fluent output. Konstas and Lapata (2012) propose an alternative method that simultaneously optimizes the content selection and surface realization problems. They employ a probabilistic context-free grammar that specifies the structure of the event records, and then treat generation as finding the best derivation tree according to this grammar. However, their method still selects and orders records in a local fashion via a Markovized chaining of records. Konstas and Lapata (2013) improve upon this approach with global document representations. However, this approach also requires alignment during training, which they estimate using the method of Liang et al. (2009).
We treat the problem of selective generation as end-to-end learning via a recurrent neural network encoder-aligner-decoder model, which enables us to jointly learn content selection and surface realization directly from database-text pairs, without the need for an external aligner or ground-truth selection labels. The use of LSTM-RNNs enables our model to capture the long-range dependencies that exist among the records and natural language output. Additionally, the model does not rely on any manually-selected or domain-dependent features, templates, or parsers, and is thereby generalizable. The alignment-RNN approach has recently proven successful for generation-style tasks, e.g., machine translation (Bahdanau et al., 2014) and image captioning (Xu et al., 2015). Since selective generation requires identifying the small number of salient records among an over-determined database, we avoid performing exhaustive search over the full record set, and instead propose a novel coarse-to-fine aligner that divides the search complexity into pre-selection and refinement stages.
ROBOCUP We evaluate our model's generalizability on the sportscasting dataset of Chen and Mooney (2008), which consists of only 1539 pairs of temporally ordered robot soccer events (e.g., pass, score) and commentary drawn from the four-game 2001-2004 RoboCup finals (see Fig. 1(b)). Each scenario contains an average of 2.4 event records and a 5.7 word natural language commentary.

The Model
We formulate selective generation as inference over a probabilistic model $P(x_{1:T} \mid r_{1:N})$, where $r_{1:N} = (r_1, r_2, \ldots, r_N)$ is the input set of over-determined event records,¹ $x_{1:T} = (x_1, x_2, \ldots, x_T)$ is the generated description with $x_t$ being the word at time $t$ and $x_0$ being a special start token:

\begin{align}
x^*_{1:T} &= \arg\max_{x_{1:T}} P(x_{1:T} \mid r_{1:N}) \tag{1a} \\
&= \arg\max_{x_{1:T}} \prod_{t=1}^{T} P(x_t \mid x_{0:t-1}, r_{1:N}) \tag{1b}
\end{align}

The goal of inference is to generate a natural language description for a given set of records. An effective means of learning to perform this generation is to use an encoder-aligner-decoder architecture with a recurrent neural network, which has proven effective for related problems in machine translation (Bahdanau et al., 2014) and image captioning (Xu et al., 2015). We propose a variation on this general model with novel components that are well-suited to the selective generation problem.
Our model (Fig. 2) first encodes each input record $r_j$ into a hidden state $h_j$ with $j \in \{1, \ldots, N\}$ using a bidirectional recurrent neural network (RNN). Our novel coarse-to-fine aligner then acts on a concatenation $m_j$ of each record and its hidden state as a multi-level representation of the input to compute the selection decision $z_t$ at each decoding step $t$. The model then employs an RNN decoder to arrive at the word likelihood $P(x_t \mid x_{0:t-1}, r_{1:N})$ as a function of the multi-level input and the hidden state of the decoder $s_{t-1}$ at time step $t-1$. In order to model the long-range dependencies among the records and descriptions (which is integral to effectively performing selective generation (Angeli et al., 2010; Konstas and Lapata, 2012; Konstas and Lapata, 2013)), our model employs LSTM units as the nonlinear encoder and decoder functions.
¹ These records may take the form of an unordered set or have a natural ordering (e.g., temporal in the case of ROBOCUP). In order to make our model generalizable, we treat the set as a sequence and use the order specified by the dataset. We note that it is possible that a different ordering will yield improved performance, since ordering has been shown to be important when operating on sets (Vinyals et al., 2015a).
Encoder Our LSTM-RNN encoder (Fig. 2) takes as input the set of event records represented as a sequence $r_{1:N} = (r_1, r_2, \ldots, r_N)$ and returns a sequence of hidden annotations $h_{1:N} = (h_1, h_2, \ldots, h_N)$, where the annotation $h_j$ summarizes the record $r_j$. This results in a representation that models the dependencies that exist among the records in the database.
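We adopt an encoder architecture similar to that of Graves et al. (2013):

\begin{align}
\begin{pmatrix} i^e_j \\ f^e_j \\ o^e_j \\ g^e_j \end{pmatrix} &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T^e \begin{pmatrix} r_j \\ h_{j-1} \end{pmatrix} \tag{2a} \\
c^e_j &= f^e_j \odot c^e_{j-1} + i^e_j \odot g^e_j \tag{2b} \\
h_j &= o^e_j \odot \tanh(c^e_j) \tag{2c}
\end{align}

where $T^e$ is an affine transformation, $\sigma$ is the logistic sigmoid that restricts its output to $[0, 1]$, $\odot$ denotes element-wise multiplication, $i^e_j$, $f^e_j$, and $o^e_j$ are the input, forget, and output gates of the LSTM, respectively, $g^e_j$ is the candidate update computed from the current input, and $c^e_j$ is the memory cell activation vector. The memory cell $c^e_j$ summarizes the LSTM's previous memory $c^e_{j-1}$ and the current input, which are modulated by the forget and input gates, respectively. Our encoder operates bidirectionally, encoding the records in both the forward and backward directions, which provides a better summary of the input records. In this way, the hidden annotations $h_j = (\overrightarrow{h}_j; \overleftarrow{h}_j)$ concatenate forward $\overrightarrow{h}_j$ and backward $\overleftarrow{h}_j$ annotations, each determined using Equation (2c).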
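The bidirectional encoding described above can be illustrated with a small pure-Python sketch. It assumes the standard LSTM update (gates computed from an affine map of the current record and the previous hidden state); the helper names (`lstm_step`, `run_direction`, `encode_bidirectional`, `random_params`), the toy record features, and the random weights are all illustrative assumptions, not the trained model.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(T, b, r_j, h_prev, c_prev):
    """One encoder step: gates from an affine map of (r_j; h_{j-1}), then the cell update."""
    d = len(h_prev)
    x = r_j + h_prev  # list concatenation (r_j; h_{j-1})
    pre = [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(T, b)]
    i = [sigmoid(a) for a in pre[0:d]]            # input gate
    f = [sigmoid(a) for a in pre[d:2 * d]]        # forget gate
    o = [sigmoid(a) for a in pre[2 * d:3 * d]]    # output gate
    g = [math.tanh(a) for a in pre[3 * d:4 * d]]  # candidate update
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def run_direction(params, records, d):
    """Run the LSTM over the records in the given order, starting from zero state."""
    T, b = params
    h, c = [0.0] * d, [0.0] * d
    annotations = []
    for r_j in records:
        h, c = lstm_step(T, b, r_j, h, c)
        annotations.append(h)
    return annotations

def encode_bidirectional(records, fwd_params, bwd_params, d):
    """Concatenate forward and backward annotations for each record."""
    forward = run_direction(fwd_params, records, d)
    backward = run_direction(bwd_params, records[::-1], d)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

def random_params(input_dim, d, rng):
    """Hypothetical small random weights (4d rows) and zero biases."""
    T = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + d)] for _ in range(4 * d)]
    return T, [0.0] * (4 * d)

rng = random.Random(0)
records = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.2], [0.3, 0.3, 0.9]]  # toy record features
d = 4
fwd = random_params(3, d, rng)
bwd = random_params(3, d, rng)
annotations = encode_bidirectional(records, fwd, bwd, d)
print(len(annotations), len(annotations[0]))
```

Each annotation has twice the hidden size because the forward and backward states are concatenated; the backward half for the last record depends only on that record, since the backward pass starts there.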
Coarse-to-Fine Aligner Having encoded the input records $r_{1:N}$ to arrive at the hidden annotations $h_{1:N}$, the model then seeks to select the content at each time step $t$ that will be used for generation. Our model performs content selection using an extension of the alignment mechanism proposed by Bahdanau et al. (2014), which allows for selection and generation that is independent of the ordering of the input.
In selective generation, the given set of event records is over-determined with only a small subset of salient records being relevant to the output natural language description. Standard alignment mechanisms limit the accuracy of selection and generation by scanning the entire range of over-determined records. In order to better address the selective generation task, we propose a coarse-to-fine aligner that prevents the model from being distracted by non-salient records. Our model aligns based on multiple abstractions of the input: both the original input record as well as the hidden annotations $m_j = (r_j; h_j)$, an approach that has previously been shown to yield better results than aligning based only on the hidden state (Mei et al., 2015).
Our coarse-to-fine aligner avoids searching over the full set of over-determined records by using two stages of increasing complexity: a pre-selector and refiner (Fig. 2). The pre-selector first assigns to each record a probability $p_j$ of being selected, while the standard aligner computes the alignment likelihood $w_{tj}$ over all the records at each time step $t$ during decoding. Next, the refiner produces the final selection decision by re-weighting the aligner weights $w_{tj}$ with the pre-selector probabilities $p_j$:

\begin{align}
p_j &= \mathrm{sigmoid}\big(q^\top \tanh(P m_j)\big) \tag{3a} \\
\beta_{tj} &= v^\top \tanh(W s_{t-1} + U m_j) \tag{3b} \\
w_{tj} &= \exp(\beta_{tj}) \Big/ \sum_{j} \exp(\beta_{tj}) \tag{3c} \\
\alpha_{tj} &= p_j w_{tj} \Big/ \sum_{j} p_j w_{tj} \tag{3d} \\
z_t &= \sum_{j} \alpha_{tj} m_j \tag{3e}
\end{align}

where $P$, $q$, $U$, $W$, $v$ are learned parameters. Ideally, the selection decision would be based on the highest-value alignment $z_t = m_k$ where $k = \arg\max_j \alpha_{tj}$. However, we use the weighted average (Eqn. 3e) as its soft approximation to maintain differentiability of the entire architecture.
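The two-stage computation can be sketched in a few lines of pure Python. This is a minimal illustration under the assumption that the standard aligner is the additive attention of Bahdanau et al. (2014); the function names, the toy dimensions, and all numeric parameter values below are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def preselect(ms, P, q):
    """Pre-selector: one time-independent selection probability p_j per record."""
    return [sigmoid(dot(q, [math.tanh(a) for a in matvec(P, m)])) for m in ms]

def coarse_to_fine(ms, s_prev, p, W, U, v):
    """Refiner for one decoding step: returns (alpha_t, z_t)."""
    Ws = matvec(W, s_prev)
    beta = [dot(v, [math.tanh(a + b) for a, b in zip(Ws, matvec(U, m))]) for m in ms]
    w = softmax(beta)                                  # standard aligner weights
    weighted = [pj * wj for pj, wj in zip(p, w)]
    norm = sum(weighted)
    alpha = [x / norm for x in weighted]               # re-weighted by the pre-selector
    z = [sum(a * m[i] for a, m in zip(alpha, ms)) for i in range(len(ms[0]))]
    return alpha, z

# Toy example: three records with 2-dimensional multi-level representations m_j.
ms = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
P = [[0.5, -0.5], [0.3, 0.8]]
q = [1.0, -1.0]
W = [[0.2, 0.1], [0.0, 0.3]]
U = [[0.4, -0.2], [0.1, 0.6]]
v = [1.0, 0.5]
s_prev = [0.1, -0.3]

p = preselect(ms, P, q)
alpha, z = coarse_to_fine(ms, s_prev, p, W, U, v)
print(p, alpha, z)
```

Note that the pre-selector is computed once per record set, whereas the refiner runs at every decoding step. Setting every `p[j]` to the same value recovers the standard aligner, because the common factor cancels in the normalization.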
The pre-selector assigns large values ($p_j > 0.5$) to a small subset of salient records and small values ($p_j < 0.5$) to the rest. This modulates the standard aligner, which then has to assign a large weight $w_{tj}$ in order to select the $j$-th record at time $t$. In this way, the learned prior $p_j$ makes it difficult for the alignment (attention) to be distracted by non-salient records. Further, we can relate the output of the pre-selector to the number of records that are selected. Specifically, the output $p_j$ expresses the extent to which the $j$-th record should be selected. The summation $\sum_{j=1}^{N} p_j$ can then be regarded as a real-valued approximation to the total number of pre-selected records (denoted as $\gamma$), which we regularize towards, based on validation (see Eqn. 5).
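Decoder Our architecture uses an LSTM decoder that takes as input the current context vector $z_t$, the last word $x_{t-1}$, and the LSTM's previous hidden state $s_{t-1}$. The decoder outputs the conditional probability distribution $P_{x,t} = P(x_t \mid x_{0:t-1}, r_{1:N})$ over the next word, represented as a deep output layer (Pascanu et al., 2014),

\begin{align}
\begin{pmatrix} i^d_t \\ f^d_t \\ o^d_t \\ g^d_t \end{pmatrix} &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T^d \begin{pmatrix} E x_{t-1} \\ s_{t-1} \\ z_t \end{pmatrix} \tag{4a} \\
c^d_t &= f^d_t \odot c^d_{t-1} + i^d_t \odot g^d_t \tag{4b} \\
s_t &= o^d_t \odot \tanh(c^d_t) \tag{4c} \\
l_t &= L_0 (E x_{t-1} + L_s s_t + L_z z_t) \tag{4d} \\
P_{x,t} &= \mathrm{softmax}(l_t) \tag{4e}
\end{align}

where $E$ (an embedding matrix), $L_0$, $L_s$, and $L_z$ are parameters to be learned.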
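A deep output layer of this kind can be sketched as follows. The sketch assumes that the previous word's embedding is summed with linear projections of the decoder state and the context vector, and that the result is projected to the vocabulary and normalized with a softmax; the function name `deep_output`, the row-per-word storage of `E`, and the toy shapes are illustrative assumptions.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deep_output(E, L0, Ls, Lz, prev_word, s_t, z_t):
    """Next-word distribution from the previous word, decoder state, and context.

    E is stored here with one embedding row per vocabulary word (an assumption),
    so E[prev_word] plays the role of E applied to a one-hot x_{t-1}.
    """
    e = E[prev_word]
    hidden = [a + b + c for a, b, c in zip(e, matvec(Ls, s_t), matvec(Lz, z_t))]
    return softmax(matvec(L0, hidden))

# Toy example: vocabulary of 3 words, embedding size 2, state and context size 2.
E = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.5]]
Ls = [[0.3, 0.0], [0.0, 0.3]]
Lz = [[0.1, 0.1], [-0.2, 0.4]]
L0 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
probs = deep_output(E, L0, Ls, Lz, prev_word=2, s_t=[0.2, -0.1], z_t=[1.0, 0.0])
print(probs)
```

The returned list has one probability per vocabulary word, so greedy decoding would simply pick the index with the largest value at each step.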

Training and Inference
We train the model using the database-text pairs $(r_{1:N}, x_{1:T})$ from the training corpora so as to maximize the likelihood of the ground-truth language description $x^*_{1:T}$ (Eqn. 1). Additionally, we introduce a regularization term $(\sum_{j=1}^{N} p_j - \gamma)^2$ that enables the model to influence the pre-selector weights based on the aforementioned relationship between the output of the pre-selector and the number of selected records. Moreover, we also introduce the term $(1.0 - \max_j p_j)$, which accounts for the fact that at least one record should be pre-selected. Note that when $\gamma$ is equal to $N$, the pre-selector is forced to select all the records ($p_j = 1.0$ for all $j$), and the coarse-to-fine alignment reverts to the standard alignment introduced by Bahdanau et al. (2014). Together with the negative log-likelihood of the ground-truth description $x^*_{1:T}$, our loss function becomes

\begin{equation}
\mathcal{L} = -\log P(x^*_{1:T} \mid r_{1:N}) + \Big(\sum_{j=1}^{N} p_j - \gamma\Big)^2 + \big(1.0 - \max_j p_j\big) \tag{5}
\end{equation}

Having trained the model, we generate the natural language description by finding the maximum a posteriori words under the learned model (Eqn. 1). For inference, we perform greedy search starting with the first word $x_1$. Beam search offers a way to perform approximate joint inference. However, we empirically found that beam search does not perform any better than greedy search on the datasets that we consider, an observation that is shared with previous work (Angeli et al., 2010). We later discuss an alternative k-nearest neighbor-based beam filter (see Sec 6.2).
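The training objective described above can be written out directly. The following is a minimal sketch that assumes the three terms are summed with equal weight, as the text suggests; the function name `selective_generation_loss` and the toy numbers are hypothetical.

```python
import math

def selective_generation_loss(word_probs, p, gamma):
    """Negative log-likelihood plus the two pre-selector regularizers.

    word_probs: model probability of each ground-truth word, one entry per time step
    p:          pre-selector probabilities, one per record
    gamma:      target number of pre-selected records (tuned on development data)
    """
    nll = -sum(math.log(q) for q in word_probs)
    count_penalty = (sum(p) - gamma) ** 2   # pushes sum_j p_j towards gamma
    at_least_one = 1.0 - max(p)             # encourages at least one selected record
    return nll + count_penalty + at_least_one

# Toy example: a two-word description and three records.
loss = selective_generation_loss([0.5, 0.25], [0.9, 0.1, 0.2], gamma=1.0)
print(loss)
```

When `gamma` equals the number of records and every `p[j]` is 1.0, both regularizers vanish, which matches the observation that the aligner then reduces to the standard one.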

Experimental Setup
Datasets We analyze our model on the benchmark WEATHERGOV dataset, and use the data-starved ROBOCUP dataset to demonstrate the model's generalizability. Following Angeli et al. (2010), we use WEATHERGOV training, development, and test splits of size 25000, 1000, and 3528, respectively. For ROBOCUP, we follow the evaluation methodology of previous work (Chen and Mooney, 2008), performing four-fold cross-validation whereby we train on three games (approximately 1000 scenarios) and test on the fourth. Within each split, we hold out 10% of the training data as the development set to tune the early-stopping criterion and $\gamma$. We then report the standard average performance (weighted by the number of scenarios) over these four splits.
Training Details On WEATHERGOV, we lightly tune the number of hidden units and $\gamma$ on the development set according to the generation metric (BLEU), and choose 500 units from {250, 500, 750} and $\gamma = 8.5$ from {6.5, 7.5, 8.5, 10.5, 12.5}. For ROBOCUP, we only tune $\gamma$ on the development set and choose $\gamma = 5.0$ from the set {1.0, 2.0, ..., 6.0}. However, we do not retune the number of hidden units on ROBOCUP. For each iteration, we randomly sample a mini-batch of 100 scenarios during back-propagation and use Adam (Kingma and Ba, 2015) for optimization. Training typically converges within 30 epochs. We select the model according to the BLEU score on the development set.
Evaluation Metrics We consider two metrics as a means of evaluating the effectiveness of our model on the two selective generation subproblems. For content selection, we use the F-1 score of the set of selected records as defined by the harmonic mean of precision and recall with respect to the ground-truth selection record set. We define the set of selected records as consisting of the record with the largest selection weight $\alpha_{tj}$ computed by our aligner at each decoding step $t$.
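The content selection metric can be illustrated with a short sketch: the selected set collects the highest-weighted record at each decoding step, and F-1 is the harmonic mean of precision and recall against the ground-truth set. The function names and the toy alignment matrix are hypothetical.

```python
def selected_records(alpha):
    """alpha[t][j] is the selection weight of record j at decoding step t.

    Returns the set of record indices that receive the largest weight at some step.
    """
    return {max(range(len(row)), key=row.__getitem__) for row in alpha}

def f1_score(selected, gold):
    """Harmonic mean of precision and recall between two sets of record indices."""
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy example: three decoding steps over three records.
alpha = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
]
selected = selected_records(alpha)   # records 0 and 1
gold = {1, 2}
print(f1_score(selected, gold))
```

Because the selected set is built from per-step argmax decisions, a record that is attended to at several steps is still counted only once.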
We evaluate the quality of surface realization using the BLEU score (a 4-gram matching-based precision) (Papineni et al., 2001) of the generated description with respect to the human-created reference. To be comparable to previous results on WEATHERGOV, we also consider a modified BLEU score (cBLEU) that does not penalize numerical deviations of at most five (Angeli et al., 2010) (i.e., to not penalize "low around 58" compared to a reference "low around 60"). On ROBOCUP, we also evaluate the BLEU score in the case that ground-truth content selection is known (sBLEU_G), to be comparable to previous work.
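The numeric tolerance behind cBLEU can be illustrated with a small sketch. It shows only a tolerant n-gram precision, not the full BLEU computation, and it assumes that two tokens match if they are identical or if both are integers differing by at most five; the exact matching rule of Angeli et al. (2010) is not given here, so the function names and the greedy matching strategy are assumptions.

```python
def tokens_match(a, b, tolerance=5):
    """Exact match, or both tokens are integers within the given tolerance."""
    if a == b:
        return True
    try:
        return abs(int(a) - int(b)) <= tolerance
    except ValueError:
        return False

def ngram_precision(candidate, reference, n, tolerance=5):
    """Fraction of candidate n-grams matched by a distinct reference n-gram."""
    cand = [candidate[i:i + n] for i in range(len(candidate) - n + 1)]
    ref = [reference[i:i + n] for i in range(len(reference) - n + 1)]
    if not cand:
        return 0.0
    used = [False] * len(ref)  # each reference n-gram can be matched only once
    hits = 0
    for c in cand:
        for k, r in enumerate(ref):
            if not used[k] and all(tokens_match(x, y, tolerance) for x, y in zip(c, r)):
                used[k] = True
                hits += 1
                break
    return hits / len(cand)

candidate = "low around 58".split()
reference = "low around 60".split()
print(ngram_precision(candidate, reference, 3))               # tolerant matching
print(ngram_precision(candidate, reference, 3, tolerance=0))  # exact matching
```

With the tolerance enabled, the trigram "low around 58" counts as matching the reference; with exact matching it does not.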

Results and Analysis
We analyze the effectiveness of our model on the benchmark WEATHERGOV (as primary) and ROBOCUP (as generalization) datasets. We also present several ablations to illustrate the contributions of the primary model components. We report the performance of content selection and surface realization using F-1 and two BLEU scores (standard sBLEU and the customized cBLEU of Angeli et al. (2010)), respectively (Sec. 5). Table 1 compares our test results against previous methods that include KL12 (Konstas and Lapata, 2012), KL13 (Konstas and Lapata, 2013), and ALK10 (Angeli et al., 2010). Our method achieves the best results reported to-date on all three metrics, with relative improvements of 11.94% (F-1), 58.88% (sBLEU), and 36.68% (cBLEU) over the previous state-of-the-art.

Beam Filter with k-Nearest Neighbors
We considered beam search as an alternative to greedy search in our primary setup (Eqn. 1), but this performs worse, similar to what previous work found on this dataset (Angeli et al., 2010). As an alternative, we consider a beam filter based on a k-nearest neighborhood. See Supplementary Material for details. Table 9 shows that this k-NN beam filter improves results over the primary greedy results.
Aligner Ablation First, we evaluate the contribution of our proposed coarse-to-fine aligner by comparing our model with the basic encoder-aligner-decoder model introduced by Bahdanau et al. (2014). Table 3 reports the results demonstrating that our aligner yields superior F-1 and BLEU scores relative to a standard aligner.
Encoder Ablation Next, we consider the effectiveness of the encoder. Table 4 compares the results with and without the encoder on the development set, and demonstrates that there is a significant gain from encoding the event records using the LSTM-RNN. We attribute this improvement to the LSTM-RNN's ability to capture the relationships that exist among the records, which is known to be essential to selective generation (Barzilay and Lapata, 2005; Angeli et al., 2010).
Output Examples Fig. 3 shows an example record set with its output description and record-word alignment heat map. As shown, our model learns to align records with their corresponding words (e.g., windDir and "southeast," temperature and "71," windSpeed and "wind 10," and gust and "winds could gust as high as 30 mph"). It also learns the subset of salient records to talk about (matching the ground-truth description perfectly for this example, i.e., a standard BLEU of 100.00). We also see some word-level mismatch, e.g., "cloudy" misaligns to id-0 temp and id-10 precipChance, which we attribute to the high correlation between these types of records ("garbage collection" in Liang et al. (2009)).

Word Embeddings
Training our decoder has the effect of learning embeddings for the words in the training set (via the embedding matrix E in Eqn. 4). Here, we explore the extent to which these learned embeddings capture semantic relationships among the training words. Table 10 presents nearest neighbor words for some of the common words from the WEATHERGOV dataset (according to cosine similarity in the embedding space). More details of other embedding approaches that we tried are discussed in the Supplementary Material section.
We use the ROBOCUP dataset to evaluate the domain-independence of our model. The dataset is severely data-starved with only 1000 (approx.) training pairs, which is much smaller than is typically necessary to train RNNs. This results in higher variance in the trained model distributions, and we thus adopt the standard denoising method of ensembles (Sutskever et al., 2014; Vinyals et al., 2015b; Zaremba et al., 2014). Following previous work, we perform two experiments on the ROBOCUP dataset (Table 6), the first considering full selective generation and the second assuming ground-truth content selection at test time. On the former, we obtain a standard BLEU score (sBLEU) of 25.28, which exceeds the best score of 24.88 (Konstas and Lapata, 2012). Additionally, we achieve a selection F-1 score of 81.58, which is also the best result reported to-date. In the case of assumed (known) ground-truth content selection, our model attains an sBLEU_G score of 29.40, which is competitive with the state-of-the-art.

Conclusion
We presented an encoder-aligner-decoder model for selective generation that does not use any specialized features, linguistic resources, or generation templates. Our model employs a bidirectional LSTM-RNN model with a novel coarse-to-fine aligner that jointly learns content selection and surface realization. We evaluate our model on the benchmark WEATHERGOV dataset and achieve state-of-the-art selection and generation results. We achieve further improvements via a k-nearest neighbor beam filter. We also present several model ablations and visualizations to elucidate the effects of the primary components of our model. Moreover, our model generalizes to a different, data-starved domain (ROBOCUP), where it achieves results competitive with or better than the state-of-the-art.

As an alternative, we consider a beam filter based on a k-nearest neighborhood. First, we generate the M-best description candidates (i.e., a beam width of M) for a given input record set (database) using standard beam search. Next, we find the K nearest neighbor database-description pairs from the training data, based on the cosine similarity of each neighbor database with the given input record. We then compute the BLEU score for each of the M description candidates relative to the K nearest neighbor descriptions (as references) and select the candidate with the highest BLEU score. We tune K and M on the development set and report the results in Table 8. Table 9 presents the test results with this tuned setting (M = 2, K = 1), where we achieve BLEU scores better than our primary greedy results.

Training our decoder has the effect of learning embeddings for the words in the training set (via the embedding matrix E in Eqn. 4). Here, we explore the extent to which these learned embeddings capture semantic relationships among the training words. Table 10 presents nearest neighbor words for some of the common words from the WEATHERGOV dataset (according to cosine similarity in the embedding space). We also consider different ways of using pre-trained word embeddings (Mikolov et al., 2013) to bootstrap the quality of our learned embeddings. One approach initializes our embedding matrix with the pre-trained vectors and then refines the embedding based on our training corpus. The second concatenates our learned embedding matrix with the pre-trained vectors in an effort to simultaneously exploit general similarities as well as those learned for the domain. As shown previously for other tasks (Vinyals et al., 2014; Vinyals et al., 2015b), we find that the use of pre-trained embeddings results in negligible improvements (on the development set).
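The k-nearest neighbor beam filter described above can be sketched as follows. The sketch makes two assumptions that the text does not specify: each database is represented by a fixed-length feature vector for the cosine similarity, and a simplified, add-one-smoothed sentence-level BLEU stands in for the actual scorer. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, references, max_n=4):
    """Simplified smoothed BLEU (a stand-in, not the scorer used in the paper)."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        log_precision += math.log((clipped + 1) / (total + 1)) / max_n
    ref_len = min((len(r) for r in references), key=lambda n_ref: abs(n_ref - len(candidate)))
    if len(candidate) >= ref_len:
        brevity = 1.0
    else:
        brevity = math.exp(1 - ref_len / max(len(candidate), 1))
    return brevity * math.exp(log_precision)

def knn_beam_filter(candidates, query_vec, train_pairs, K=1):
    """Pick the beam candidate that best matches the K nearest training descriptions.

    candidates:  M token lists produced by beam search
    train_pairs: (database feature vector, description tokens) pairs
    """
    neighbors = sorted(train_pairs, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)[:K]
    references = [description for _, description in neighbors]
    return max(candidates, key=lambda cand: simple_bleu(cand, references))

# Toy example with M = 2 candidates and K = 1 neighbor.
train_pairs = [
    ([1.0, 0.0], "sunny with a high near 70".split()),
    ([0.0, 1.0], "rain likely before noon".split()),
]
candidates = ["rain likely".split(), "sunny with a high near 68".split()]
best = knn_beam_filter(candidates, [0.9, 0.1], train_pairs, K=1)
print(" ".join(best))
```

The filter never consults the ground-truth description of the input; it only uses training descriptions of similar databases as proxy references for re-ranking the beam.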
Figure 1: Sample database-text pairs chosen from the (a) WEATHERGOV and (b) ROBOCUP datasets.

Figure 2: Our model architecture with a bidirectional LSTM encoder, coarse-to-fine aligner, and decoder.

Table 5: Nearest neighbor word for example words

Table 7: Effect of beam width

Table 10: Nearest neighbor word for example words