A Mixed Hierarchical Attention Based Encoder-Decoder Approach for Standard Table Summarization

Structured data summarization involves the generation of natural language summaries from structured input data. In this work, we consider summarizing structured data occurring in the form of tables, as they are prevalent across a wide variety of domains. We formulate the standard table summarization problem, which deals with tables conforming to a single predefined schema. To this end, we propose a mixed hierarchical attention based encoder-decoder model which is able to leverage the structure in addition to the content of the tables. Our experiments on the publicly available WEATHERGOV dataset show an improvement of around 18 BLEU (around 30%) over the current state-of-the-art.


Introduction
Abstractive summarization techniques for structured data seek to exploit both the structure and the content of the input data. The type of structure on the input side can be highly varied, ranging from key-value pairs (e.g. WIKIBIO (Lebret et al., 2016)) and source code (Iyer et al., 2016) to ontologies (Androutsopoulos et al., 2014; Colin et al., 2016) and tables (Wiseman et al., 2017), each of which requires a significantly different approach. In this paper, we focus on generating summaries from tabular data. In most practical applications such as finance, healthcare or weather, data in a table are arranged in rows and columns where the schema is known beforehand. However, a change in the actual data values can necessitate drastically different output summaries. The examples in figure 1 show tables with a predefined schema, obtained from the WEATHERGOV dataset (Liang et al., 2009), together with their corresponding weather report summaries. Therefore, the problem that we seek to address in this paper is to generate abstractive summaries of tables conforming to a predefined fixed schema (as opposed to cases where the schema is unknown). We refer to this setting as the standard table summarization problem. Another problem that could be formulated is one in which the output summary is generated from multiple tables, as proposed in a recent challenge (Wiseman et al., 2017); this setting is out of the scope of this paper.
Since the schema is fixed, simple rule based techniques (Konstas and Lapata, 2013) or template based solutions could be employed. However, the space of possible selection choices (which attributes to use in the summary, based on the values they currently take) and generation choices (how to express the selected attributes in natural language) is vast. Such approaches therefore do not scale in terms of the number of templates, as they demand hand-crafted rules for both selection and generation.
We attempt to solve the problem of standard table summarization by leveraging the hierarchical nature of fixed-schema tables. In other words, rows consist of a fixed set of attributes and a table is defined by a set of rows. We cast this problem into a mixed hierarchical attention model following the encode-attend-decode (Cho et al., 2015) paradigm. In this approach, there is static attention on the attributes to compute the row representation, followed by dynamic attention on the rows, the result of which is fed to the decoder. This formulation is theoretically more efficient than the fully dynamic hierarchical attention framework followed by Nallapati et al. (2016). Also, our model does not need sophisticated sampling or sparsifying techniques such as those of Ling and Rush (2017) and Deng et al. (2017), and thus retains differentiability. To demonstrate the efficacy of our approach, we transform the publicly available WEATHERGOV dataset (Liang et al., 2009) into fixed-schema tables, which are then used for our experiments. Our proposed mixed hierarchical attention model provides an improvement of around 18 BLEU (around 30%) over the current state-of-the-art result by Mei et al. (2016).

Tabular Data Summarization
A standard table consists of a set of records (or rows) $R = (r_1, r_2, \ldots, r_T)$, and each record $r$ has a fixed set of attributes (or columns) $A_r = (a_{r1}, a_{r2}, \ldots, a_{rM})$. The tables in figure 1 have 7 columns (apart from 'TYPE') which correspond to different attributes. Also, $U = (u_1, u_2, \ldots, u_T)$ represents the type of each record, where $u_k$ is the one-hot encoding of the record type for record $r_k$. Training data consists of instance pairs $(X_i, Y_i)$ for $i = 1, 2, \ldots, n$, where $X_i = (R_i, U_i)$ represents the input table and $Y_i = (y_1, \ldots, y_{T'})$ represents the corresponding natural language summary. In this paper, we propose an end-to-end model which takes in a table instance $X$ to produce the output summary $Y$. This can be obtained by solving for $Y$ a conditional probability objective (equation 1).
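For orientation, objectives of this kind in encoder-decoder models are commonly written as below. This is a sketch that assumes the usual autoregressive factorization over the summary words, with $T'$ denoting the summary length; it is not necessarily the exact form of equation 1.

```latex
Y^{*} = \arg\max_{Y} \; p(Y \mid X)
      = \arg\max_{Y} \prod_{t=1}^{T'} p\left(y_t \mid y_1, \ldots, y_{t-1}, X\right)
```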

Mixed Hierarchical Attention Model (MHAM)
Our model is based on the encode-attend-decode paradigm as defined by Cho et al. (2015). It consists of an encoder RNN which encodes a variable length input sequence $x = (x_1, \ldots, x_T)$ into a representation sequence $c = (c_1, \ldots, c_T)$. A decoder RNN then generates a sequence of output symbols. As illustrated in figure 2, our encoder is not a single RNN. The encoder has a hierarchical structure to leverage the structural aspect of a standard table: a table consists of a set of records (or rows) and each record consists of values corresponding to a fixed set of attributes. We call it a mixed hierarchical attention based encoder, as it incorporates static attention and dynamic attention at two different levels of the encoder. At the record level, the attention over record representations is dynamic, as it changes with each decoder time step. At the attribute level, in contrast, the schema is fixed, so a record representation can be computed without varying the attention over attributes; thus static attention is used. For example, in the WEATHERGOV dataset, a temperature record will always be defined by attributes such as min, max and mean, irrespective of the decoder time step. So, attention over attributes can be static. On the other hand, while the decoder generates a word, it may prefer to focus on one record type, say temperature, over another, say windSpeed. Thus, dynamic attention is used across records.
Capturing attribute semantics: We learn record type embeddings and use them to calculate attention over attributes. In the trivial case where all records are of the same type, this boils down to having a single record type embedding. Given the attributes $A_r$ of a record $r$, each attribute $a_{ri}$ is encoded into a vector $A_{ri}$ based on the attribute type (discussed further in section 3). Using equation 2 we embed each attribute, where $W_j$ is the embedding matrix for the $j$-th attribute. We embed the record type one-hot vector $u_r$ through $W_0$, and use the result to compute the importance score $I_{rj}$ for attribute $j$ in record $r$ according to equation 3.
Static Attribute attention: Not all attribute values contribute equally to the record. Hence, we introduce attention weights for the attributes of each record. These attention weights are static and do not change with the decoder time step. We calculate the attention probability vector $\alpha_r$ over attributes using the attribute importance vector $I_r$. The attention weights are then used to calculate the record representation $B_r$ for record $r$ using equations 4 and 5.
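The static attribute attention described above can be sketched in NumPy as follows. This is a minimal illustration, not the exact form of equations 2 to 5. In particular, scoring each attribute by a dot product with the record type embedding and normalizing with a softmax are assumptions, and the function and argument names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def record_representation(attr_vectors, attr_embeddings, type_onehot, type_embedding):
    """Static attribute attention for a single record (illustrative sketch).

    attr_vectors:    list of M encoded attribute vectors A_rj
    attr_embeddings: list of M matrices W_j, each of shape (d, len(A_rj))
    type_onehot:     one-hot record type vector u_r, shape (K,)
    type_embedding:  matrix W_0, shape (d, K)
    """
    # Embed every attribute with its own matrix W_j -> shape (M, d).
    embedded = np.stack([W @ a for W, a in zip(attr_embeddings, attr_vectors)])
    # Embed the record type -> shape (d,).
    type_vec = type_embedding @ type_onehot
    # ASSUMPTION: importance is the dot product of attribute and type embeddings.
    importance = embedded @ type_vec
    # Attention probabilities over the M attributes.
    alpha = softmax(importance)
    # Record representation B_r: attention-weighted sum of attribute embeddings.
    return alpha @ embedded, alpha
```

Nothing in this computation depends on the decoder state, so under these assumptions it only needs to run once per record, before decoding starts.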
Record Encoder: A GRU based RNN encoder takes as input a sequence of attribute-attended records $B_{1:N}$ and returns a sequence of hidden states $h_{1:N}$, where $h_r$ is the encoded vector for record $B_r$. We obtain the final record encoding $c_r$ (equation 6) by concatenating the GRU hidden states with the embedded record encodings $B_r$.
Static Record attention: In a table, a subset of record types can always be more salient than other record types. This is captured by learning a static set of weights over all the records. These weights regulate the dynamic attention weights computed during decoding at each time step. Equation 7 performs this step, where $g_r$ is the static record attention weight for the $r$-th record and $q$ and $P$ are weights to be learnt. We do not impose any constraints on the static attention vector.
Dynamic Record attention for Decoder: Our decoder is a GRU based decoder with a dynamic attention mechanism similar to that of Mei et al. (2016), modified to modulate the attention weights at each time step using the static record attentions. At each time step $t$, attention weights are calculated using equations 8, 9 and 10, where $\gamma_{rt}$ is the aggregated attention weight of record $r$ at time step $t$. We use soft attention over the encoder outputs $c_r$ to calculate their weighted average, which is passed to the GRU. The GRU hidden state $s_t$ is used to calculate the output probabilities $p_t$ with a softmax, as described by equations 11, 12 and 13; these are then used to obtain the output word $y_t$.
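The interplay between static and dynamic record attention can be sketched as below. This is a hedged illustration rather than a reproduction of equations 7 to 10. The additive scoring function, the tanh nonlinearities and the elementwise product used to combine static and dynamic weights are all assumptions, and the parameter names `W`, `U` and `v` are hypothetical (only `q` and `P` are named in the text).

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def static_record_weights(C, P, q):
    """Static weights g_r, computed once per table.

    C: record encodings c_r stacked as rows, shape (T, d_c)
    P: weight matrix, shape (p, d_c); q: weight vector, shape (p,)
    ASSUMPTION: g_r = q . tanh(P c_r), left unconstrained.
    """
    return np.tanh(C @ P.T) @ q


def attention_step(C, g, s_prev, W, U, v):
    """One decoder step of record attention modulated by static weights.

    s_prev: previous decoder state, shape (d_s,)
    W: (k, d_s), U: (k, d_c), v: (k,) are hypothetical attention parameters.
    """
    # ASSUMPTION: additive attention scores over the T records.
    scores = np.tanh(s_prev @ W.T + C @ U.T) @ v
    dynamic = softmax(scores)
    # ASSUMPTION: static weights modulate dynamic ones by elementwise product.
    gamma = g * dynamic
    # Context vector passed on to the decoder GRU.
    context = gamma @ C
    return context, gamma
```

In this sketch `static_record_weights` is called once per table, while `attention_step` is called at every decoder time step with the updated state.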
Due to the static attention at the attribute level, the time complexity of a single pass is $O(TM + TT')$, where $T$ is the number of records, $M$ is the number of attributes and $T'$ is the number of decoder steps. The attribute-level attention is computed once per record (the $TM$ term), while the record-level attention is recomputed at every decoder step (the $TT'$ term). With dynamic attention at both levels (as in Nallapati et al. (2016)), the time complexity is much higher, $O(TMT')$. Thus, the mixed hierarchical attention model is faster than fully dynamic hierarchical attention.
For a better understanding of the contribution of hierarchical attention (MHAM), we propose a simpler non-hierarchical model (NHM) with attention only at the record level. In NHM, $B_r$ is calculated by concatenating all the record attributes along with the corresponding record type.

Table 1: Example reference summaries and the corresponding NHM and MHAM outputs

Reference: Periods of rain and possibly a thunderstorm . Some of the storms could produce heavy rain . Temperature rising to near 51 by 10am , then falling to around 44 during the remainder of the day . Breezy , with a north northwest wind between 10 and 20 mph . Chance of precipitation is 90 % . New rainfall amounts between one and two inches possible .
NHM: Periods of rain and possibly a thunderstorm . Some of the storms could produce heavy rain . Temperature rising to near 51 by 8am , then falling to around 6 during the remainder of the day . Breezy , with a north northwest wind 10 to 15 mph increasing to between 20 and 25 mph . Chance of precipitation is 90 % . New rainfall amounts between one and two inches possible .
MHAM: Periods of rain and possibly a thunderstorm . Some of the storms could produce heavy rain . Temperature rising to near 51 by 8am , then falling to around 44 during the remainder of the day . Breezy , with a north northwest wind between 10 and 20 mph . Chance of precipitation is 90 % . New rainfall amounts between one and two inches possible .

Reference: Rain and snow likely , becoming all snow after 8pm . Cloudy , with a low around 22 . South southwest wind around 15 mph . Chance of precipitation is 60% . New snow accumulation of less than one inch possible .
NHM: Rain or freezing rain likely before 8pm , then snow after 11pm , snow showers and sleet likely before 8pm , then a chance of rain or freezing rain after 3am . Mostly cloudy , with a low around 27 . South southeast wind between 15 and 17 mph . Chance of precipitation is 80% . Little or no ice accumulation expected . Little or no snow accumulation expected .
MHAM: Snow , and freezing rain , snow after 9pm . Cloudy , with a steady temperature around 23 . Breezy , with a south wind between 15 and 20 mph . Chance of precipitation is 60% . New snow accumulation of around an inch possible .
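As a rough illustration of the complexity comparison between mixed and fully dynamic hierarchical attention made in the model description, the following sketch counts attention score computations for a single pass. The function names are hypothetical, and the sizes in the usage note are made-up values chosen only to show the gap, not statistics of WEATHERGOV.

```python
def mixed_attention_cost(num_records, num_attributes, num_decoder_steps):
    """Score computations with static attribute and dynamic record attention.

    Attribute attention is computed once per record (T * M); record attention
    is recomputed at every decoder step (T * T').
    """
    return num_records * num_attributes + num_records * num_decoder_steps


def fully_dynamic_attention_cost(num_records, num_attributes, num_decoder_steps):
    """Score computations when attribute attention is also recomputed per step."""
    return num_records * num_attributes * num_decoder_steps
```

For example, with 10 records, 7 attributes and 30 decoder steps (hypothetical sizes), the mixed scheme needs 10 * 7 + 10 * 30 = 370 score computations, against 10 * 7 * 30 = 2100 for the fully dynamic one.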

Experiments
Dataset and methodology: To evaluate our model we used the WEATHERGOV dataset (Liang et al., 2009), which is the standard benchmark dataset for evaluating tabular data summarization techniques.
We compared the performance of our model against the state-of-the-art work of MBW (Mei et al., 2016), as well as two other baseline models, KL (Konstas and Lapata, 2013) and ALK (Angeli et al., 2010).

Training and hyperparameter tuning
We used TensorFlow (Abadi et al., 2015) for our experiments. Encoder embeddings were initialized with values drawn from a uniform distribution in the range [-1, 1). Other variables were initialized using Glorot uniform initialization (Glorot and Bengio, 2010). We tuned each hyperparameter by choosing its value from a range of values, and selected the model with the best sBleu score on the validation set over 500 epochs. We did not use any regularization while training the model. Hyperparameter tuning was performed separately for MHAM and NHM to give both models a fair chance. For both models, the Adam optimizer (Kingma and Ba, 2014) was used with the learning rate set to 0.0001. We found an embedding size of 100, a GRU size of 400 and a static record attention size (P) of 150 to work best for the MHAM model. We also experimented with a bi-directional GRU in the encoder, but observed no significant boost in the BLEU scores.
Evaluation metrics: To evaluate our models we employed BLEU and Rouge-L scores. In addition to the standard BLEU (sBleu) (Papineni et al., 2002), a customized BLEU (cBleu) (Mei et al., 2016) is also reported. cBleu does not penalize numbers which differ by at most five; hence 20 and 18 are considered the same.
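The relaxed number matching behind cBleu can be illustrated with a small token comparison helper. This is only a sketch of the tolerance rule as stated above, not the evaluation script of Mei et al. (2016). The function name is hypothetical, and handling only non-negative integer tokens is an assumption.

```python
def tokens_match(candidate, reference, tolerance=5):
    """Return True if two tokens count as equal under the relaxed rule.

    Numeric tokens match when they differ by at most `tolerance`;
    all other tokens must be identical.
    ASSUMPTION: only non-negative integer tokens such as "20" are numeric.
    """
    if candidate.isdecimal() and reference.isdecimal():
        return abs(int(candidate) - int(reference)) <= tolerance
    return candidate == reference
```

With this rule, `tokens_match("20", "18")` holds while `tokens_match("20", "26")` does not, and non-numeric tokens are compared exactly as in standard BLEU.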

Results and Analyses
Table 2 describes the results of our proposed models (MHAM and NHM) along with the aforementioned baseline models. We observe a significant performance improvement of 16.6 cBleu points (24%) and 18.3 sBleu points (30%) compared to the current state-of-the-art model of MBW. MHAM also shows an improvement over NHM in all metrics, demonstrating the importance of hierarchical attention.
Attention analysis: Analysis of figure 3 reveals that the learnt attention weights are reasonable. For example, as shown in figure 3(a), for the phrase 'with a high near 52', the model had high attention on temperature before and while generating the number '52'. Similarly, while generating 'mostly cloudy', the model had high attention on precipitation potential. Attribute attentions are also learned as expected (figure 3(b)). The temperature, wind speed and gust records have high weights on the min/max/mean values which describe these records.
Qualitative analysis: Table 1 contains example table-summary pairs, with summaries generated by the proposed hierarchical and non-hierarchical versions. We observe that our model is able to generate numbers more accurately by enabling hierarchical attention. Our model is also able to capture weak signals like snow accumulation. Further, our proposed model MHAM is able to avoid repetition, unlike NHM.

Conclusion and Future Work
In this work, we have formulated the problem of standard table summarization, where all the tables come from a predefined schema. Towards this, we proposed a novel mixed hierarchical attention based encoder-decoder approach. Our experiments on the publicly available WEATHERGOV benchmark dataset have shown significant improvements over the current state-of-the-art work. Moreover, the proposed method is theoretically more efficient than the current fully dynamic hierarchical attention model. As future work, we propose to tackle general tabular summarization, where the schema can vary across tables in the whole dataset.

Figure 1: Standard Table Summarization with fixed schema tables as input

Reference: A chance of rain and snow . Snow level 5500 feet . Mostly cloudy , with a low around 31 . Calm wind becoming north northeast around 6 mph . Chance of precipitation is 40% .
NHM: A chance of rain and snow . Mostly cloudy , with a low around 31 . North northwest wind at 6 mph becoming east southeast . Chance of precipitation is 40% .
MHAM: A chance of rain and snow . Snow level 5800 feet lowering to 5300 feet after midnight . Mostly cloudy , with a low around 31 . North northwest wind at 6 mph becoming south southwest . Chance of precipitation is 40% .

Table 2: Overall results