An Encoder with non-Sequential Dependency for Neural Data-to-Text Generation

Data-to-text generation aims to generate descriptions from structured input data (e.g., a table with multiple records). Existing neural methods for encoding the input data fall into two categories: a) pooling based encoders, which ignore dependencies between input records, and b) recurrent encoders, which model only sequential dependencies between input records. In our investigation, although the recurrent encoder generally outperforms the pooling based encoder by learning sequential dependencies, it is sensitive to the order of the input records (i.e., performance drops when random shuffling noise is injected into the input data). To overcome this problem, we propose to adopt the self-attention mechanism to learn dependencies between arbitrary input records. Experimental results show that the proposed method achieves comparable results and remains stable under random shuffling of the input data.


Introduction
Data-to-text generation, a classic task of natural language generation, aims to produce a piece of text that adequately and fluently describes its structured input data (e.g., tables) (Kukich, 1983; Reiter and Dale, 1997; Barzilay and Lapata, 2005; Angeli et al., 2010; Kim and Mooney, 2010; Perez-Beltrachini and Gardent, 2017). Traditionally, it is divided into two subtasks: content selection (i.e., what to say) and surface realization (i.e., how to say it) (Reiter and Dale, 1997; Gatt and Krahmer, 2018). Recent neural generation systems ignore the distinction between these two subtasks by using a single encoder-decoder model (Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2015; Mei et al., 2016; Dušek and Jurcicek, 2016; Kiddon et al., 2016; Chisholm et al., 2017). The encoder-decoder architecture first encodes the input data (e.g., a table), which contains a set of records, into a dense representation. Descriptions are then produced from this representation. The appropriate encoding method for the structured input remains an open question. Existing encoding methods for an input table can be decomposed into two stages: 1) converting each record in the table into a record vector, and 2) combining the record vectors, using either a pooling method or a recurrent neural network (RNN), to represent the input table. In this paper, we investigate these two types of neural encoding methods on several data-to-text datasets. The empirical results show that RNN based methods outperform simple pooling methods in terms of BLEU.
The major difference between pooling and RNN based methods is that pooling methods treat the records in the input table independently, while RNN based methods model relationships among the records by treating them as a sequence. In practice, two records in the input data are often related, as illustrated by the example in Table 1.

Method
Neural data-to-text generation is based on the encoder-decoder architecture. As shown in Figure 1, there are multiple choices of table encoder, and this choice affects the generation decoder. We briefly introduce the backbone of the neural generation method in Section 2.1 and then describe the three types of table encoders in Section 2.2.

Base Model
Given a set of records $S = \{r_j\}_{j=1}^{K}$, the goal of data-to-text generation is to produce a description $y = y_1, \ldots, y_T$. The encoder-decoder architecture consists of a table encoder and a recurrent neural network based decoder augmented with attention (Bahdanau et al., 2015) and conditional copy (See et al., 2017) mechanisms. First, each input record $r_j$ is encoded into a hidden vector $h_j$ by a table encoder, which is the focus of this paper; three encoders are introduced in Section 2.2. Then the decoder generates the word $y_t$ at the $t$-th time step conditioned on the previously generated words $y_{<t}$ and the input hidden vectors $H = \{h_j\}_{j=1}^{K}$. Specifically,
$$p(y_t \mid y_{<t}, S) = \mathrm{softmax}\big(W_o\, f([d_t; c_t])\big), \quad (1)$$
where $f(\cdot)$ is a tanh function, $W_o$ is a trainable parameter, and $d_t = \mathrm{LSTM}(d_{t-1}, y_{t-1}, c_{t-1})$ is the hidden state of the decoder at step $t$. $c_t$ in Eq. 1 is the context vector at time step $t$, computed as a weighted sum of the input hidden vectors $h_j$:
$$c_t = \sum_{j=1}^{K} \alpha_{t,j}\, h_j, \quad (2)$$
where the attention weight $\alpha_{t,j}$ is computed with the attention model of Bahdanau et al. (2015).
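To make the decoding step concrete, the following is a minimal PyTorch sketch of one attention decoding step. It omits the conditional copy mechanism, and all module names and dimensions are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """One decoding step of the base model: an LSTM cell plus additive attention.
    A minimal sketch of Eqs. 1-2; the conditional copy mechanism is omitted and
    all module names and dimensions are illustrative assumptions."""

    def __init__(self, emb_dim, hidden_dim, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.attn_hidden = nn.Linear(2 * hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, y_prev_emb, state_prev, c_prev, H):
        # y_prev_emb: (batch, emb_dim)        embedding of y_{t-1}
        # state_prev: (d_{t-1}, cell_{t-1}),  each of shape (batch, hidden_dim)
        # c_prev:     (batch, hidden_dim)     previous context vector c_{t-1}
        # H:          (batch, K, hidden_dim)  encoder hidden vectors h_j
        d_t, cell_t = self.cell(torch.cat([y_prev_emb, c_prev], dim=-1), state_prev)

        # additive attention weights alpha_{t,j} over the K records
        query = d_t.unsqueeze(1).expand_as(H)
        scores = self.attn_score(torch.tanh(self.attn_hidden(torch.cat([query, H], dim=-1))))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)            # (batch, K)

        c_t = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)            # Eq. 2: weighted sum of h_j
        p_y = torch.softmax(self.out(torch.tanh(torch.cat([d_t, c_t], dim=-1))), dim=-1)  # Eq. 1
        return p_y, (d_t, cell_t), c_t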

Table Encoder
Record Vectors: The input table can be viewed as a set of field-value records, where each value is a sequence of words corresponding to a certain field (Liang et al., 2009; Lebret et al., 2016; Yang et al., 2017). For instance, in Table 1, the word "William" belongs to the field "Birth name" and is the first word in this field. Every word in a field is a record $r$, represented as a triple $(r_v, r_f, r_{pos})$, where $r_v$, $r_f$ and $r_{pos}$ refer to the value (e.g., William), the field name (e.g., Birth name), and the relative position within the field (e.g., 0), respectively. We map each record $r \in S$ into a vector $\mathbf{r}$ by concatenating the embeddings of $r_v$, $r_f$ and $r_{pos}$, denoted as $\mathbf{r} = [e_v, e_f, e_{pos}]$, where $e_v$, $e_f$, $e_{pos}$ are trainable embeddings of $r_v$, $r_f$ and $r_{pos}$.
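The record-vector construction can be sketched in PyTorch as a concatenation of three embedding lookups; the class name, vocabulary sizes and dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

class RecordEmbedding(nn.Module):
    """Maps each record (value, field, position) to a vector by concatenating
    three trainable embeddings; vocabulary sizes and dimensions are illustrative."""

    def __init__(self, n_values, n_fields, max_pos, dim):
        super().__init__()
        self.value_emb = nn.Embedding(n_values, dim)   # e_v
        self.field_emb = nn.Embedding(n_fields, dim)   # e_f
        self.pos_emb = nn.Embedding(max_pos, dim)      # e_pos

    def forward(self, values, fields, positions):
        # values, fields, positions: LongTensor of shape (batch, K)
        # returns record vectors r of shape (batch, K, 3 * dim)
        return torch.cat(
            [self.value_emb(values), self.field_emb(fields), self.pos_emb(positions)],
            dim=-1,
        )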
Pooling Based Encoders: The pooling based encoder treats the input records independently. It first applies a feed-forward layer to every record vector $\mathbf{r}_j$, yielding the input hidden vector $h_j = \tanh(W \mathbf{r}_j)$, where $W$ is a trainable parameter. The initial context vector $c_0$ in Eq. 1 is computed by a max-pooling layer over the hidden vectors:
$$c_0 = \mathrm{maxpooling}(h_1, \ldots, h_K). \quad (3)$$
Recurrent Encoders: The recurrent encoder models the dependency among the records by treating the set of record vectors $\mathbf{r}_1, \ldots, \mathbf{r}_K$ as a sequence. The sequence of records is fed into an RNN, yielding a sequence of input hidden vectors $h_1, \ldots, h_K$. Following Mei et al. (2016), we adopt a bidirectional LSTM. The initial context vector is set to the last hidden vector of the sequence, $c_0 = h_K$.
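A hedged PyTorch sketch of these two baseline encoders follows; the class names and dimension handling are assumptions for illustration, not the reference implementation.

import torch
import torch.nn as nn

class MaxPoolEncoder(nn.Module):
    """Pooling based encoder: records are encoded independently and max-pooled (Eq. 3)."""

    def __init__(self, record_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(record_dim, hidden_dim, bias=False)

    def forward(self, records):                 # records: (batch, K, record_dim)
        h = torch.tanh(self.proj(records))      # h_j = tanh(W r_j)
        c0 = h.max(dim=1).values                # element-wise max over the K records
        return h, c0

class RnnEncoder(nn.Module):
    """Recurrent encoder: treats the K records as a sequence (bidirectional LSTM)."""

    def __init__(self, record_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(record_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, records):
        h, _ = self.lstm(records)               # (batch, K, hidden_dim)
        c0 = h[:, -1]                           # last hidden vector as the initial context
        return h, c0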
Self-Attention Encoders: For data-to-text generation, the input records are order invariant: the input data should convey the same information regardless of the order of its records. The input records therefore form a set, whereas the recurrent encoder makes a strong assumption by treating them as a sequence.
Therefore an ideal table encoder has two desired properties: a) it captures relationships among the input records, and b) it is order invariant. The recently proposed self-attention mechanism (Vaswani et al., 2017; Wang et al., 2017) is able to learn interactions between arbitrary records and is therefore invariant to the order of the records. For this purpose, we adapt the multi-layer self-attention mechanism for encoding. Each layer has two sub-layers: one performs multi-head self-attention and the other is a position-wise feed-forward network with layer normalization (Vaswani et al., 2017). Specifically,
$$H^{l} = \mathrm{FFN}\Big(\mathrm{softmax}\Big(\frac{(H^{l-1} Q^{l})(H^{l-1} K^{l})^{\top}}{\sqrt{d_k}}\Big)\, H^{l-1} V^{l}\Big),$$
where $Q^{l}$, $K^{l}$, $V^{l}$ are trainable parameters of layer $l$, $d_k$ is the dimension of the keys, $H^{l-1}$ stacks the hidden vectors of layer $l-1$, and the hidden vectors of the first layer $h^{0}_i$ are the record vectors $\mathbf{r}_i$. To represent the full table, we apply the max-pooling of Eq. 3 to the hidden vectors of the last layer.
For evaluation metrics, we use BLEU-4 (Papineni et al., 2002) to assess the generation quality automatically.
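A minimal PyTorch sketch of such a self-attention encoder is given below, built from torch.nn.MultiheadAttention. The residual connections and the placement of layer normalization follow the standard Transformer layer of Vaswani et al. (2017); the exact arrangement in the paper may differ, so treat this as an approximation rather than the reference implementation.

import torch
import torch.nn as nn

class SelfAttEncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention followed by a position-wise
    feed-forward network, each wrapped with a residual connection and layer norm."""

    def __init__(self, dim, heads, ff_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, h):                       # h: (batch, K, dim); no positional encoding,
        a, _ = self.attn(h, h, h)               # so the encoder is order invariant
        h = self.norm1(h + a)
        return self.norm2(h + self.ff(h))

class SelfAttEncoder(nn.Module):
    def __init__(self, dim, heads=4, layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            SelfAttEncoderLayer(dim, heads, 4 * dim) for _ in range(layers))

    def forward(self, records):                 # records: (batch, K, dim); h^0_i = r_i
        h = records
        for layer in self.layers:
            h = layer(h)
        c0 = h.max(dim=1).values                # max-pooling over the last layer (Eq. 3)
        return h, c0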

Implementation Details
We tune all hyper-parameters according to performance on a held-out validation set. The dimension of the trainable embeddings and of the LSTM hidden units is set to 600. For the multi-layer, multi-head self-attention architecture, 3 layers and 4 attention heads are used. During training, we regularize all layers with dropout 0.1. For optimization, we use Adam with learning rate 0.0002, and gradients are clipped at 1. All experiments use a beam size of 5 during decoding. We use the PyTorch version of OpenNMT (Klein et al., 2017) for implementation.
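For convenience, the hyper-parameters above can be summarized as a plain configuration dictionary; the key names are ad hoc and do not correspond to actual OpenNMT-py options.

# A plain summary of the reported hyper-parameters; key names are illustrative
# and are not actual OpenNMT-py command-line options.
HPARAMS = {
    "embedding_dim": 600,          # trainable embeddings
    "hidden_dim": 600,             # LSTM hidden units
    "self_attention_layers": 3,
    "attention_heads": 4,
    "dropout": 0.1,
    "optimizer": "adam",
    "learning_rate": 2e-4,
    "gradient_clip": 1.0,
    "beam_size": 5,
}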

Performance
The results of the different input encoding methods, along with other competing systems, on the test sets of three datasets are shown in Table 3. We compare the three types of encoders introduced in Section 2.2 (the pooling based encoder MaxEnc, the recurrent encoder RnnEnc, and the self-attention encoder SelfAtt) with the following generation systems: (1) Template replaces words that occur in both the table and the training sentences with a special token reflecting the corresponding field. (2) StructAware (Liu et al., 2018) is a structure-aware encoder-decoder architecture which uses a modified LSTM unit and a specific attention mechanism to incorporate the field (attribute) information.
(3) Slug2Slug is an ensemble neural method that re-ranks several neural outputs during inference.
Table 3 shows that the three encoders achieve comparable results on the E2E dataset, as the input of E2E is relatively short and simple. For WIKIBIO, the simple max-pooling encoder MaxEnc performs worse than the bidirectional LSTM encoder RnnEnc. The proposed SelfAtt, which captures dependencies between arbitrary records, yields better results than MaxEnc and results comparable to RnnEnc. This suggests that modeling the dependencies among input records improves performance when the input is long and complex. More importantly, recurrent encoders capture only sequential dependencies and are therefore sensitive to the order of the input records. To investigate the robustness of the different table encoders, we inject random shuffling noise into the input data. For example, if the original field order of the input is "Birth name; Genres; Occupation; Associated Acts", the order after random shuffling can be "Genres; Birth name; Associated Acts; Occupation". Note that we do not change the order of the content inside a field. We apply such random shuffling noise at both training and testing time. For training, there are two choices: training on the original input data (Original) or on data with random shuffling noise (Shuffle). For testing, the trained model can be applied either to the original input data or to the shuffled version.
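The shuffling noise can be sketched as follows: fields are permuted while the order of records inside each field is preserved. The helper name and the record-triple format are assumptions for illustration.

import random

def shuffle_fields(records, seed=None):
    """Permutes the order of fields in a table while keeping the order of records
    inside each field intact. `records` is a list of (value, field, position)
    triples grouped by field, e.g. [("William", "Birth name", 0), ...]."""
    rng = random.Random(seed)
    groups, current_field = [], object()        # sentinel so the first record opens a group
    for rec in records:
        if rec[1] != current_field:
            groups.append([])
            current_field = rec[1]
        groups[-1].append(rec)
    rng.shuffle(groups)                         # shuffle the fields only
    return [rec for group in groups for rec in group]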
From Table 4, the recurrent encoder RnnEnc degrades when random shuffling noise is applied, while SelfAtt remains stable under shuffling. We further evaluate the encoders on the recently proposed NBA dataset ESPN (Nie et al., 2018), where the average input length is 165.9 and the average field number is 134.2. The results on ESPN are shown in Table 5. SelfAtt has difficulty learning such long range dependencies and performs worse than RnnEnc. When applying a restricted self-attention encoder ReSAtt (Wang et al., 2018), which limits the self-attention mechanism to dependencies within a fixed window size (set to 10 in the experiments), ReSAtt performs comparably to RnnEnc, although this type of method is not order invariant. Handling long range dependencies in a way that is insensitive to the order of the input records is a potential direction for future work.
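One way such a fixed-window restriction could be realized is with an attention mask that only lets each record attend to records within 10 positions. The sketch below is an assumption of how this might be implemented, not the method of Wang et al.; it follows the attn_mask convention of torch.nn.MultiheadAttention, where True entries are disallowed.

import torch

def window_mask(num_records, window=10):
    """Boolean attention mask that only allows each record to attend to records
    within `window` positions; True entries are disallowed, matching the
    attn_mask convention of torch.nn.MultiheadAttention."""
    idx = torch.arange(num_records)
    allowed = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window
    return ~allowed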

Conclusion
In this paper, we analyze several existing encoding methods for neural data-to-text generation. We find that modeling the dependencies among the input records yields better generation results. However, current recurrent table encoders model only sequential dependencies and are therefore sensitive to the input order. We propose a self-attention table encoder that captures these dependencies while remaining stable under input reordering.
In the future, we will analyze the explicit dependencies that lie in the input data and further improve the encoding methods.

Figure 1: Overview of the encoder-decoder architecture with different encoding methods.

Table 1: An example of generating descriptions from the input data.

Table 2: Statistics of the two datasets.

Table 4: Experimental results of different encoding methods trained and tested on different input settings.
