Generating Descriptions from Structured Data Using a Bifocal Attention Mechanism and Gated Orthogonalization

In this work, we focus on the task of generating natural language descriptions from a structured table of facts containing fields (such as nationality, occupation, etc.) and values (such as Indian, {actor, director}, etc.). One simple choice is to treat the table as a sequence of fields and values and then use a standard seq2seq model for this task. However, such a model is too generic and does not exploit task-specific characteristics. For example, while generating descriptions from a table, a human would attend to information at two levels: (i) the fields (macro level) and (ii) the values within the field (micro level). Further, a human would continue attending to a field for a few timesteps till all the information from that field has been rendered and then never return back to this field (because there is nothing left to say about it). To capture this behavior we use (i) a fused bifocal attention mechanism which exploits and combines this micro and macro level information and (ii) a gated orthogonalization mechanism which tries to ensure that a field is remembered for a few time steps and then forgotten. We experiment with a recently released dataset which contains fact tables about people and their corresponding one line biographical descriptions in English. In addition, we also introduce two similar datasets for French and German. Our experiments show that the proposed model gives a 21% relative improvement over a recently proposed state of the art method and a 10% relative improvement over basic seq2seq models. The code and the datasets developed as a part of this work are publicly available at https://github.com/PrekshaNema25/StructuredData_To_Descriptions


Introduction
Rendering natural language descriptions from structured data is required in a wide variety of commercial applications such as generating descriptions of products, hotels, furniture, etc., from a corresponding table of facts about the entity. Such a table typically contains {field, value} pairs where the field is a property of the entity (e.g., color) and the value is a set of possible assignments to this property (e.g., color = red). Another example of this is the recently introduced task of generating one line biography descriptions from a given Wikipedia infobox (Lebret et al., 2016). The Wikipedia infobox serves as a table of facts about a person and the first sentence from the corresponding article serves as a one line description of the person. Figure 1 illustrates an example input infobox which contains fields such as Born, Residence, Nationality, Fields, Institutions and Alma Mater. Each field further contains some words (e.g., particle physics, many-body theory, etc.). The corresponding description is coherent with the information contained in the infobox.
Note that the number of fields in the infobox and the ordering of the fields within the infobox varies from person to person. Given the large size (700K examples) and heterogeneous nature of the dataset, which contains biographies of people from different backgrounds (sports, politics, arts, etc.), it is hard to come up with simple rule-based templates for generating natural language descriptions from infoboxes, thereby making a case for data-driven models. Based on the recent success of data-driven neural models for various other NLG tasks (Bahdanau et al., 2014; Rush et al., 2015; Yao et al., 2015; Chopra et al., 2016; Nema et al., 2017), one simple choice is to treat the infobox as a sequence of {field, value} pairs and use a standard seq2seq model for this task. However, such a model is too generic and does not exploit the specific characteristics of this task as explained below.
First, note that while generating such descriptions from structured data, a human keeps track of information at two levels. Specifically, at a macro level, she would first decide which field to mention next and then at a micro level decide which of the values in the field needs to be mentioned next. For example, she first decides that at the current step the field occupation needs attention and then decides which is the next appropriate occupation to attend to from the set of occupations (actor, director, producer, etc.). To enable this, we use a bifocal attention mechanism which computes an attention over fields at a macro level and over values at a micro level. We then fuse these attention weights such that the attention weight for a field also influences the attention over the values within it. Finally, we feed a fused context vector to the decoder which contains both field level and word level information. Note that such two-level attention mechanisms (Nallapati et al., 2016; Yang et al., 2016; Serban et al., 2016) have been used in the context of unstructured data (as opposed to structured data in our case), where at a macro level one needs to pay attention to sentences and at a micro level to words in the sentences.
Next, we observe that while rendering the output, once the model pays attention to a field (say, occupation) it needs to stay on this field for a few timesteps (till all the occupations are produced in the output). We refer to this as the stay on behavior. Further, we note that once the tokens of a field are referred to, they are usually not referred to later. For example, once all the occupations have been listed in the output we will never visit the occupation field again because there is nothing left to say about it. We refer to this as the never look back behavior. To model the stay on behavior, we introduce a forget (or remember) gate which acts as a signal to decide when to forget the current field (or equivalently to decide till when to remember the current field). To model the never look back behavior we introduce a gated orthogonalization mechanism which ensures that once a field is forgotten, subsequent field context vectors fed to the decoder are orthogonal to (or different from) the previous field context vectors.
We experiment with the WIKIBIO dataset (Lebret et al., 2016) which contains around 700K {infobox, description} pairs and has a vocabulary of around 400K words. We show that the proposed model gives a relative improvement of 21% and 20% as compared to current state of the art models (Lebret et al., 2016; Mei et al., 2016) on this dataset. The proposed model also gives a relative improvement of 10% as compared to the basic seq2seq model. Further, we introduce new datasets for French and German on the same lines as the English WIKIBIO dataset. Even on these two datasets, our model outperforms the state of the art methods mentioned above.

Related work
Natural Language Generation has always been of interest to the research community and has received a lot of attention in the past. The approaches for NLG range from (i) rule based approaches (e.g., (Dale et al., 2003; Reiter et al., 2005; Green, 2006; Galanis and Androutsopoulos, 2007; Turner et al., 2010)), (ii) modular statistical approaches which divide the process into three phases (planning, selection and surface realization) and use data driven approaches for one or more of these phases (Barzilay and Lapata, 2005; Belz, 2008; Angeli et al., 2010; Kim and Mooney, 2010; Konstas and Lapata, 2013), (iii) hybrid approaches which rely on a combination of handcrafted rules and corpus statistics (Langkilde and Knight, 1998; Soricut and Marcu, 2006; Mairesse and Walker, 2011), to (iv) the more recent neural network based models (Bahdanau et al., 2014).
Neural models for NLG have been proposed in the context of various tasks such as machine translation (Bahdanau et al., 2014), document summarization (Rush et al., 2015; Chopra et al., 2016), paraphrase generation (Prakash et al., 2016), image captioning (Xu et al., 2015), video summarization (Venugopalan et al., 2014), query based document summarization (Nema et al., 2017) and so on. Most of these models are data hungry and are trained on large amounts of data. On the other hand, NLG from structured data has largely been studied in the context of small datasets such as WEATHERGOV (Liang et al., 2009), ROBOCUP (Chen and Mooney, 2008), NFL RECAPS (Barzilay and Lapata, 2005), PRODIGY-METEO (Belz and Kow, 2009) and the TUNA Challenge (Gatt and Belz, 2010). Recently Mei et al. (2016) proposed RNN/LSTM based neural encoder-decoder models with attention for the WEATHERGOV and ROBOCUP datasets.
Unlike the datasets mentioned above, the biography dataset introduced by Lebret et al. (2016) is larger (700K {table, description} pairs) and has a much larger vocabulary (400K words as opposed to around 350 or fewer words in the above datasets). Further, unlike the feed-forward neural network based model proposed by Lebret et al. (2016), we use a sequence to sequence model and introduce components to address the peculiar characteristics of the task. Specifically, we introduce neural components to address the need for attention at two levels and to address the stay on and never look back behavior required by the decoder. Kiddon et al. (2016) have explored the use of checklists to track previously visited ingredients while generating recipes from ingredients. Note that two-level attention mechanisms have also been used in the context of summarization (Nallapati et al., 2016), document classification (Yang et al., 2016), dialog systems (Serban et al., 2016), etc. However, these works deal with unstructured data (sentences at the higher level and words at a lower level) as opposed to structured data in our case.

Proposed model
As input we are given an infobox I = {(g_i, k_i)}_{i=1}^{M}, which is a set of pairs (g_i, k_i) where g_i corresponds to field names, k_i is the sequence of corresponding values and M is the total number of fields in I. For example, (g = occupation, k = {actor, writer, director}) could be one such pair in this set. Given such an input, the task is to generate a description y = (y_1, y_2, ..., y_m) containing m words. A simple solution is to treat the infobox as a sequence of fields followed by the values corresponding to the field, in the order of their appearance in the infobox. For example, the infobox could be flattened to produce the following input sequence (the words in bold are field names which act as delimiters): [Name] John Doe [Birth Date] 19 March 1981 [Nationality] Indian .....
The problem can then be cast as a seq2seq generation problem and can be modeled using a standard neural architecture comprising three components: (i) an input encoder (using GRU/LSTM cells), (ii) an attention mechanism to attend to important values in the input sequence at each time step and (iii) a decoder to decode the output one word at a time (again, using GRU/LSTM cells). However, this standard model is too generic and does not exploit the specific characteristics of this task. We propose additional components, viz., (i) a fused bifocal attention mechanism which operates on fields (macro) and values (micro) and (ii) a gated orthogonalization mechanism to model stay on and never look back behavior.
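As an illustration, the flattening step described above can be sketched as follows (a minimal sketch; the function name and the example infobox are our own):

```python
def flatten_infobox(infobox):
    """Flatten a list of (field, values) pairs into one token sequence.

    Each field name is wrapped in brackets so that it acts as a
    delimiter between the value tokens, as described above.
    """
    tokens = []
    for field, values in infobox:
        tokens.append("[" + field + "]")
        tokens.extend(values)
    return tokens

infobox = [
    ("Name", ["John", "Doe"]),
    ("Birth Date", ["19", "March", "1981"]),
    ("Nationality", ["Indian"]),
]
print(flatten_infobox(infobox))
```

The resulting token sequence is what the encoder of the seq2seq model would consume.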

Fused Bifocal Attention Mechanism
Intuitively, when a human writes a description from a table she keeps track of information at two levels. At the macro level, it is important to decide which is the appropriate field to attend to next, and at a micro level (i.e., within a field) it is important to know which values to attend to next. To capture this behavior, we use a bifocal attention mechanism as described below.

Macro Attention: Consider the i-th field g_i which has values k_i = (w_1, w_2, ..., w_p). Let h^g_i be the representation of this field in the infobox. This representation can either be (i) the word embedding of the field name or (ii) some function f of the values in the field or (iii) a concatenation of (i) and (ii). The function f could simply be the sum or average of the embeddings of the values in the field. Alternately, this function could be a GRU (or LSTM) which treats these values within a field as a sequence and computes the field representation as the final representation of this sequence (i.e., the representation of the last timestep). We found that a bidirectional GRU is a better choice for f. Once we have computed such a representation h^g_i for all the M fields, we compute an attention over the fields (macro level):

α^g_{t,i} = softmax_i(v_g^T tanh(U_g s_{t−1} + V_g h^g_i))    (1)
c^g_t = Σ_{i=1}^{M} α^g_{t,i} h^g_i    (2)
where s_{t−1} is the state of the decoder at time step t−1, U_g, V_g and v_g are parameters, M is the total number of fields in the input, and c^g_t is the macro (field level) context vector at the t-th time step of the decoder.

Micro Attention: Let h^w_j be the representation of the j-th value in a given field. This representation could again either be (i) simply the embedding of this value or (ii) a contextual representation computed using a function f which also considers the other values in the field. For example, if (w_1, w_2, ..., w_p) are the values in a field then these values can be treated as a sequence and the representation of the j-th value can be computed using a bidirectional GRU over this sequence. Once again, we found that using a bi-GRU works better than simply using the embedding of the value. Once we have such a representation computed for all values across all the fields, we compute the attention over these values (micro level) as shown below:

α^w_{t,j} = softmax_j(v_w^T tanh(U_w s_{t−1} + V_w h^w_j))    (3)

where s_{t−1} is the state of the decoder at time step t−1, U_w, V_w and v_w are parameters, and W is the total number of values across all the fields.

Fused Attention: Intuitively, the attention weights assigned to a field should have an influence on all the values belonging to that particular field. To ensure this, we reweigh the micro level attention weights based on the corresponding macro level attention weights. In other words, we fuse the attention weights at the two levels as:

α′_{t,j} = (α^w_{t,j} · α^g_{t,F(j)}) / Σ_{j′=1}^{W} α^w_{t,j′} · α^g_{t,F(j′)}    (4)
c^w_t = Σ_{j=1}^{W} α′_{t,j} h^w_j    (5)

where F(j) is the field corresponding to the j-th value and c^w_t is the micro level context vector.
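The fusion step can be illustrated with a small sketch (plain Python; the function names are our own, and we assume the fused weights are renormalized so that they sum to one):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fused_attention(macro_scores, micro_scores, field_of):
    """Fuse macro (field-level) and micro (value-level) attention.

    macro_scores: unnormalized scores, one per field
    micro_scores: unnormalized scores, one per value (across all fields)
    field_of[j]:  index of the field the j-th value belongs to, i.e. F(j)

    Each micro weight is reweighed by the macro weight of its field and
    the result is renormalized, so attention on a field boosts all the
    values inside it.
    """
    a_g = softmax(macro_scores)   # macro attention over fields
    a_w = softmax(micro_scores)   # micro attention over values
    fused = [a_w[j] * a_g[field_of[j]] for j in range(len(a_w))]
    z = sum(fused)
    return [f / z for f in fused]
```

For example, with two fields where the first receives a higher macro score, the fused weights of values inside the first field are boosted relative to values in the second, even when the raw micro scores are equal.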

Gated Orthogonalization for Modeling Stay-On and Never Look Back behaviour
We now describe a series of choices made to model stay-on and never look back behavior. We first begin with the stay-on property, which essentially implies that if we have paid attention to field i at timestep t then we are likely to pay attention to the same field for a few more time steps. For example, if we are focusing on the occupation field at this timestep then we are likely to focus on it for the next few timesteps till all relevant values in this field have been included in the generated description. In other words, we want to remember the field context vector c^g_t for a few timesteps. One way of ensuring this is to use a remember (or forget) gate as given below, which remembers the previous context vector when required and forgets it when it is time to move on from that field.
f_t = σ(W^t_f s_{t−1} + W^g_f c^g_t + b_f)    (6)
c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ c^g_t    (7)

where W^t_f, W^g_f, b_f are parameters to be learned and ⊙ denotes elementwise multiplication. The job of the forget gate is to ensure that c_t is similar to c_{t−1} when required (i.e., by learning f_t → 1 when we want to continue focusing on the same field) and different when it is time to move on (by learning f_t → 0).
Next, the never look back property implies that once we have moved away from a field we are unlikely to pay attention to it again. For example, once we have rendered all the occupations in the generated description there is no need to return back to the occupation field. In other words, once we have moved on (f_t → 0), we want the successive field context vectors c^g_t to be very different from the previous field vectors c_{t−1}. One way of ensuring this is to orthogonalize successive field vectors using

c^g_t = c^g_t − γ_t (⟨c^g_t, c_{t−1}⟩ / ⟨c_{t−1}, c_{t−1}⟩) c_{t−1}    (8)

where ⟨a, b⟩ is the dot product between vectors a and b. The above equation essentially subtracts the component of c^g_t along c_{t−1}. γ_t is a learned parameter which controls the degree of orthogonalization, thereby allowing a soft orthogonalization (i.e., the entire component along c_{t−1} is not subtracted but only a fraction of it). The above equation only ensures that c^g_t is soft-orthogonal to c_{t−1}. Alternately, we could pass the sequence of context vectors c_1, c_2, ..., c_t generated so far through a GRU cell. The state of this GRU cell at each time step would thus be aware of the history of the field vectors till that timestep. Now, instead of orthogonalizing c^g_t to c_{t−1}, we could orthogonalize c^g_t to the hidden state of this GRU at time step t−1. In practice, we found this to work better as it accounts for all the field vectors in the history instead of only the previous field vector.
In summary, Equation 7 provides a mechanism for remembering the current field vector when appropriate (thus capturing stay-on behavior) using a remember gate. On the other hand, Equation 8 explicitly ensures that the field vector is very different (soft-orthogonal) from the previous field vectors once it is time to move on (thus capturing never look back behavior). The value of c^g_t computed in Equation 8 is then used in Equation 7. The c_t (macro) thus obtained is then concatenated with c^w_t (micro) and fed to the decoder (see Fig. 2).
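A minimal sketch of one decoding step of the gated orthogonalization mechanism (the function name is our own; we assume the remember gate blends the previous and the orthogonalized field contexts as a convex combination, and we use plain Python lists in place of learned tensors, with the gate and γ given as scalars):

```python
def ortho_then_gate(c_g_t, c_prev, gamma_t, f_t):
    """One step of gated orthogonalization.

    First soft-orthogonalize the new field context c_g_t against the
    previous context c_prev (subtract gamma_t times its component along
    c_prev), then blend with the previous context through the remember
    gate f_t: f_t near 1 keeps the old field context (stay-on), f_t
    near 0 switches to the orthogonalized new one (never look back).
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    coeff = gamma_t * dot(c_g_t, c_prev) / dot(c_prev, c_prev)
    c_hat = [x - coeff * p for x, p in zip(c_g_t, c_prev)]      # soft orthogonalization
    c_t = [f_t * p + (1.0 - f_t) * h for p, h in zip(c_prev, c_hat)]  # remember gate
    return c_t
```

With gamma_t = 1 and f_t = 0 the returned context is exactly orthogonal to the previous one; with f_t = 1 it simply reproduces the previous context.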

Experimental setup
We now describe our experimental setup:

Datasets
We use the WIKIBIO dataset introduced by Lebret et al. (2016). It consists of 728,321 biography articles from English Wikipedia. A biography article corresponds to a person (sportsman, politician, historical figure, actor, etc.). Each Wikipedia article has an accompanying infobox which serves as the structured input, and the task is to generate the first sentence of the article (which typically is a one-line description of the person). We used the same train, valid and test sets which were made publicly available by Lebret et al. (2016).
We also introduce two new biography datasets, one in French and one in German. These datasets were created and pre-processed using the same procedure as outlined in Lebret et al. (2016). Specifically, we extracted the infoboxes and the first sentence from the corresponding Wikipedia articles. As with the English dataset, we split the French and German datasets randomly into train (80%), test (10%) and valid (10%). The French and German datasets extracted by us have been made publicly available. The number of examples was 170K and 50K and the vocabulary size was 297K and 143K for French and German respectively. Although in this work we focus only on generating descriptions in one language, we hope that this dataset will also be useful for developing models which jointly learn to generate descriptions from structured data in multiple languages.
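The random 80/10/10 split can be sketched as follows (the function name and the fixed seed are our own choices):

```python
import random

def split_dataset(examples, seed=0):
    """Randomly split examples into 80% train, 10% valid, 10% test,
    mirroring the split described above (the seed is our own choice)."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the input list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_valid = int(0.1 * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Fixing the seed makes the split reproducible across runs, which matters when the released splits are meant to be reused by others.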

Models compared
We compare with the following models:

1. (Lebret et al., 2016): This is a conditional language model which uses a feed-forward neural network to predict the next word in the description, conditioned on local characteristics of the table.

2. (Mei et al., 2016): This is an RNN/LSTM based encoder-decoder model with attention, originally proposed for the WEATHERGOV and ROBOCUP datasets.

3. Basic Seq2Seq: This is a standard sequence to sequence model with attention (Bahdanau et al., 2014).

Table 1 :
Comparison of different models on the English WIKIBIO dataset (BLEU-4 / NIST-4 / ROUGE-4); for instance, (Lebret et al., 2016) obtains 34.70 / 7.98 / 25.80.

Further, to deal with the large vocabulary (∼400K words) we use a copying mechanism as a post-processing step. Specifically, we identify the time steps at which the decoder produces unknown words (denoted by the special symbol UNK). For each such time step, we look at the attention weights on the input words and replace the UNK word by the input word which has received maximum attention at this timestep. This process is similar to the one described in (Luong et al., 2015). Note that Lebret et al. (2016) also have a copying mechanism tightly integrated with their model.
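The UNK replacement step described above can be sketched as follows (function and variable names are our own):

```python
def replace_unks(output_tokens, attention, input_tokens, unk="UNK"):
    """Post-processing copy mechanism: replace each UNK in the decoded
    output with the input token that received the highest attention
    weight at that decoding step.

    attention[t] is the list of attention weights over input_tokens at
    output time step t.
    """
    result = []
    for t, tok in enumerate(output_tokens):
        if tok == unk:
            # index of the input token with the maximum attention at step t
            j = max(range(len(input_tokens)), key=lambda i: attention[t][i])
            result.append(input_tokens[j])
        else:
            result.append(tok)
    return result
```

Because this runs purely as post-processing, it requires no change to the trained model, only access to the attention weights produced at decoding time.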

Hyperparameter tuning
We tuned the hyperparameters of all the models using a validation set. As mentioned earlier, we used a bidirectional GRU cell as the function f for computing the representation of the fields and the values (see Section 3.1). For all the models, we experimented with GRU state sizes of 128, 256 and 512. The total number of unique words in the corpus is around 400K (this includes the words in the infobox and the descriptions). Of these, we retained only the top 20K words in our vocabulary (same as (Lebret et al., 2016)). We initialized the embeddings of these words with 300 dimensional GloVe embeddings (Pennington et al., 2014). We used Adam (Kingma and Ba, 2014) with a learning rate of 0.0004, β_1 = 0.9 and β_2 = 0.999. We trained the model for a maximum of 20 epochs and used early stopping with the patience set to 5 epochs.
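The early stopping criterion can be sketched as follows (a simplified sketch; the function name and callback interface are our own, and one call to evaluate stands in for one epoch of training plus validation):

```python
def train_with_early_stopping(evaluate, max_epochs=20, patience=5):
    """Early-stopping loop matching the setup above: stop when the
    validation score has not improved for `patience` consecutive epochs.

    evaluate(epoch) returns the validation score after that epoch.
    Returns the best score and the epoch at which it was achieved.
    """
    best, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = evaluate(epoch)
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop training
    return best, best_epoch
```

In the actual setup the score would be a validation metric such as BLEU, and the best checkpoint (rather than the last one) would be kept for testing.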

Results and Discussions
We now discuss the results of our experiments.

Comparison of different models
Following Lebret et al. (2016), we used BLEU-4, NIST-4 and ROUGE-4 as the evaluation metrics. We first make a few observations based on the results on the English dataset (Table 1). The basic seq2seq model, as well as the model proposed by Mei et al. (2016), perform better than the model proposed by Lebret et al. (2016). Our final model with bifocal attention and gated orthogonalization gives the best performance and does 10% (relative) better than the closest baseline (basic seq2seq) and 21% (relative) better than the current state of the art method (Lebret et al., 2016). In Table 2, we show some qualitative examples of the output generated by different models.

Human Evaluations
To make a qualitative assessment of the generated sentences, we conducted a human study on a sample of 500 infoboxes from the English dataset. The annotators for this task were undergraduate and graduate students. For each of these infoboxes, we generated summaries using the basic seq2seq model and our final model with bifocal attention and gated orthogonalization.
For each description and for each model, we asked three annotators to rank the output of the systems based on (i) adequacy (i.e., does it capture relevant information from the infobox), (ii) fluency (i.e., grammar) and (iii) relative preference (i.e., which of the two outputs would be preferred). Overall, the average fluency/adequacy (on a scale of 5) was 4.04/3.6 for the basic seq2seq model and 4.19/3.9 for our model. The results from Table 3 suggest that in general the gated orthogonalization model performs better than the basic seq2seq model. Additionally, annotators were asked to verify if the generated summaries look natural (i.e., as if they were generated by humans). In 423 out of 500 cases, the annotators said "Yes", suggesting that the gated orthogonalization model indeed produces good descriptions.

Performance on different languages
The results on the French and German datasets are summarized in Tables 4 and 5 respectively. Note that the code of (Lebret et al., 2016) is not publicly available, hence we could not report numbers for this model on the French and German datasets.

Table 5 :
Comparison of different models on the German WIKIBIO dataset

Visualizing Attention Weights
If the proposed model indeed works well, then we should see attention weights that are consistent with the stay on and never look back behavior.

To verify this, we plotted the attention weights in cases where the model with gated orthogonalization does better than the model with only bifocal attention. Figure 3 shows the attention weights corresponding to the infobox in Figure 4. Notice that the model without gated orthogonalization has attention on both the name field and the article title while rendering the name. The model with gated orthogonalization, on the other hand, stays on the name field while rendering the name. Due to lack of space, we do not show similar plots for French and German, but we would like to mention that, in general, the differences between the attention weights learned by the models with and without gated orthogonalization were more pronounced for the French/German datasets than for the English dataset. This is in agreement with the results reported in Tables 4 and 5, where the improvements given by gated orthogonalization are larger for French/German than for English.

Out of domain results
What if the model sees a different type of person at test time? For example, what if the training data does not contain any sportspersons but at test time we encounter the infobox of a sportsperson? This is the same as seeing out-of-domain data at test time. Such a situation is quite expected in the products domain where new products with new features (fields) get frequently added to the catalog. We were interested in three questions here. First, we wanted to see if testing the model on out-of-domain data indeed leads to a drop in the performance. For this, we compared the performance of our best model in two scenarios: (i) trained on data from all domains (including the target domain) and tested on the target domain (sports, arts) and (ii) trained on data from all domains except the target domain and tested on the target domain.
Comparing rows 1 and 2 of Table 6, we observed a significant drop in the performance. Note that the numbers for the sports domain in row 1 are much better than for the arts domain because roughly 40% of the WIKIBIO training data contains sportspersons. Next, we wanted to see if we can use a small amount of data from the target domain to fine-tune a model trained on the out of domain data. We observe that even with very small amounts of target domain data the performance starts improving significantly (see rows 3 and 4 of Table 6). Note that if we train a model from scratch with only limited data from the target domain, instead of fine-tuning a model trained on a different source domain, then the performance is very poor. In particular, training a model from scratch with 10K training instances we get a BLEU score of 16.2 and 28.4 for arts and sports respectively. Finally, even though the actual words used for describing a sportsperson (footballer, cricketer, etc.) would be very different from the words used to describe an artist (actor, musician, etc.), they might share many fields (for example, date of birth, occupation, etc.).
As seen in Figure 6 (attention weights corresponding to the infobox in Figure 5), the model predicts the attention weights correctly for common fields (such as occupation) but it is unable to use the right vocabulary to describe the occupation (since it has not seen such words frequently in the training data). However, once we fine-tune the model with limited data from the target domain, we see that it picks up the new vocabulary and produces a correct description of the occupation.

Conclusion
We present a model for generating natural language descriptions from structured data. To address specific characteristics of the problem, we propose neural components for fused bifocal attention and gated orthogonalization to address stay on and never look back behavior while decoding.
Our final model outperforms an existing state of the art model on the large scale WIKIBIO dataset by 21%. We also introduce datasets for French and German and demonstrate that our model gives state of the art results on these datasets. Finally, we perform experiments with an out-of-domain model and show that if such a model is fine-tuned with small amounts of in-domain data then it can give improved performance on the target domain.
Given the multilingual nature of the new datasets, as future work, we would like to build models which can jointly learn to generate natural language descriptions from structured data in multiple languages. One idea is to replace the concepts in the input infobox by Wikidata concept ids which are language agnostic. A large amount of input vocabulary could thus be shared across languages, thereby facilitating joint learning.

Figure 1 :
Figure 1: Sample Infobox with description: V. Balakrishnan (born 1943 as Venkataraman Balakrishnan) is an Indian theoretical physicist who has worked in a number of areas, including particle physics, many-body theory, the mechanical behavior of solids, dynamical systems, stochastic processes, and quantum dynamics.

Figure 3 :
Figure 3: Comparison of the attention weights and descriptions produced for Infobox in Figure 4

Figure 5 :
Figure 5: Wikipedia Infobox for Mark Tobey. (Fine tuning in Figure 6 was done with 5K in-domain data.)

Figure 6 :
Figure 6: Comparison of the attention weights and descriptions (see highlighted boxes) produced by an out-of-domain model with and without fine tuning for the Infobox in Figure 5

Table 2 :
Examples of generated descriptions from different models. For the last two examples, the name generated by the Basic Seq2Seq model is incorrect because it attended to the preceded by field.

Table 4 :
Comparison of different models on the French WIKIBIO dataset

Table 6 :
Out of domain results
