Stick to the Facts: Learning towards a Fidelity-oriented E-Commerce Product Description Generation

Unlike other text generation tasks, product description generation must produce faithful descriptions that stick to the product attribute information. However, little attention has been paid to this problem. To bridge this gap, we propose a model named Fidelity-oriented Product Description Generator (FPDG). FPDG takes the entity label of each word into account, since the product attribute information is mostly conveyed by entity words. Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on the Entity-label-guided Long Short-Term Memory (ELSTM) cell, which takes both the embedding and the entity label of each word as input. Second, we establish a keyword memory that stores the entity labels as keys and the keywords as values, so that FPDG attends to keywords by attending to their entity labels. Experiments conducted on a large-scale real-world product description dataset show that our model achieves state-of-the-art performance in terms of both traditional generation metrics and human evaluations. Specifically, FPDG increases the fidelity of the generated descriptions by 25%.


Introduction
The effectiveness of automatic text generation has been proved in various natural language processing applications, such as neural machine translation (Luong et al., 2015), dialogue generation (Tao et al., 2018a; Hu et al., 2019), and abstractive text summarization (Gao et al., 2019). One such task, product description generation, has also attracted considerable attention (Lipton et al., 2015; Wang et al., 2017; Zhang et al., 2019a). An accurate and attractive description not only helps customers make an informed decision but also improves the likelihood of purchase. Concretely, the product description generation task takes several keywords describing the attributes of a product as input, and then outputs fluent and attractive sentences that highlight the features of this product.
There is one intrinsic difference between product description generation and other generation tasks, such as story or poem generation (Yao et al., 2019): the generated description has to be faithful to the product attributes. An example case is shown in Table 1, where the good product description matches the input information, while the bad description mistakes the brand and the style of the jeans. Our preliminary study reveals that 48% of the outputs from a state-of-the-art sequence-to-sequence system suffer from this problem. In a real-world e-commerce product description generation system, generating text with unfaithful information is unacceptable. A wrong brand name will damage the interests of advertisers, while a wrong product category might break the law by misleading consumers. Any kind of unfaithful description can bring huge economic losses to online platforms.
Despite its importance, this problem has received little attention. Existing product description generators focus on other goals, such as generating personalized descriptions or pattern-controlled descriptions (Zhang et al., 2019b).
In this paper, we address the fidelity problem in generation tasks by developing a model named Fidelity-oriented Product Description Generator (FPDG), which takes keywords about the product attributes as inputs. Since product attribute information is mostly conveyed by entity words (89.72% of the input keywords), FPDG generates faithful product descriptions by taking into account the entity label of each word. Specifically, we first propose an Entity-label-guided Long Short-Term Memory (ELSTM) as the cell in the decoder RNN, which takes the entity label of an input word as well as the word itself as input. Then, we establish a keyword memory storing the input keywords with their entity labels. In each decoding step, the current hidden state of the ELSTM focuses on the proper words in this keyword memory according to their entity categories. We also collect a large-scale real-world product description dataset from one of the largest e-commerce platforms in China. Extensive experiments conducted on this dataset show that FPDG outperforms the state-of-the-art baselines in terms of traditional generation metrics and human evaluations. Specifically, FPDG improves the fidelity of the generated descriptions by 24.61%.
To the best of our knowledge, we are the first to explore the fidelity problem of product description generation. To tackle this problem, we propose an ELSTM and a keyword memory that incorporate entity label information, so as to generate more accurate descriptions.

Related Work
We detail related work on text generation, entity-related generation, and product description.
Text generation. Recently, sequence-to-sequence (Seq2Seq) neural network models have been widely used in NLG approaches. Their effectiveness has been demonstrated in a variety of text generation tasks, such as neural machine translation (Luong et al., 2015; Bahdanau et al., 2014; Wu et al., 2016), abstractive text summarization (See et al., 2017a; Hsu et al., 2018; Chen and Bansal, 2018), and dialogue generation (Tao et al., 2018a; Xing et al., 2017). Along another line, there are also works based on an attention mechanism. Vaswani et al. (2017) proposed the Transformer architecture, which utilizes the self-attention mechanism and has achieved state-of-the-art results in neural machine translation. Since then, the attention mechanism has been used in a variety of tasks (Devlin et al., 2018; Fan et al., 2018; Zhou et al., 2018).
Entity-related generation. Named entity recognition (NER) is a fundamental component in language understanding and reasoning (Greenberg et al., 2018; Katiyar and Cardie, 2018). Ji et al. (2017) showed that adding entity-related information can improve the performance of language modeling. Building upon this work, Clark et al. (2018) combined entity context with previous-sentence context, and demonstrated the importance of the latter in a coherence test. Another line of related work generates recipes using neural networks to track and update entity representations (Bosselut et al., 2018). Different from the above works, we utilize entity labels as supplementary information to assist decoding in the text generation task.
Product descriptions. Quality product descriptions are critical for providing a competitive customer experience on an e-commerce platform. Due to its importance, automatically generating product descriptions has attracted considerable interest. Initial works include (Wang et al., 2017), which combines statistical methods with templates to generate product descriptions. With the development of neural networks, later work explored a new way to generate personalized product descriptions by combining the power of neural networks and a knowledge base. Zhang et al. (2019b) proposed a pointer-generator neural network to generate product descriptions whose patterns are controlled.
In real-world product description generation application, however, the most important prerequisite is the fidelity of generated text, and to the best of our knowledge, no research has been conducted on this so far.

Problem Formulation
FPDG takes a list of keywords $X = \{x_1, \dots, x_{T_X}\}$ as input, where $T_X$ is the number of keywords. These keywords all describe important attributes of the product. The goal of FPDG is to generate a product description $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_{T_{\hat{Y}}}\}$ that is not only grammatically correct but also consistent with the input information, such as the brand name and the fashion style. Essentially, FPDG optimizes its parameters to maximize the probability $P(Y \mid X)$ of the reference description $Y$ given the keywords $X$.
Figure 1: Overview of FPDG. Green denotes an entity label and purple denotes a word. We divide our model into two components: (1) The Keyword Encoder stores the word and its entity label in the token memory, and uses Self-Attention Modules (SAMs) to encode words and entity labels; (2) The Entity-based Generator generates product description based on the token memory and SAM encoders.

Model
In this section, we introduce our Fidelity-oriented Product Description Generator (FPDG) model in detail. An overview of FPDG is shown in Figure 1. The model can be split into two modules: (1) Keyword Encoder (see § 4.1): We first use a key-value memory to store each entity label as a key and the corresponding words as values. To better learn the interaction between words, we employ two self-attention modules from Transformer (Vaswani et al., 2017) to model the input keywords and their entity labels separately. (2) Entity-based Generator (see § 4.2): An ELSTM-based decoder generates the product description based on the keyword memory and the two self-attention encoders.

Keyword Encoder
In the product description dataset, most of the input keywords are entity words (up to 89.72%), and the description text should be faithful to these entity words. Hence, we incorporate entity label information to improve the accuracy of the generated text. In this section, we introduce how to embed input keywords with entity label information.
To begin with, we use an embedding matrix $e$ to map the one-hot representation of each word $x_i$ into a high-dimensional vector space. Since our input keywords have no order information, we use the Self-Attention Module (SAM) from Transformer (Vaswani et al., 2017), instead of an RNN-based encoder, to model the interactions between the words. We use three fully connected layers to project $e(x_i)$ into three spaces, i.e., the query $q_i = F_q(e(x_i))$, the key $k_i = F_k(e(x_i))$, and the value $v_i = F_v(e(x_i))$. The attention module then takes $q_i$ to attend to each $k_\cdot$, and uses the resulting attention distribution $\alpha_{i,\cdot} \in \mathbb{R}^{T_X}$ as weights to obtain the weighted sum of the values, $\beta_i = \sum_{j=1}^{T_X} \alpha_{i,j} v_j$ (Equation 2), where $\alpha_{i,j}$ denotes the attention weight of the $i$-th word on the $j$-th word. Next, we add the original word representation $e(x_i)$ to $\beta_i$ as a residual connection, $\hat{h}_i = e(x_i) + \beta_i$ (Equation 3). Finally, we apply a feed-forward layer with trainable parameters $W_1$, $W_2$, $b_1$, $b_2$ to $\hat{h}_i$ to obtain the final word representation $h_i$. We refer to the whole process above as a SAM.
In the meantime, we leverage the e-commerce-adapted AliNER to label each word in the input with an entity label such as "brand name" or "color". Non-entity words are labeled as "normal word". We denote $c_i$ as the entity label of the $i$-th word. To better learn the interaction between these entity labels, we use a second SAM, taking $c_\cdot$ as input, and obtain the final representation of $c_i$ as $m_i$.
We also propose a key-value keyword memory that stores the labeling results, as shown in Figure 1. The keys in the keyword memory are $T_C$ representations of different entity labels, where $T_C$ is the number of entity categories. The values in the memory are the self-attention representations of the words that belong to each category. We denote the $k$-th entity label in the keyword memory as $c^k$, and the $i$-th word belonging to this category as $e(x^k_i)$. This keyword memory is incorporated into the generation process in § 4.2.3.
Figure 2: The structure of ELSTM, which is a hybrid of three LSTMs.
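The self-attention encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the projections $F_q$, $F_k$, $F_v$ are taken to be bias-free linear maps, the attention scores are unscaled dot products passed through a softmax, and the feed-forward layer uses a ReLU. None of these details are given in the text, and the names `self_attention_module` and `params` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_module(E, params):
    """E: (T_X, d) keyword embeddings e(x_i). Returns (h, alpha)."""
    q = E @ params["Wq"]                 # queries q_i (assumed linear F_q)
    k = E @ params["Wk"]                 # keys k_i (assumed linear F_k)
    v = E @ params["Wv"]                 # values v_i (assumed linear F_v)
    alpha = softmax(q @ k.T, axis=-1)    # alpha[i, j]: weight of word i on word j
    beta = alpha @ v                     # weighted sum of the values
    h_hat = E + beta                     # residual connection
    # Feed-forward layer; the ReLU non-linearity is an assumption.
    hidden = np.maximum(0.0, h_hat @ params["W1"] + params["b1"])
    h = hidden @ params["W2"] + params["b2"]
    return h, alpha

rng = np.random.default_rng(0)
T_X, d = 5, 8
params = {name: rng.normal(scale=0.1, size=(d, d))
          for name in ("Wq", "Wk", "Wv", "W1", "W2")}
params["b1"] = np.zeros(d)
params["b2"] = np.zeros(d)
E = rng.normal(size=(T_X, d))
h, alpha = self_attention_module(E, params)
```

Because no positional information is added, reordering the input keywords simply reorders the rows of `h`, which matches the observation that the keywords carry no order information.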

Entity-based Generator
To incorporate entity label information into the generation process, we propose a modified version of LSTM, named Entity-label-guided Long Short-Term Memory (ELSTM). We first introduce ELSTM and then introduce the RNN decoder based on ELSTM.
As shown in Figure 3, ELSTM consists of three LSTMs: two Word-LSTMs and one Entity-Label-LSTM. The hidden states of the two Word-LSTMs are integrated to form $h^w_t$, while the hidden state of the Entity-Label-LSTM is $h^l_t$. The word hidden state $h^w_t$ attends to the SAM outputs to predict the next word, while the entity label hidden state $h^l_t$ attends to the keyword memory and predicts the entity label of the next word. In other words, ELSTM is a hybrid of three LSTMs with a deeper interaction between these cells. Overall, ELSTM takes two inputs: the embedding of an input word $e(y_t)$ and the representation of its entity label $m_t$.
The structure of each LSTM in ELSTM is the same as the original LSTM and is therefore omitted here due to space limitations. Next, we introduce the interaction within ELSTM in detail. As shown in Figure 2, Word-LSTM$_0$ takes $e(y_t)$ as input and outputs the initial word hidden state $\tilde{h}^w_{t+1}$. Next, the Entity-Label-LSTM calculates the entity label hidden state, taking both $m_t$ and $\tilde{h}^w_{t+1}$ as input. To further improve the accuracy of entity label prediction, we apply another multi-layer projection that incorporates the entity label context vector $c^m_t$ (introduced in Equation 16) to obtain the polished entity label hidden state $h^l_{t+1}$, where $[\cdot, \cdot]$ denotes the concatenation operation. Finally, Word-LSTM$_1$ takes the entity label hidden state $h^l_{t+1}$ and the word embedding $e(y_t)$ as input, and outputs the predicted word hidden state $h^{w,1}_{t+1}$, which contains entity label information. Note that the predicted entity label hidden state may not be accurate enough. Hence, we apply a gate fusion that combines $h^{w,1}_{t+1}$ with the initial word hidden state $\tilde{h}^w_{t+1}$ to ensure the quality of the word hidden state. The fusion result is $h^w_{t+1}$, i.e., the updated word hidden state.
In this way, the entity label information and the word information fully interact within the ELSTM cell as the hidden states are updated.
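The data flow through one ELSTM step can be illustrated with the following NumPy sketch. It is a rough illustration under several assumptions that the text does not settle: the multi-layer projection is reduced to a single tanh layer, the gate is a sigmoid over the concatenated candidate states, both Word-LSTMs read the previous fused word hidden state, and each of the three LSTMs keeps its own cell state. The class names `LSTMCell` and `ELSTMCell` and all parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """A standard LSTM cell operating on single (unbatched) vectors."""
    def __init__(self, input_size, hidden_size, rng):
        self.W = rng.normal(scale=0.1,
                            size=(input_size + hidden_size, 4 * hidden_size))
        self.b = np.zeros(4 * hidden_size)

    def __call__(self, x, h, c):
        z = np.concatenate([x, h]) @ self.W + self.b
        i, f, g, o = np.split(z, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

class ELSTMCell:
    def __init__(self, d, rng):
        self.word_lstm0 = LSTMCell(d, d, rng)       # input: e(y_t)
        self.label_lstm = LSTMCell(2 * d, d, rng)   # input: [m_t, initial word state]
        self.word_lstm1 = LSTMCell(2 * d, d, rng)   # input: [label state, e(y_t)]
        self.W_proj = rng.normal(scale=0.1, size=(2 * d, d))  # assumed 1-layer projection
        self.W_gate = rng.normal(scale=0.1, size=(2 * d, d))  # assumed gate parameters

    def step(self, e_y, m, c_m, state):
        h_w, c_w0, c_w1, h_l, c_l = state
        # 1) Word-LSTM_0 produces the initial word hidden state.
        h_w0, c_w0 = self.word_lstm0(e_y, h_w, c_w0)
        # 2) Entity-Label-LSTM reads the label representation and that state.
        h_l_raw, c_l = self.label_lstm(np.concatenate([m, h_w0]), h_l, c_l)
        # 3) Projection with the entity label context vector c^m_t.
        h_l_new = np.tanh(np.concatenate([h_l_raw, c_m]) @ self.W_proj)
        # 4) Word-LSTM_1 reads the polished label state and the word embedding.
        h_w1, c_w1 = self.word_lstm1(np.concatenate([h_l_new, e_y]), h_w, c_w1)
        # 5) Gate fusion between the label-aware and the initial word states.
        gate = sigmoid(np.concatenate([h_w1, h_w0]) @ self.W_gate)
        h_w_new = gate * h_w1 + (1.0 - gate) * h_w0
        return (h_w_new, c_w0, c_w1, h_l_new, c_l), (h_w0, h_w1)

rng = np.random.default_rng(0)
d = 6
cell = ELSTMCell(d, rng)
state = tuple(np.zeros(d) for _ in range(5))
e_y, m, c_m = (rng.normal(size=d) for _ in range(3))
state, (h_w0, h_w1) = cell.step(e_y, m, c_m, state)
h_w, _, _, h_l, _ = state
```

With this form of gate, every component of the fused state lies between the corresponding components of the two candidate states, so an inaccurate label-aware state can be damped by the initial word state.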

Figure 3: An overview of the description generator (components: Word-SAM, Word-Attention, Label-Attention, ELSTM).

ELSTM-based Attention Mechanism
Built on ELSTM, we now have a new RNN decoder to generate descriptions, incorporating the two SAM encoders and the keyword memory. First, we apply a Bi-RNN to encode the input keywords and use its last hidden state as the initial decoder hidden states, i.e., $h^w_0$ and $h^l_0$. Each decoding step $t$ is then computed by the ELSTM cell. Next, similar to the traditional attention mechanism of (Bahdanau et al., 2015), we summarize the input word representations $h_\cdot$ and the entity label representations $m_\cdot$ into the word context vector $c^w_t$ and the entity label context vector $c^m_t$, respectively. To obtain $c^w_t$ and $c^m_t$, we use the two hidden states in ELSTM, i.e., $h^w_t$ and $h^l_t$, to attend to $h_\cdot$ and $m_\cdot$, respectively. Specifically, for the entity label context vector $c^m_t$, the decoder state $h^l_t$ is used to attend to each entity label representation $m_i$, resulting in the attention distribution $\gamma_t \in \mathbb{R}^{T_X}$, as shown in Equation 15. Then we use $\gamma_t$ to obtain a weighted sum of the entity label representations $m_\cdot$ as the entity label context vector $c^m_t$, as shown in Equation 16. $c^w_t$ is obtained similarly, with the hidden state $h^w_t$ attending to $h_\cdot$; we omit the details for brevity. These two context vectors play different roles, which are described in § 4.2.4.
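As a concrete illustration of the entity label context vector, the sketch below uses an additive scoring function in the style of Bahdanau et al. The exact scoring function of Equation 15 is not recoverable from the text, so the additive form and the parameters `W_h`, `W_m`, and `v` are assumptions, and `label_context` is a hypothetical name.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def label_context(h_l, M, W_h, W_m, v):
    """h_l: (d,) label hidden state; M: (T_X, d) label representations m_i.

    Returns the context vector c^m_t and the attention distribution gamma_t.
    """
    # Additive attention score for each m_i (assumed form).
    scores = np.tanh(h_l @ W_h + M @ W_m) @ v   # shape (T_X,)
    gamma = softmax(scores)                      # attention distribution over inputs
    c_m = gamma @ M                              # weighted sum of the m_i
    return c_m, gamma

rng = np.random.default_rng(0)
T_X, d, a = 5, 8, 16
M = rng.normal(size=(T_X, d))
h_l = rng.normal(size=d)
W_h = rng.normal(scale=0.1, size=(d, a))
W_m = rng.normal(scale=0.1, size=(d, a))
v = rng.normal(scale=0.1, size=a)
c_m, gamma = label_context(h_l, M, W_h, W_m, v)
```

The word context vector would be computed the same way, with the word hidden state and the word-side SAM outputs in place of `h_l` and `M`.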

Incorporating Keyword Memory
So far, we have computed the context vectors. Next, we describe how to incorporate the guidance from the keyword memory. We first use the entity label hidden state to attend to the entity label keys in the keyword memory with Gumbel-softmax (Jang et al., 2016) (Equation 17), and then use the attention weights to obtain the weighted sum $o_{t+1}$ of the self-attention representations of the values in the memory (Equation 19). Here $p_k = -\log(-\log(g_k))$ with $g_k \sim U(0, 1)$ is the Gumbel noise, $\tau \in [0, 1]$ is the softmax temperature, and $T_{c^k}$ denotes the number of input keywords that belong to the $k$-th entity category. We choose Gumbel-softmax instead of the regular softmax because a generated word can only belong to one entity category. In this way, FPDG first predicts the entity label of the word to be generated and then uses $o_{t+1}$ to store the information of the input words that belong to this category.
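The memory read described above can be sketched as follows. The Gumbel noise follows the definition of $p_k$ given in the text; everything else is an assumption made for illustration: the relevance of each key is scored with a dot product, the value of a category is the mean of its word representations (zero for empty categories), and the function names are hypothetical.

```python
import numpy as np

def build_keyword_memory(labels, word_reprs, num_categories):
    """Group SAM word representations by entity label id.

    Returns an array of shape (num_categories, d) whose k-th row is the mean
    representation of the input words in category k (zeros if there are none).
    """
    labels = np.asarray(labels)
    values = np.zeros((num_categories, word_reprs.shape[1]))
    for k in range(num_categories):
        members = word_reprs[labels == k]
        if len(members) > 0:
            values[k] = members.mean(axis=0)
    return values

def gumbel_softmax(logits, tau, rng):
    # p_k = -log(-log(g_k)), g_k ~ U(0, 1); the bounds avoid log(0).
    g = rng.uniform(low=1e-10, high=1.0 - 1e-10, size=logits.shape)
    noise = -np.log(-np.log(g))
    z = (logits + noise) / tau
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def read_memory(h_l, label_keys, values, tau, rng):
    # Dot-product scoring between the label hidden state and each key (assumed).
    weights = gumbel_softmax(label_keys @ h_l, tau, rng)
    return weights @ values, weights

rng = np.random.default_rng(0)
d, T_C = 4, 3
labels = [0, 2, 0]                       # e.g. two words of category 0, one of category 2
word_reprs = rng.normal(size=(3, d))
values = build_keyword_memory(labels, word_reprs, T_C)
label_keys = rng.normal(size=(T_C, d))
h_l = rng.normal(size=d)
o, weights = read_memory(h_l, label_keys, values, tau=0.5, rng=rng)
```

Lower temperatures push `weights` toward a one-hot vector, which fits the motivation that a generated word belongs to a single entity category.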

Projection Layers
Next, using a fusion gate $g_t$, $o_{t+1}$ is combined with the word context vector $c^w_{t+1}$, in a way similar to Equation 12, to obtain the updated $o_{t+1}$. Finally, we obtain the generation distribution $P^v$ over the vocabulary by concatenating the memory vector $o_{t+1}$, the word context vector $c^w_{t+1}$, and the output of the decoder ELSTM $h^w_{t+1}$ as the input of the output projection layer.
Apart from predicting the next word, we also use the entity label hidden state $h^l_t$ to predict the entity label of the next word as an auxiliary task, which yields the distribution $P^e_{t+1}$ over entity categories. We use the negative log-likelihood of both predictions as the loss function, where $\lambda$ is the weight of the entity label prediction loss. Gradient descent is employed to update all parameters and minimize this loss function.
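A minimal sketch of such a joint objective is given below. It assumes that the two negative log-likelihood terms are summed over decoding steps and combined as word loss plus $\lambda$ times label loss; the exact normalization is not stated in the text, and `joint_nll` is a hypothetical name.

```python
import numpy as np

def joint_nll(word_probs, word_targets, label_probs, label_targets, lam):
    """word_probs: (T, V) predicted word distributions per decoding step.
    label_probs: (T, C) predicted entity label distributions per step.
    Targets are integer index arrays of length T.
    """
    steps = np.arange(len(word_targets))
    word_nll = -np.log(word_probs[steps, word_targets]).sum()
    label_nll = -np.log(label_probs[steps, label_targets]).sum()
    # lam weights the auxiliary entity label prediction loss.
    return word_nll + lam * label_nll

# Toy example: uniform predictions over 4 words and 2 labels for 3 steps.
word_probs = np.full((3, 4), 0.25)
label_probs = np.full((3, 2), 0.5)
word_targets = np.array([0, 3, 1])
label_targets = np.array([1, 0, 1])
loss = joint_nll(word_probs, word_targets, label_probs, label_targets, lam=0.6)
```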

Dataset
We collect a large-scale real-world product description dataset from one of the largest e-commerce platforms in China. The inputs are keywords about the product attributes selected by sellers on the platform, and the product descriptions are written by professional experts who are good at marketing. Overall, there are 404,000 training samples, and 5,000 validation and test samples. On average, there are 10 keywords in the input and 63 words in the product description. 89.72% of the input keywords are entity words.

Evaluation Metrics
Following (Fu et al., 2017), we use the evaluation package of (Chen et al., 2015), which includes BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR and ROUGE-L. BLEU is a popular machine translation metric that analyzes the co-occurrences of n-grams between the candidate and reference sentences. METEOR is calculated by generating an alignment between the words in the candidate and reference sentences, with an aim of 1:1 correspondence. ROUGE-L is a metric designed to evaluate text summarization algorithms. Tao et al. (2018b) note that using only the BLEU metric to evaluate text quality can be misleading. Therefore, we also evaluate our model by human evaluation. Three highly educated participants are asked to score 100 randomly sampled descriptions generated by Pointer-Gen and FPDG. We chose Pointer-Gen since its performance is higher than that of the other baselines. The statistical significance of observed differences between the performance of two runs is tested using a two-tailed paired t-test, and strong significance (α = 0.01) is marked in the result tables.

Comparison Methods
We first conduct an ablation study to verify the effectiveness of each module in FPDG. The model w/o ELSTM is implemented with only keys in the memory, because there are no entity representations as values. Then, to evaluate the performance of our proposed dataset and model, we compare it with the following baselines:
(1) Seq2Seq: The sequence-to-sequence framework (Sutskever et al., 2014) is one of the initial works proposed for the language generation task.
(2) Pointer-Gen: A sequence-to-sequence framework with pointer and coverage mechanisms proposed in (See et al., 2017b).
(4) FTSum: A faithful summarization model proposed in (Cao et al., 2018), which leverages open information extraction and dependency parsing technologies to extract actual fact descriptions from the source text.
(5) Transformer: A network architecture that is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely (Vaswani et al., 2017).
(6) PCPG: A pattern-controlled product description generation model proposed in (Zhang et al., 2019b). We adapt the model for our scenario.
To verify whether the performance improvement is obtained by adding additional entity label inputs, we directly concatenate the word embedding with the entity embedding as input for these baselines, denoted as "with Entity Embedding" in Table 2.

Implementation Details
We implement our experiments in PyTorch on Tesla V100 GPUs. The word and entity-label embedding dimensions are set to 256. The number of hidden units and the entity-label hidden size are also set to 256. All inputs are padded with zeros to the maximum keyword number of the batch. There are 36 categories of entity labels in total. λ is set to 0 in the first 500 steps, and to 0.6 in the rest of the training process. We perform mini-batch cross-entropy training with a batch size of 256 documents for 15 training epochs. We set the minimum decoding step to 15 and the maximum decoding step to 70. During decoding, we employ beam search with a beam size of 4 to generate more fluent sentences. Training takes around 6 hours on GPUs. After each epoch, we evaluate our model on the validation set and choose the best performing model for the test set. We use the Adam optimizer (Duchi et al., 2010) as our optimizing algorithm and the learning rate is 1e-3.
Experimental Results
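Two of the details above, the step-dependent loss weight and the zero padding, are easy to misread, so a small sketch may help. It assumes that training steps are counted from 0, so that "the first 500 steps" means steps 0 to 499, and that padding uses token id 0; the function names are hypothetical.

```python
import numpy as np

def label_loss_weight(step, warmup_steps=500, lam=0.6):
    # Entity label loss is switched off during the first warmup_steps steps.
    return 0.0 if step < warmup_steps else lam

def pad_batch(keyword_ids, pad_id=0):
    """Pad variable-length keyword id lists to the longest list in the batch."""
    max_len = max(len(ids) for ids in keyword_ids)
    batch = np.full((len(keyword_ids), max_len), pad_id, dtype=np.int64)
    for row, ids in enumerate(keyword_ids):
        batch[row, :len(ids)] = ids
    return batch

batch = pad_batch([[3, 1], [4, 1, 5]])
```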

Overall Performance
For research question RQ1, we examine the performance of our model and the baselines in terms of BLEU, as shown in Table 2. FPDG outperforms the strongest baseline, Pointer-Gen, by 5.22%, 6.86%, 6.79%, and 6.65% in terms of BLEU-1, BLEU-2, BLEU-3, and BLEU-4, respectively. As for human evaluation, we ask three highly educated participants to rate the generated descriptions in terms of fluency, informativity, and fidelity. The rating score ranges from 1 to 3, with 3 being the best. The results are shown in Table 3, where FPDG outperforms Pointer-Gen by 10.31% and 19.02% in terms of fluency and informativity, and, notably, improves the fidelity score by 24.61%. We also conduct a paired Student's t-test between our model and Pointer-Gen, and the result demonstrates the significance of the above results. The kappa statistics are 0.35 and 0.49, respectively, which indicates fair and moderate agreement between annotators.

Ablation Study
Next, we turn to research question RQ2. We conduct ablation tests on the usage of the keyword memory and ELSTM, corresponding to FPDG w/o KW-MEM and FPDG w/o ELSTM, respectively. The ROUGE score result is shown at the bottom of Table 2. All ablation models perform worse than FPDG in terms of almost all metrics, which demonstrates the necessity of each module in FPDG. Specifically, ELSTM makes the greatest contribution to FPDG, improving the BLEU-1 and BLEU-2 scores by 2.71% and 1.51%, respectively.

Analysis of Keyword Memory
We then address RQ3 by analyzing the entity-label-level attention on the keyword memory. Two representative cases are shown in Figure 4. The upper figure is the attention map when generating the word "toryburch", and the lower figure is the attention map when generating the word "printflower". The darker the color, the higher the attention. Due to limited space, we omit irrelevant entity categories. When generating "toryburch", which is a brand name, the entity-label-level attention pays most attention to the "Brand" entity label, and when generating "flower", which is a style element, it mostly pays attention to "Element". This example demonstrates the effectiveness of the entity-label-level attention.

Analysis of ELSTM
We now turn to RQ4: whether or not the ELSTM can capture entity label information. We examine this question by verifying whether ELSTM can predict the entity label of the generated word. The accuracy of the predicted entity label is calculated in a teacher-forcing style, i.e., at each step ELSTM takes the ground-truth entity label and word as input, and outputs the entity label of the next word. We employ recall at position k among n candidates ($R_k@n$) as the evaluation metric. Over the whole test set, $R_1@36$ is 64.12%, $R_2@36$ is 80.86%, and $R_3@36$ is 94.02%, which means ELSTM can capture the entity label information to a great extent and guide the word generation.
We also show a case study in Table 4. The description generated by Pointer-Gen introduces the Prada bag as a "clamshell bag has a good anti-theft effect", which is contrary to the fact. In contrast, our model generates the faithful description: "The opening and closing design of the zipper smoothes the stroke".
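The recall metric can be computed as in the sketch below: a prediction counts as a hit at position k if the ground-truth label is among the k highest-scoring of the n candidate labels. The function name `recall_at_k` and the toy scores are illustrative only.

```python
import numpy as np

def recall_at_k(scores, gold, k):
    """scores: (N, n) label scores for N predictions over n candidate labels.
    gold: (N,) ground-truth label ids. Returns the fraction of hits in the top k.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]      # indices of the k best labels
    hits = (topk == gold[:, None]).any(axis=1)
    return hits.mean()

# Toy example with 3 predictions over 4 candidate labels.
scores = np.array([[0.1, 0.7, 0.2, 0.0],
                   [0.5, 0.3, 0.1, 0.1],
                   [0.2, 0.1, 0.3, 0.4]])
gold = np.array([1, 1, 0])
r1 = recall_at_k(scores, gold, 1)
r2 = recall_at_k(scores, gold, 2)
r3 = recall_at_k(scores, gold, 3)
```

In the setting of the paper, n would be the 36 entity label categories and k would range over 1, 2, and 3.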

Conclusion and Future Work
In this paper, we explore the fidelity problem in product description generation. To tackle this challenge, based on the consideration that product attribute information is typically conveyed by entity words, we incorporate the entity label information of each word to enable the model to better understand the words and better focus on key information. Specifically, we propose an Entity-label-guided Long Short-Term Memory (ELSTM) and a keyword memory to store and capture the entity label information of each word. Our model outperforms state-of-the-art methods in terms of BLEU and human evaluations by a large margin. In the near future, we aim to fully prevent the generation of unfaithful descriptions and bring FPDG online.