Controlling Contents in Data-to-Document Generation with Human-Designed Topic Labels

We propose a data-to-document generator, based on a neural language model, whose output contents can be easily controlled. A conventional data-to-text model is useful when a reader seeks a global summary of data, because it only has to describe an important part that has been extracted beforehand. However, what readers are interested in differs from user to user, so it is necessary to develop a method that generates various summaries according to users' interests. We develop a model that generates various summaries and controls their contents by providing explicit targets for reference to the model as controllable factors. In the experiments, we used five-minute or one-hour charts of 9 indicators (e.g., Nikkei 225) as time-series data, and daily summaries of Nikkei Quick News as textual data. We conducted comparative experiments using two kinds of referential information for generation: human-designed topic labels indicating the contents of a sentence, and automatically extracted keywords.

Over the past several years, end-to-end neural language generation models have successfully been applied to versatile data-to-text tasks, because they can generate fluent texts without task-specific knowledge and resources.
However, it has also been pointed out that texts generated by neural models suffer from low diversity in expressions (Yang et al., 2019). In data-to-text tasks in particular, previous methods were developed under the assumption that the important contents can be uniquely determined, and therefore did not focus on controlling the contents according to users' interests.
Each user, however, may expect different contents in a summary depending on what they are interested in, and thus it is appealing to develop a method that generates various summaries reflecting users' interests.
This paper investigates a method for guiding data-to-document generation in the finance domain, by referring to a sequence of additional information for input financial data. Generating documents consisting of multiple sentences involves an inherent challenge in content selection and ordering (Reiter and Dale, 1997), because one can produce a large variety of documents for specific input data, depending on a focus, intent, readers' interest, etc. Therefore, it is essential for document generation systems to have an additional mechanism to select and order the contents to be represented.
We introduce and empirically compare two types of topic labels, both of which are intended to denote clause-level contents and their order. One is topical keywords automatically extracted from domain texts (Rose et al., 2010), which have been applied to story generation as a representation of the contents of the story (Yao et al., 2018).
The other is manually defined topic labels. As our target domain is finance, major topics mentioned in documents are restricted to market indices such as Dow Jones Industrial Average (DJI), Nikkei 225, or foreign exchange rates, etc. We devised a closed set of domain-specific labels by investigating financial news articles. In the experiments on generating daily summaries of financial markets, we will empirically show the effectiveness of topic labels and potential advantages/disadvantages of this approach.

Related study
Controllability of text generation has been an intensive research focus recently. Examples include suggestive content control such as tense, sentiment, gender, or automatically learned hidden states (Hu et al., 2017; Juraska and Walker, 2018; Bau et al., 2019). Another series of work focuses on controlling surface textual features such as length, descriptiveness, and politeness (Sennrich et al., 2016; Kikuchi et al., 2016; Ficler and Goldberg, 2017; Shen et al., 2017; Prabhumoye et al., 2018). These previous methods target generic, content-independent features of texts. That is, they aim at varying surface strings while preserving the main information content. Wiseman et al. (2018) proposed a neural model that generates diverse texts by learning templates. They control diversity through templates rather than through the contents or their order. The present work is more closely related to methods for controlling topical content by using automatically extracted or human-designed keywords (Wang et al., 2016; Yao et al., 2017, 2018; Miao et al., 2018). Our method resembles the idea of using keywords to control the topics of sentences and their order, but it primarily focuses on describing given data and uses topic labels as auxiliary information. We empirically examine the added effects of introducing topic labels in the data-to-document scenario.
Besides, Gkatzia et al. (2017) and Portet et al. (2009) proposed non-neural language generation models for the data-to-text task with higher controllability on the output. They assumed that the important contents and their descriptions are determined primarily by experts, and their models do not allow users to select the contents directly.
To the best of our knowledge, no previous research has tackled the problem of content controllability in the data-to-document task.
The contributions of this paper are as follows. First, we propose an explicitly content-controllable data-to-document generator that uses additional clause-level information. Experiments show the fluency and fidelity of the generated documents in terms of BLEU and human evaluation. Second, a comparison of documents generated with human-designed labels and with automatically extracted keywords shows that human-designed labels are more useful in terms of ease of understanding.

Generation of Market Comments
Our task is to generate summaries of financial markets. The input is a set of financial time-series data, such as DJI, Nikkei 225, and JPY/USD exchange rate. The output is a sequence of sentences describing movements of the financial data and their relationships.
The overview of our model is illustrated in Figure 1. In the following, we first describe our design of topic labels, then describe our data-to-text model with topic labels.

Topic Labels
Topic labels are defined as clause-level topics that guide the sequence of contents to be output. We empirically compare two methods to obtain topic labels: automatically extracted keywords and human-designed labels.

Automatically extracted keywords
This is a straightforward strategy to obtain labels as the topics of sentences. We use the RAKE algorithm (Rose et al., 2010), which builds document graphs and weights the importance of each word by combining several word-level and graph-level criteria to extract keywords. Using such an automatic keyword extraction system saves the cost of human annotation, but the extracted keywords sometimes do not express the writers' intent. For example, RAKE often outputs the word "market" or "observation" as a keyword, but these are not appropriate as topic labels because they lack precise information: a system would be unable to tell which market (e.g., Nikkei or DJI) or what kind of observation (e.g., the growth rate of stock prices or the trends of investments) is meant when generating a text from these labels.
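To make the keyword-scoring idea concrete, the following is a minimal sketch of RAKE-style extraction, not the implementation used in the experiments. It assumes English whitespace-tokenizable text and a tiny hand-written stopword list (`STOPWORDS`, `candidate_phrases`, and `rake_keywords` are illustrative names); the experiments in this paper use Japanese news text, for which tokenization would differ. Each word is scored by its co-occurrence degree divided by its frequency, and a phrase is scored by the sum over its words.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE uses a much larger one).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "was", "as", "by", "for"}


def candidate_phrases(text):
    """Split text into candidate phrases at stopwords and punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_keywords(text, top_k=2):
    """Rank candidate phrases by the sum of deg(w) / freq(w) over their words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, counting the word itself
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top_k]]
```

Because the score only reflects co-occurrence statistics, a generic word such as "market" can rank highly without indicating which market is meant, which is the ambiguity discussed above.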

Human-designed topic labels
We devised a set of topic labels by observing target sentences in the training data and what they often refer to, especially for Nikkei Quick News (NQN).
A topic label denotes the objects mentioned in documents and is defined as a triple, as in [Nikkei 225/ Actuals/ Movement]; the element [Others] is given when the subcategories do not apply. Table 1 shows all atomic labels, and Table 2 illustrates an example of a sentence and its topic labels. Note that a topic label is given to a clause, which means that one sentence may have multiple topic labels. On the annotated data, the average number of labels per sentence is 2.0.
We designed the topic labels so that a user can easily control the contents. News articles often refer to the trend of each market index and also to the relationships among indices, focusing on general market movements. This means that a generated summary sometimes would not contain the contents expected by the user, especially when the markets the user follows are not common ones. Therefore, we developed topic labels with the concrete names of market indices. By inputting the names of the markets they are interested in, users can generate articles that contain the content they want.
Besides, the granularity of topics is an important factor in designing the topic labels. Topics that are too coarse make their meaning vague, while topics that are too detailed would be difficult for some users to handle and take time to annotate. We designed our topic labels in a hierarchical structure to make them adaptable to different levels of granularity.
Human-designed labels are inferior in terms of construction cost, but expected to be superior to automatically extracted keywords in terms of interpretability as discussed later.
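As an illustration of the triple structure and its hierarchical use, the sketch below parses a label written in the bracketed form that appears in this paper (e.g., [Nikkei 225/ Actuals/ Movement]) and reduces it to its first element, as the simplified HDTag1 setting does. The class name `TopicLabel` and the field names `subcategory` and `content` are hypothetical; only "target" is named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicLabel:
    target: str       # e.g., a market index name
    subcategory: str  # hypothetical field name; may be empty
    content: str      # hypothetical field name

    @classmethod
    def parse(cls, text):
        """Parse a label written as '[target/ subcategory/ content]'."""
        parts = [p.strip() for p in text.strip().strip("[]").split("/")]
        if len(parts) != 3:
            raise ValueError(f"expected three elements, got {parts!r}")
        return cls(*parts)

    def coarse(self):
        """Keep only the target, a coarser level of granularity."""
        return self.target
```

For example, `TopicLabel.parse("[JPY//Movement]")` yields a label with an empty middle element, and calling `coarse()` on it returns `"JPY"`.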

Encoder-Decoder with Topic Labels
Our model is an extended encoder-decoder that conditions on a document-level topic label sequence and previous sentences, in addition to financial data consisting of multiple numerical sequences. To make learning and generation simpler, our decoder generates each sentence separately, by encoding the most recently generated sentences. The entire neural-network architecture of our encoder-decoder model is shown in Figure 1.
Assume that we have generated $i-1$ sentences in an article and now generate the $i$-th sentence. Let $s^j = (w^j_t \mid t = 1, \ldots, n)$ be the $j$-th sentence in an article, where each $w^j_t$ is a word. We also use $w^j_t$ to denote the embedding of $w^j_t$. In the following, $W_*$ and $b_*$ denote weight matrices and bias terms among the model parameters, respectively.

Encoders
We employ three encoders for encoding financial indices, previously generated sentences, and topic labels. To generate the next sentence $s^i$, we first obtain three vectors from these encoders, $h^{\mathrm{market}}$, $h^{\mathrm{art}}_i$, and $h^{\mathrm{topic}}$, which encode the financial data, the previous sentences, and all topic labels of the article, respectively. Note that $h^{\mathrm{topic}}$ encodes the sequence of all topics in the article. We also obtain from the topic label encoder a vector encoding the topic labels of the current sentence, denoted by $\xi_i$.

Financial data encoder
As the encoder of financial data, we follow Murakami et al. (2017). We are given $L$ numerical sequences $x^1, \ldots, x^L$ of financial indicators (e.g., the Nikkei stock average and the foreign exchange rate of the Japanese yen). We use a different MLP for each indicator to convert its numerical sequence $x^l$ into a vector $m_l$. We then concatenate these vectors and feed them to a 3-layer MLP to obtain a single vector $h^{\mathrm{market}}$:
\[ h^{\mathrm{market}} = \mathrm{MLP}([m_1; \ldots; m_L]). \]
Note that $x^l$ consists of $x^l_{\mathrm{short}}$ and $x^l_{\mathrm{long}}$, which are short-term and long-term normalized data of $x^l$. $x^l_{\mathrm{short}}$ is composed of the previous prices within one trading day, and $x^l_{\mathrm{long}}$ is composed of the closing prices of the seven preceding trading days.
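The data flow of this encoder can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions that the text does not specify: the short-term and long-term sequences are joined by concatenation before the per-indicator MLP, `tanh` is used between layers, and weights are drawn from a small uniform range. All function names are hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Create one randomly initialized linear layer (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp(layers, x):
    """Run x through linear layers with tanh between them (activation is an assumption)."""
    for idx, (weights, bias) in enumerate(layers):
        x = linear(weights, bias, x)
        if idx < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def encode_market(indicators, per_indicator_mlps, top_mlp):
    """indicators: list of (x_short, x_long) pairs, one pair per indicator."""
    # A separate MLP per indicator maps [x_short; x_long] to a vector m_l.
    m = [mlp(net, x_short + x_long)
         for net, (x_short, x_long) in zip(per_indicator_mlps, indicators)]
    # Concatenate all m_l and feed them to a 3-layer MLP to obtain h_market.
    concatenated = [v for m_l in m for v in m_l]
    return m, mlp(top_mlp, concatenated)
```

The per-indicator vectors `m` are returned alongside the summary vector because the decoder later attends over them.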

Previous sentences encoder
The hidden state $h^{\mathrm{art}}_i$ representing the sentences preceding $s^i$ is obtained from the $p$ preceding sentences, $s^{i-p}, \ldots, s^{i-1}$.
We input the embeddings $w^{i-p}_1, \ldots, w^{i-p}_{|s^{i-p}|}, \ldots, w^{i-1}_1, \ldots, w^{i-1}_{|s^{i-1}|}$ to long short-term memory (LSTM) cells and obtain $h^{\mathrm{art}}_i$ by passing the final hidden state to a linear layer, where $|s^j|$ denotes the number of tokens in $s^j$.
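The input to this encoder is simply the token sequence of the $p$ preceding sentences laid end to end. A small sketch of that windowing step is given below; the function name is hypothetical, and clipping the window at the beginning of the article (when fewer than $p$ sentences exist) is an assumption, since the text does not say how that case is handled.

```python
def previous_tokens(sentences, i, p):
    """Return the tokens of the p sentences preceding sentence i (1-indexed)."""
    start = max(1, i - p)                  # clip at the first sentence (assumption)
    window = sentences[start - 1:i - 1]    # sentences s^{i-p}, ..., s^{i-1}
    return [token for sentence in window for token in sentence]
```

The resulting flat list would then be embedded and passed through the LSTM, whose final hidden state feeds the linear layer.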

Topic label encoder
The outputs of this encoder are $h^{\mathrm{topic}}$, which encodes the topic sequence of the article, and $\xi_i$, the embedding of the topics assigned to the target ($i$-th) sentence $s^i$. We encode the document-level topic sequence, rather than the sentence topics only, to make $\xi_i$ context-sensitive, so that the output sentence, conditioned on $\xi_i$, reflects the position of the sentence topics within the document-level topic sequence.
We use a bidirectional LSTM network (see the left part of Figure 1) to encode the sequence of document topics. As input, we first concatenate the topic labels of all sentences, terminating the labels of each sentence with a token </s>. Then $h^{\mathrm{topic}}$ is obtained from the outputs of this LSTM by concatenating the outputs at the end tokens of both directions. These are denoted by $\overrightarrow{\eta}_K$ and $\overleftarrow{\eta}_1$ in Figure 1, where $K = \sum_j \phi(j)$ is the length of the document-level topic sequence and $\phi(j)$ denotes the number of topic labels assigned to $s^j$, including the final label </s>.
We then extract the topic embedding $\xi_i$ corresponding to the topic labels of $s^i$. We do this by summing all LSTM outputs in the span corresponding to the current sentence, excluding </s>. Let $\theta^i_k$ be the $k$-th topic in $s^i$. The bi-LSTM introduced above transforms the input at the $k$-th position of the document-level sequence into a vector $\eta_k = [\overrightarrow{\eta}_k; \overleftarrow{\eta}_k]$ by concatenating the outputs of the LSTMs of both directions. Then $\xi_i$ is obtained by summing the outputs in the span, followed by a linear layer:
\[ \xi_i = W_\xi \sum_{k=\iota}^{\kappa} \eta_k + b_\xi, \]
in which $\iota$ and $\kappa$ are the indices corresponding to $\theta^i_1$ and $\theta^i_{\phi(i)-1}$, the start and end topic labels for the $i$-th sentence. Formally, $\iota = \sum_{j=1}^{i-1} \phi(j) + 1$ and $\kappa = \iota + \phi(i) - 2$.
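The index bookkeeping for the span of a sentence can be checked with a short sketch. It builds the document-level topic sequence, computes the 1-indexed start and end positions of the topic labels of sentence $i$ with </s> excluded, and sums the encoder outputs over that span. The function names are hypothetical, the encoder outputs are passed in as plain lists, and the final linear layer is omitted for brevity.

```python
END = "</s>"


def build_topic_sequence(topics_per_sentence):
    """Concatenate the topic labels of all sentences, ending each sentence with </s>."""
    sequence = []
    for topics in topics_per_sentence:
        sequence.extend(topics)
        sequence.append(END)
    return sequence


def topic_span(topics_per_sentence, i):
    """Return 1-indexed (start, end) positions of the topics of sentence i, excluding </s>."""
    phi = [len(topics) + 1 for topics in topics_per_sentence]  # phi(j) counts </s>
    start = sum(phi[:i - 1]) + 1
    end = start + phi[i - 1] - 2
    return start, end


def sum_span(outputs, start, end):
    """Sum the encoder output vectors at positions start..end (1-indexed, inclusive)."""
    span = outputs[start - 1:end]
    return [sum(values) for values in zip(*span)]
```

For three sentences with two, one, and three topic labels, the sequence has length 9 and the spans are positions 1 to 2, 4 to 4, and 6 to 8, so each </s> position is skipped.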

Decoder
Our decoder is another LSTM conditioned on the outputs of the three encoders introduced above. To initialize the decoder, we first concatenate the three encoder outputs and apply a linear layer:
\[ h^{\mathrm{dec}}_0 = W_{\mathrm{init}} [h^{\mathrm{market}}; h^{\mathrm{art}}_i; h^{\mathrm{topic}}] + b_{\mathrm{init}}. \]
Note that $h^{\mathrm{topic}}$ encodes the entire document-level topic sequence, not only the topics for the target sentence. To make the output sentence more relevant to those target topics, we feed $\xi_i$ to the input of the decoder at every step, by concatenating it with the original input vector $w^j_t$. While this allows the contents of the output sentence to follow the given local topics, the sentence should also reflect the information of the global topic sequence, e.g., the relative position of the target sentence in the article. A natural way to encode such context in the decoder is the attention mechanism (Luong et al., 2015), which we apply to the outputs $\eta_k$ of the topic label encoder, as well as to the outputs $m_l$ of the financial data encoder, to capture the important resources relevant to the current sentence.
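The two conditioning steps described here, initializing the decoder from the concatenated encoder outputs and appending the sentence-level topic embedding to every input, amount to simple vector concatenations. The sketch below shows them with plain lists; the function and parameter names are hypothetical, and the LSTM itself is left out.

```python
def decoder_init(h_market, h_art, h_topic, weights, bias):
    """Linear layer applied to the concatenated encoder outputs."""
    concatenated = h_market + h_art + h_topic  # list concatenation
    return [sum(w * v for w, v in zip(row, concatenated)) + b
            for row, b in zip(weights, bias)]


def step_input(word_embedding, xi):
    """Decoder input at one step: the word embedding joined with the topic embedding."""
    return word_embedding + xi
```

Feeding the topic embedding at every step, rather than only at initialization, keeps the local topics visible to the decoder throughout the sentence.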
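We obtain the context vectors $c^{\mathrm{market}}_t$ and $c^{\mathrm{topic}}_t$, which attend to the financial data and the topic labels, from the output $h^{\mathrm{dec}}_t$ of the decoder LSTM, and concatenate them before the softmax layer. The context vector $c^{\mathrm{market}}_t$ is obtained by bilinear attention (Luong et al., 2015):
\[ c^{\mathrm{market}}_t = \sum_{l=1}^{L} \alpha_{t,l}\, m_l, \qquad \alpha_{t,l} = \frac{\exp\big((h^{\mathrm{dec}}_t)^\top W_{\mathrm{market}}\, m_l\big)}{\sum_{l'=1}^{L} \exp\big((h^{\mathrm{dec}}_t)^\top W_{\mathrm{market}}\, m_{l'}\big)}. \]
The context vector $c^{\mathrm{topic}}_t$ is obtained similarly, by attending over the topic label encoder outputs $\eta_1, \ldots, \eta_K$.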
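A minimal sketch of bilinear attention in the sense of Luong et al. (2015) is shown below: each key is scored by `query^T W key`, the scores are normalized with a softmax, and the context vector is the weighted sum of the keys. The function names and the list-based representation are illustrative assumptions, and trained parameters are replaced by whatever matrix the caller supplies.

```python
import math


def bilinear_scores(query, matrix, keys):
    """Compute score(query, key) = query^T W key for each key."""
    projected = [[sum(w * k for w, k in zip(row, key)) for row in matrix] for key in keys]
    return [sum(q * p for q, p in zip(query, proj)) for proj in projected]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, matrix, keys):
    """Return the attention weights and the weighted sum of keys (the context vector)."""
    weights = softmax(bilinear_scores(query, matrix, keys))
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context
```

In the model described here, the same routine would be called twice per decoding step with the decoder output as the query: once with the financial data encoder outputs as keys and once with the topic label encoder outputs, each with its own weight matrix.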

Experimental Settings
Each example in our dataset is a pair of aligned time-series data and a corresponding document. We obtained the documents by retrieving daily summaries from NQN, which describe market trends in Japanese, and the aligned time-series data from Thomson Reuters DataScope Select. Dividing by period, we obtained 864, 122, and 124 documents (9,337, 1,215, and 1,237 sentences) for the train/valid/test sets, respectively. The vocabulary size was 3,025. As topic labels, we used 91 human-designed topic labels and 818 kinds of keywords extracted by the RAKE algorithm. We preprocessed each indicator following Aoki et al. (2018) and used the same parameters for the financial data encoder. Other parameters were tuned by document-level BLEU scores on the validation set.
We compared five different documents: a document written by a human writer (Gold), a document generated by our model without topic labels (NoTopicLabel), and three documents generated by our models using the following topic labels:
• HDTag3: Human-designed topic labels.
• HDTag1: Simplified human-designed topic labels (only the target of Table 1), to see the importance of the other factors.
• RAKE: Two keywords extracted by RAKE.
Table 3: Results of evaluation in terms of BLEU. Scores were averaged over 5 runs. The values after ± are the standard deviations. We report the BLEU scores averaged over all documents (BLEU (doc)) and over all sentences (BLEU (sent)).
We conducted both an automatic evaluation with word-level BLEU scores and a human evaluation. The human evaluation focused on the fluency, fidelity, and correctness of each approach. For the human evaluation, we sampled 15 instances from the test dataset. For each of the 15 instances, evaluators were presented with 5 documents: one written by a human writer (Gold) and four generated by NoTopicLabel, HDTag3, HDTag1, and RAKE, respectively. Note that NoTopicLabel does not use topic labels, while HDTag3, HDTag1, and RAKE use topic labels. The evaluators were asked to rate the documents on a 1-3 scale with respect to fidelity, correctness, and fluency. Fidelity measures whether each document reflects the given topic labels. Correctness measures whether each document is faithful to the given financial data. Fluency measures the fluency of each document without regard to input data. Since evaluating correctness is a complicated process that requires reference to the input numerical data, the evaluators were asked to evaluate only the sentences that satisfy the following two conditions: (i) the sentence starts with "Nikkei stock average" or "The exchange rate of the Japanese yen" (footnote 2), and (ii) the sentence is labeled with only [Nikkei 225/ Actuals/ Movement] or [JPY//Movement]. Additionally, the evaluators were asked to conduct a sentence-level evaluation of fluency, in which they were presented with 5 sentences produced by the 5 methods including Gold. All the evaluations were conducted by two evaluators, and we computed the average scores for each approach.
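The aggregation of the human ratings can be sketched as follows. This is only an illustration of averaging two evaluators' 1-3 ratings per method; the function name and the input layout are assumptions, and averaging the two evaluators per instance before averaging over instances is one possible reading of the procedure.

```python
from statistics import mean


def average_ratings(ratings):
    """ratings: {method: [(evaluator_a, evaluator_b), ...]} on a 1-3 scale."""
    averaged = {}
    for method, pairs in ratings.items():
        for a, b in pairs:
            if not (1 <= a <= 3 and 1 <= b <= 3):
                raise ValueError("ratings must be on the 1-3 scale")
        # Average the two evaluators per instance, then average over instances.
        averaged[method] = mean((a + b) / 2 for a, b in pairs)
    return averaged
```

Because every instance is rated by both evaluators, this gives the same value as pooling all ratings for a method and taking a single mean.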
Footnote 2: The original phrases are in Japanese; their English glosses are "Nikkei stock average" and "The exchange rate of the yen".

Results
Table 3 shows the BLEU scores of the different approaches. The models with label information (HDTag3, HDTag1, and RAKE) achieved higher BLEU scores than the model without it. RAKE achieved a slightly higher BLEU score than HDTag3, but the difference was not statistically significant. HDTag3 achieved a higher BLEU score than HDTag1. This result suggests that more informative topic labels improve the quality of the generated text. In other words, careful design of topic labels helps high-quality generation, although it requires more human cost. It is also encouraging that HDTag3 is comparable to RAKE, in spite of the fact that the labels in the latter are extracted from words in the reference.
The results of the human evaluation are shown in Table 4. There was no statistically significant difference among the sentence-level fluency scores of all methods. This means that the neural-network-based method is able to generate fluent text at least at the sentence level. The methods with topic-label information showed better document-level fluency than the one without topic labels.
Meanwhile, there was a significant difference between the document-level fluency scores of generated documents and human-written documents. We attribute this to the model not considering relationships among topic labels and only weakly considering previously generated sentences. Specifically, our model may repeat almost the same content as earlier sentences that are no longer treated as input, and it may generate different movement descriptions about the same indicator within a document. An example is shown in Table 5, where two sentences describing the same movement of the exchange rate state contradictory things: "dropped" and "rose". To solve these problems, adding memories that keep track of which topics have been mentioned, and how, is an interesting avenue for future work.
Besides, we observed that the correctness of RAKE is higher than that of the other models. This is not surprising, because the topic labels of RAKE are words in the target documents, and topic labels such as "continuously fall" or "rebound" directly convey the characteristics of the input data. In the comparison between the methods with human-designed labels, HDTag3 is superior to HDTag1.
Table 4: Results of human evaluation. Scores range in [1,3]. Fidelity measures whether each document reflects the given topic labels. Correctness measures whether each document is faithful to the given financial data. Fluency measures the fluency of each document or sentence without regard to input data. Fluency (doc) is the document-level fluency, while Fluency (sent) is the sentence-level fluency.
Moreover, both models with human-designed topic labels show higher fidelity, which means the generated documents reflect the given topic labels. We speculate that the lower fidelity of RAKE is caused by the ambiguity of extracted keywords as discussed through examples in the next paragraph.
We then provide a qualitative comparison of RAKE and human-designed topic labels. Table 6 shows some output examples. As mentioned above, we found that RAKE keywords are often more ambiguous than the human-designed topic labels. This is mainly because the granularity of keywords is not properly defined. Table 6(a) shows an example in which the RAKE keywords contain "high", which does not tell which quantity is high and results in wrong contents in the generated sentence. The human-designed topic labels have higher interpretability, and the sentences generated with such topic labels are well controlled.
The quantitative and qualitative evaluations above suggest that human-designed topic labels contribute to a better controllability backed up by high fidelity and interpretability.
Although our approach has the advantage of controllability in generating sentences, we also found a difficulty in topic design, in particular in defining the granularity of the topics. We found that the system often generates a wrong description when the topic labels contain a general label, such as Others and Events. These labels tend to be used as catch-all labels, resulting in diverse contents. Table 6(b) shows an example in which Others leads to a longer sentence with wrong contents.
To demonstrate that we can control the contents given the same financial data, in Table 6(c), we show how a generated sentence varies by giving topic labels that are different from the actual topic labels (HDTag3Unseen). We can see that a generated sentence properly changes its contents so that it reflects the new topic labels.
Table 6(a): Sentences generated by HDTag3, HDTag1, and RAKE, for which the topic labels by RAKE are ambiguous.

Conclusion
We proposed a data-to-document generator that can be controlled by a sequence of topic labels. We compared two kinds of topic labels, human-designed topic labels and automatically extracted keywords, and conducted experiments with a financial dataset. Our experiments empirically showed that the models using topic label information achieved higher performance in terms of BLEU and human evaluation. Furthermore, the model using the human-designed topic labels has an advantage in the controllability of the output documents without reducing BLEU scores. In addition, the experiments showed that the granularity of topic labels influences the generation quality.
As future work, we will employ network architectures with additional memories that keep track of which topics have been mentioned, and how, to achieve high topical coherence across sentences. In addition, future work should include reducing the inconsistency between a generated text and the actual movement of the input financial indicators, because even one conflict could be fatal to the reliability of the generated text.
Topic labels should also be easy to handle for the human users who actually use the system to generate a document. We also need to evaluate topic labels in terms of ease of use.