Learning to Generate Market Comments from Stock Prices

This paper presents a novel encoder-decoder model for automatically generating market comments from stock prices. The model first encodes both short- and long-term series of stock prices so that it can mention short- and long-term changes in stock prices. In the decoding phase, our model can also generate a numerical value by selecting an appropriate arithmetic operation, such as subtraction or rounding, and applying it to the input stock prices. Empirical experiments show that our best model generates market comments with fluency and informativeness approaching those of human-generated reference texts.


Introduction
Various industries such as finance, pharmaceuticals, and telecommunications have been increasingly providing opportunities to treat various types of large-scale numerical time-series data. Such data are hard for non-specialists to interpret in detail and time-consuming even for specialists to construe. As a result, there has been a growing interest in automatically generating concise descriptions of such data, i.e., data summarization. This interest in data summarization is encouraged by the recent development of neural network-based text generation methods. Given an appropriate architecture, a neural network can generate a sentence that is mostly grammatical and semantically reasonable.
In this study, we focus on the task of generating market comments from a time-series of stock prices. We adopt an encoder-decoder model (Sutskever et al., 2014) and exploit its capability to learn to capture the behavior of the input and generate a description of it. Although encoder-decoder models can learn to do this, they need to be provided with an appropriate network architecture and the necessary information. We use Figure 1 to illustrate the characteristic problems of comment generation for time-series of stock prices. The figure shows the Nikkei Stock Average (Nikkei 225, or simply Nikkei), which is a stock market index calculated from 225 selected issues, over several consecutive trading days, accompanied by the market comments made at specific time points in the span. The first problem is that market comments do not merely describe the increase and decrease of the price. They also often describe how the price changes compared with the previous period, such as "continues to fall" in (3) of Figure 1, "turns to rise" in (2), and "rebound" in (6). Market comments sometimes describe the change in price compared with the prices in the previous week. The second problem is that market comments also contain expressions that depend on their delivery time: e.g., "opens with" in (1), "closing price of the morning session" in (3), and "beginning of the afternoon session" in (4). The third problem is that market comments typically contain numerical values, which often cannot be copied from the input prices. A standard decoder is unlikely to generate such numerical values in the way it generates other words. This difficulty is analogous to that of generating named entities with encoder-decoder models. To derive such values, the model needs arithmetic operations such as subtraction, as in examples (3) and (6), which mention the difference in price, and rounding, as in example (5).

[Figure 1: The Nikkei Stock Average over consecutive trading days with market comments (1) to (6), e.g., (3) 11:30 "Nikkei continues to fall. The closing price of the morning session decreases by 5 yen to 19,386 yen."]
To address these problems, we present a novel encoder-decoder model to automatically generate market comments from stock prices. To address the first problem of capturing various types of change in different time scales, the model first encodes data consisting of both short- and long-term time-series, where a multi-layer perceptron, a recurrent neural network, or a convolutional network is adopted as a basic encoder. To address the second problem, we feed the delivery time of the market comment to our model in the decoding phase so that it can generate expressions that depend on the time of day. To address the third problem regarding numerical values mentioned in the generated text, we allow our model to choose an arithmetic operation such as subtraction or rounding instead of generating a word.
The proposed methods are evaluated on the task of generating Japanese market comments on the Nikkei Stock Average. Automatic evaluation with the BLEU score (Papineni et al., 2002) and the F-score of time-dependent expressions reveals that our model significantly outperforms a baseline encoder-decoder model. Furthermore, human assessment and error analysis show that our best model generates the characteristic expressions discussed above almost perfectly, approaching the fluency and informativeness of human-generated market comments.

Related Work
The task of generating descriptions from time-series or structured data has been tackled in various domains such as weather forecasts (Belz, 2007; Angeli et al., 2010), healthcare (Portet et al., 2009; Banaee et al., 2013b), and sports (Liang et al., 2009). Traditionally, many studies used hand-crafted rules (Goldberg et al., 1994; Dale et al., 2003; Reiter et al., 2005). On the other hand, interest has recently been growing in automatically learning a correspondence from data to text and generating a description on that basis, since large-scale data in diversified formats have become easy to acquire. In fact, data-driven approaches have been studied extensively for various tasks such as image caption generation (Vinyals et al., 2015) and weather forecast generation (Mei et al., 2016b).
The task, called data-to-text or concept-to-text, is generally divided into two subtasks: content selection and surface realization. Whereas previous studies tackled the subtasks separately (Barzilay and Lapata, 2005; Wong and Mooney, 2007; Lu et al., 2009), recent work has focused on solving them jointly using a single framework (Chen and Mooney, 2008; Kim and Mooney, 2010; Angeli et al., 2010; Lapata, 2012, 2013).
More recently, there has been some work on an encoder-decoder model (Sutskever et al., 2014) for generating a description from time-series or structured data to solve the subtasks jointly in a single framework, and this model has proven useful (Mei et al., 2016b; Lebret et al., 2016). However, the task of generating a description from numerical time-series data presents difficulties such as the second and third problems mentioned in Section 1. For the second problem, the model needs to be fed with information on delivery time. For the third problem, the model needs arithmetic operations such as subtraction, because a copy mechanism (Gu et al., 2016; Gulcehre et al., 2016) alone cannot derive a calculated value such as those in (3), (5), or (6) of Figure 1 from the input. Thus, in this work, we tackle these problems and develop a model based on the encoder-decoder model that can mention a specific numerical value, either by referring to the input data or by producing a value computed from it, and that can generate time-dependent expressions by incorporating delivery-time information into its decoder.
There has also been some work on generating market comments. Kukich (1983) developed a system consisting of rule-based components for generating stock reports from a database of daily stock quotes. Whereas she used several individual components and had to define a number of rules for the generation, our encoder-decoder model can perform the task with fewer and simpler rules, which are needed only for the calculation. Aoki and Kobayashi (2016) developed a method based on a weighted bi-gram language model for automatically describing trends of time-series data such as the Nikkei Stock Average. However, they did not attempt to refer to specific numerical values such as closing prices and amounts of rises in price, although such descriptions are often used in market comments, as shown in Figure 1 (3), (5), and (6). In contrast, we present a novel approach to generating natural language descriptions of time-series data that can not only describe trends of the data but also mention specific numerical values by referring to the time-series data.

Generating Market Comments
To generate market comments on stock prices, we introduce an encoder-decoder model. Encoder-decoder models have been widely used and proven useful in various tasks of natural language generation such as machine translation (Cho et al., 2014) and text summarization (Rush et al., 2015). Our task is similar to these tasks in that the system takes sequential data and generates text. Therefore, it is natural to use an encoder-decoder model in modeling stock prices. Figure 2 illustrates our model. In describing time-series data, the model is expected to capture various types of change and important values in the given sequence, such as absolute or relative changes and maximum or minimum values, in different time scales. Moreover, it is necessary to generate time-dependent comments and numerical values that require arithmetic operations for derivation, such as "The closing price of the morning session decreases by 5 yen...". To achieve these, we present three strategies that alter the standard encoder-decoder model. First (Section 3.1), we use several encoding methods for time-series data, as in (1) of Figure 2, to capture the changes and important values. Second (Section 3.2), we incorporate delivery-time information into the decoder, as in (2) of Figure 2, to generate time-dependent comments. For the decoder, we use a recurrent neural network language model (RNNLM) (Mikolov et al., 2010), which is widely used in language generation tasks. Finally (Section 3.3), we extend the decoder to estimate arithmetic operations, as in (3) of Figure 2.

Encoding Numerical Time-Series Data
We prepare short- and long-term data using the five-minute chart of Nikkei 225. A vector for short-term data consists of the prices of one trading day and has N elements. We denote it as $x_{\mathrm{short}} = (x_{\mathrm{short},i})_{i=0}^{N-1}$. On the other hand, a vector for long-term data consists of the closing prices of the M preceding trading days. It is denoted as $x_{\mathrm{long}} = (x_{\mathrm{long},i})_{i=0}^{M-1}$.

Data are commonly preprocessed to remove noise and enhance the generalizability of a model (Zhang and Qi, 2005; Banaee et al., 2013a). We use two preprocessing methods: standardization and moving reference. Standardization substitutes each element $x_i$ of input $x$ by

$$x_i^{\mathrm{std}} = \frac{x_i - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the values in the training data, respectively. Standardized values are less affected by scale. The second method, moving reference (Freitas et al., 2009), substitutes each element $x_i$ of input $x$ by

$$x_i^{\mathrm{move}} = x_i - r_i,$$

where $r_i$ is the closing price of the previous trading day of $x$. This is introduced to capture price fluctuations from the previous day.

By applying one of the preprocessing methods to $x_{\mathrm{short}}$ and $x_{\mathrm{long}}$, we obtain two vectors of preprocessed values, $l_{\mathrm{short}}$ and $l_{\mathrm{long}}$. Given these, each encoder emits the corresponding hidden states $h_{\mathrm{short}}$ and $h_{\mathrm{long}}$. After obtaining the hidden states, we concatenate the two vectors of the preprocessed values and the outputs of the encoders as a multi-level representation of the input time-series data. The multi-level representation is an approach developed by Mei et al. (2016a) that enables the decoder to take into account both the high-level representation, e.g., $h_{\mathrm{short}}$, $h_{\mathrm{long}}$, and the low-level representation, e.g., $l_{\mathrm{short}}$, $l_{\mathrm{long}}$, at the same time. They have shown that it improves performance in terms of selecting salient objects in input data. We thus set the initial hidden state $s_0$ of the decoder as

$$s_0 = h_{\mathrm{short}} \oplus h_{\mathrm{long}} \oplus l_{\mathrm{short}} \oplus l_{\mathrm{long}},$$

where $\oplus$ is the concatenation operator. When we use both preprocessing methods, we have four preprocessed input vectors: $l_{\mathrm{short}}^{\mathrm{move}}$, $l_{\mathrm{short}}^{\mathrm{std}}$, $l_{\mathrm{long}}^{\mathrm{move}}$, and $l_{\mathrm{long}}^{\mathrm{std}}$. In this case, we introduce four encoders and set the initial hidden state $s_0$ of the decoder as

$$s_0 = h_{\mathrm{short}}^{\mathrm{move}} \oplus h_{\mathrm{short}}^{\mathrm{std}} \oplus h_{\mathrm{long}}^{\mathrm{move}} \oplus h_{\mathrm{long}}^{\mathrm{std}} \oplus l_{\mathrm{short}}^{\mathrm{move}} \oplus l_{\mathrm{short}}^{\mathrm{std}} \oplus l_{\mathrm{long}}^{\mathrm{move}} \oplus l_{\mathrm{long}}^{\mathrm{std}}.$$

Since several encoding methods can be used for the time-series data, we use one of three conventional neural networks: a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), or a Recurrent Neural Network (RNN) with Long Short-Term Memory cells (Hochreiter and Schmidhuber, 1997). In the experiments, we empirically evaluate and compare the encoding methods.
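The preprocessing and concatenation steps described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy encoder, the price values, and the concatenation order are assumptions made only for illustration, and a trained MLP, CNN, or RNN would take the place of `toy_encoder`.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def standardize(x: Sequence[float], mu: float, sigma: float) -> Vector:
    """Standardization: (x_i - mu) / sigma, with mu and sigma taken from training data."""
    return [(v - mu) / sigma for v in x]


def moving_reference(x: Sequence[float], refs: Sequence[float]) -> Vector:
    """Moving reference: x_i - r_i, where r_i is a previous trading day's closing price."""
    return [v - r for v, r in zip(x, refs)]


def multi_level(encoders: Sequence[Callable[[Vector], Vector]],
                inputs: Sequence[Vector]) -> Vector:
    """Concatenate the encoder outputs (h) followed by the preprocessed inputs (l)."""
    hidden = [value for enc, l in zip(encoders, inputs) for value in enc(l)]
    low = [value for l in inputs for value in l]
    return hidden + low


def toy_encoder(l: Vector) -> Vector:
    """Hypothetical stand-in for a trained encoder: returns (mean, last value)."""
    return [sum(l) / len(l), l[-1]]


# Hypothetical toy inputs (real inputs would have N = 62 and M = 7 elements).
x_short = [14600.0, 14550.0, 14508.0]
x_long = [14700.0, 14650.0, 14612.0]

# For intraday prices, every element shares the same reference: the last close.
l_short = moving_reference(x_short, [x_long[-1]] * len(x_short))
l_long = standardize(x_long, mu=14650.0, sigma=50.0)

# Initial decoder state as the concatenation h_short, h_long, l_short, l_long.
s0 = multi_level([toy_encoder, toy_encoder], [l_short, l_long])
```

Keeping the low-level vectors next to the encoder outputs means the decoder still has direct access to the preprocessed prices, which is the point of the multi-level representation.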

Incorporating Time Embedding
Even if identical sequences of values are observed, comments usually vary in accordance with price history or the time they are observed. For instance, when the market opens, comments usually mention how much the stock price has increased or decreased compared with the closing price of the previous trading day, as in (1) and (3) in Figure 1.
Our model creates vectors called time embedding vectors T on the basis of the time when the comment is delivered (e.g., 9:00 a.m. or 3:00 p.m.). A time embedding vector is then added to each hidden state $s_j$ in decoding so that words are generated depending on time. This mechanism is inspired by speaker embedding, which was introduced for an encoder-decoder conversational agent that inherits the characteristics of a speaker, such as his or her manner of speaking. In that approach, speaker-specific information (e.g., dialect, age, and gender) is encoded into speaker embedding vectors, which are then used in decoding.
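As a rough illustration of this mechanism, the sketch below adds a delivery-time vector to a decoder hidden state. It assumes elementwise addition between vectors of equal size, which may differ from the actual implementation (for instance, a projection or concatenation may be involved when the embedding and hidden sizes differ). The table `TIME_EMBEDDING`, its values, and the function name are hypothetical; in a trained model the embeddings would be learned parameters.

```python
from typing import Dict, List

# Hypothetical, hand-set embeddings for two delivery times (toy size 4).
TIME_EMBEDDING: Dict[str, List[float]] = {
    "09:00": [0.25, 0.0, 0.0, 0.0],
    "15:00": [0.0, 0.0, 0.0, 0.25],
}


def add_time_embedding(s_j: List[float], delivery_time: str) -> List[float]:
    """Return the hidden state shifted by the embedding of the delivery time."""
    t = TIME_EMBEDDING[delivery_time]
    return [s + e for s, e in zip(s_j, t)]


s_j = [0.5, -0.5, 0.0, 1.0]               # toy decoder hidden state
opening = add_time_embedding(s_j, "09:00")
closing = add_time_embedding(s_j, "15:00")
```

The same hidden state yields different vectors for different delivery times, so the output layer can prefer "opens with" at 9:00 and closing-related phrases at 15:00 even when the price series are identical.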

Estimation of Arithmetic Operations
Text generation systems based on language models such as RNNLM often generate erroneous words for named entities; that is, they often mention a similar but incorrect entity, e.g., Nissan for Toyota. To overcome this problem, Gulcehre et al. (2016) developed a text generation method called the copy mechanism. The method uses an attention mechanism to copy rare words that are missing from the vocabulary from a given sequence of words and emits the copied words.
Market comments often mention numerical values that appear in the input data, but they also mention values obtained through arithmetic operations, such as differences in prices, as in (3) and (6) of Figure 1, and rounded values, as in (5). Thus, another problem arises: what type of operation is suitable for the text to be generated? In this work, we solve this problem by extending the idea of the copy mechanism.
To enable our model to generate text with values calculated from input values, we add generalization tags to the vocabulary used in the model. Each generalization tag represents a type of arithmetic operation. When a generalization tag is emitted, the model performs the operation on the designated values in accordance with the tag, replaces the tag with the calculated value, and finally outputs text containing numerical values. For preprocessing, we replace each numerical value appearing in the market comments in the training data with a generalization tag such as <price1>. The tag for a numerical value depends on what the value stands for in the text.

Tag          Arithmetic operation
<price1>     Return ∆
<price2>     Round down ∆ to the nearest 10
<price3>     Round down ∆ to the nearest 100
<price4>     Round up ∆ to the nearest 10
<price5>     Round up ∆ to the nearest 100
<price6>     Return z as it is
<price7>     Round down z to the nearest 100
<price8>     Round down z to the nearest 1,000
<price9>     Round down z to the nearest 10,000
<price10>    Round up z to the nearest 100
<price11>    Round up z to the nearest 1,000
<price12>    Round up z to the nearest 10,000

Table 1: Generalization tags and corresponding arithmetic operations. Here z and ∆ stand for the latest price and the difference between z and the closing price of the previous trading day, respectively.

Consider, for example, a comment in which the values 227 and 16,610 appear. Since this comment omits the phrase "than the closing price of the previous day", 227 in this example indicates the difference between the closing price of the previous trading day $x_{\mathrm{long},M-1}$ and the latest price $x_{\mathrm{short},N-1}$, which is denoted by z in Table 1. Therefore, we replace 227 with the tag <price1>. Likewise, we replace 16,610 with <price6> because it represents the latest price z. To find the optimal tag for each value, we try all the types of operations listed in Table 1 on the values appearing in the text, i.e., 227 and 16,610 in this case. Then, we select the tag whose operation yields the value closest to the original one.
In prediction, the model first generates a tentative comment, which includes tags as well as words. Suppose that the input vectors are $x_{\mathrm{short}}$ and $x_{\mathrm{long}}$, with $x_{\mathrm{short},N-1} = 14508$ and $x_{\mathrm{long},M-1} = 14612$, and that the model generates the comment below:

(b) Nikkei opens turning down. The loss exceeds <price2> yen, and it falls to the <price7> yen level.

Since the tag <price2> represents "the difference between $x_{\mathrm{short},N-1}$ and $x_{\mathrm{long},M-1}$ rounded down to the nearest 10", the difference of 104 is rounded down and we replace the tag with 100. Similarly, we replace <price7>, which is "the latest price $x_{\mathrm{short},N-1}$ rounded down to the nearest 100", with 14,500. Finally, we have a market comment containing the numbers as below:

(c) Nikkei opens turning down. The loss exceeds 100 yen, and it falls to the 14,500 yen level.
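The tagging and tag-replacement steps of this section can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, prices are treated as integers, the difference ∆ is assumed to be unsigned, ties between equally close tags are broken by table order, and the previous close of 16,383 used in the usage check is an assumed value chosen so that the difference is 227.

```python
import re
from typing import Callable, Dict


def round_down(value: int, unit: int) -> int:
    """Round a non-negative integer down to a multiple of unit."""
    return (value // unit) * unit


def round_up(value: int, unit: int) -> int:
    """Round a non-negative integer up to a multiple of unit."""
    return -((-value) // unit) * unit


# Table 1 as functions of z (latest price) and delta (difference from previous close).
OPERATIONS: Dict[str, Callable[[int, int], int]] = {
    "<price1>": lambda z, delta: delta,
    "<price2>": lambda z, delta: round_down(delta, 10),
    "<price3>": lambda z, delta: round_down(delta, 100),
    "<price4>": lambda z, delta: round_up(delta, 10),
    "<price5>": lambda z, delta: round_up(delta, 100),
    "<price6>": lambda z, delta: z,
    "<price7>": lambda z, delta: round_down(z, 100),
    "<price8>": lambda z, delta: round_down(z, 1000),
    "<price9>": lambda z, delta: round_down(z, 10000),
    "<price10>": lambda z, delta: round_up(z, 100),
    "<price11>": lambda z, delta: round_up(z, 1000),
    "<price12>": lambda z, delta: round_up(z, 10000),
}


def select_tag(value: int, z: int, prev_close: int) -> str:
    """Training-time preprocessing: pick the tag whose result is closest to value."""
    delta = abs(z - prev_close)  # assumption: the difference is reported unsigned
    return min(OPERATIONS, key=lambda tag: abs(OPERATIONS[tag](z, delta) - value))


def fill_tags(comment: str, z: int, prev_close: int) -> str:
    """Prediction-time postprocessing: replace each tag with its computed value."""
    delta = abs(z - prev_close)
    return re.sub(r"<price\d+>",
                  lambda m: f"{OPERATIONS[m.group(0)](z, delta):,}",
                  comment)


# Example (b) from the text, with the latest price 14508 and previous close 14612.
tentative = ("Nikkei opens turning down. The loss exceeds <price2> yen, "
             "and it falls to the <price7> yen level.")
final = fill_tags(tentative, z=14508, prev_close=14612)
```

With these inputs, `final` reproduces comment (c), and `select_tag(227, z=16610, prev_close=16383)` picks `<price1>` under the assumed previous close.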

Experimental Settings
We used the five-minute chart of Nikkei 225 from March 2013 to October 2016 as numerical time-series data, which were collected from IBI-Square Stocks, and 7,351 descriptions as market comments, which are written in Japanese and provided by Nikkei QUICK News. We divided the dataset into three parts: 5,880 for training, 730 for validation, and 741 for testing. For the human evaluation, we randomly selected 100 comments and their time-series data from the test set. We set N = 62, which is the number of time steps for stock prices for one trading day, and M = 7, which is the number of time steps for closing prices of the preceding trading days. We used Adam (Kingma and Ba, 2015) for optimization with a learning rate of 0.001 and a mini-batch size of 100. The dimensions of word embeddings, time embeddings, and hidden states for both the encoder and decoder are set to 128, 64, and 256, respectively. For CNN, we used a single convolutional layer and set the filter size to 3.
In the experiments, we conducted three types of evaluation: two automatic evaluations and one human evaluation. For the first automatic evaluation, we used BLEU (Papineni et al., 2002) to measure the degree of matching between the market comments written by humans, used as references, and the comments generated by our model. We applied paired bootstrap resampling (Koehn, 2004) for a significance test. For the second automatic evaluation, we calculated F-measures for time-dependent expressions, using market comments written by humans as references, to investigate whether our model can correctly output time-dependent expressions such as "open with" and expressions such as "continual fall", which describe how the price changes compared with the previous period by referring to the series of preceding prices. Specifically, we calculated F-measures for the 13 expressions shown in Figure 3.
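The F-measure for a single expression can be computed along the following lines. The matching criterion is not spelled out in the text, so this sketch assumes that an expression counts as correct when it occurs, as a substring, in both the generated comment and its paired reference. The function name and the example comments are hypothetical.

```python
from typing import Sequence, Tuple


def expression_f_measure(expression: str,
                         generated: Sequence[str],
                         references: Sequence[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one expression over paired comments."""
    tp = fp = fn = 0
    for gen, ref in zip(generated, references):
        in_gen = expression in gen
        in_ref = expression in ref
        if in_gen and in_ref:
            tp += 1          # expression generated and present in the reference
        elif in_gen:
            fp += 1          # expression generated but absent from the reference
        elif in_ref:
            fn += 1          # expression missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy data: three generated comments paired with their references.
generated = ["Nikkei opens with a fall.",
             "Nikkei continues to fall.",
             "Nikkei opens with a gain."]
references = ["Nikkei opens with a fall.",
              "Nikkei opens with a rebound.",
              "Nikkei turns to rise."]
scores = expression_f_measure("opens with", generated, references)
```

In this toy case the expression is matched once, generated spuriously once, and missed once, so precision, recall, and F1 are all 0.5.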
For the human evaluation, we recruited a specialist in financial engineering as a judge to evaluate the quality of generated market comments. To evaluate the difference in quality between comments generated by our models and those written by humans, we showed both system-generated and human-generated market comments together with their time-series data consisting of $x_{\mathrm{short}}$ and $x_{\mathrm{long}}$, without letting the judge know which comment was generated by which method. We asked the judge to give each market comment two scores: one for informativeness and one for fluency. Both scores have two levels, 0 or 1, where 1 indicates high informativeness or fluency. For informativeness, the judge used both the generated comments and their input stock prices to rate the comments. Specifically, if the judge deemed that a generated comment properly described an important price movement or an outline of the movement, the comment was considered informative. For fluency, the judge read only the generated comments and rated them in terms of readability, regardless of their content.
In addition, since some of the market comments written by humans include external information, as in "Nikkei opens with a continual fall as yen pressures exporters", we also asked the judge to ignore the correctness of external information mentioned in comments. This keeps the comparison fair, because external information cannot be retrieved from the time-series data.
To assess the effectiveness of the techniques we introduced, we conducted experiments with 11 models. Table 2 shows an overview of the models we compared. We compared three types of models: a baseline, full models (e.g., mlp-enc), and ablated models (e.g., -short). For example, -short is a model that does not use the short-term time series. Table 3 shows the BLEU scores on the test set. Figure 3 presents the F-measure of the models for each phrase. We also present output examples with human-generated market comments (Human) for reference in Figure 4. In the results for the automatic evaluation in BLEU, the model using MLPs as encoders together with all the techniques we developed, mlp-enc, outperformed the baseline and the other models. The BLEU scores and F-measure values revealed differences among the models using MLP, CNN, or RNN (mlp-enc, cnn-enc, rnn-enc). In the comparison between the models that took the two types of time-series data $x_{\mathrm{short}}$ and $x_{\mathrm{long}}$ as input (e.g., mlp-enc or rnn-enc) and the models that used only one of them (-short, -long), the models using both types of data gained higher BLEU scores than -short and -long. Also, the models that encoded the two types of time-series data to capture their short- and long-term changes correctly output more expressions that described the changes, such as "turn to rise", "continue to fall", and "rebound", than -short and -long, as shown in Figure 3.

Table 3: BLEU scores on the test set. Differences between the best model, mlp-enc, and other models are statistically significant at p < 0.05.

[Figure 4 (excerpt): Nikkei heikin, han-patsu zen-bike wa 81 en daka no <unk> en ("Nikkei rebounds. The closing price of the morning session is <unk> yen, which is 81 yen higher.")]

Results
According to the comparison between preprocessing methods, mlp-enc, which used both standardization and moving reference as preprocessing methods, obtained a higher BLEU score than the models that omitted one of the two (-std, -move). In terms of the F-measure values, mlp-enc output phrases mentioning changes more appropriately and therefore achieved higher values than the other two models, as for "turn to rise" and "turn to fall" in Figure 3. Furthermore, we found that the BLEU score of -multi, which did not use the multi-level representation of the data, was inferior. In other words, incorporating the multi-level representation along with the output of an encoder into the decoder seems to contribute to improving the automatic evaluation scores and producing a better representation of the input data. The models baseline and -num output numerical values as "words" from the vocabulary of the RNNLM because these models do not use any arithmetic operation. Therefore, there were many cases in which <unk> was output where a numerical value should have been, as shown in Figure 4 (a). We found that -num had a lower BLEU score than models such as mlp-enc and -std that used arithmetic operations. Furthermore, we observed that the models with arithmetic operations correctly generated stock prices in most cases.

Figure 5: BLEU scores of market comments generated by models for each size of training data on the validation set.
By comparing -time, which did not incorporate time-embeddings into a decoder, and other models such as mlp-enc with respect to the F-measure of expressions depending on delivery time (e.g., "open with" or "closing session"), we found that the models that took time information into account, such as mlp-enc, generated those phrases more accurately than -time.
Moreover, we analyzed the effect of different sizes of training data. Figure 5 shows BLEU scores of market comments generated by our models for each size of training data on the validation set. According to the results, the BLEU scores of the models saturated at around 3,000 training examples. In addition, there was not much difference in convergence speed among the models.
The human evaluation results in Table 4 indicate that market comments generated by our model (mlp-enc) achieved a quality comparable even to that of market comments written by humans. Moreover, we found that mlp-enc significantly outperformed baseline in terms of informativeness but was outperformed by baseline in terms of fluency. The reason was that mlp-enc occasionally generated a market comment such as "Nikkei gains more than 0 yen" because of an error in the prediction of the operation, and the judge considered such comments to be neither fluent nor informative, although most of the comments generated by mlp-enc were as fluent as those of baseline. Note that baseline does not generate expressions like "0 yen" because they are not normally used in market comments and so are not included in the vocabulary. Therefore, the judge considered all the comments generated by baseline to be fluent. As another possible enhancement, the model should take into account the period over which a difference or gain is measured. For example, our current model sometimes generated a market comment such as "Nikkei gains more than 200 yen", although Nikkei actually gained more than 300 yen. Such a comment is not incorrect but is imprecise. Therefore, we consider that a mechanism for selecting the period to be mentioned when the model generates a comment is needed to address this problem and to increase the generalizability of our model for generating descriptions from various time-series data.

Conclusion and Future Work
In this study, we presented a novel encoder-decoder model to automatically generate market comments from numerical time-series data of stock prices, using the Nikkei Stock Average as an example. Descriptions of numerical time-series data written by humans, such as market comments, have several writing-style characteristics. For example, (1) content to be mentioned in the market comments varies depending on short- or long-term changes of the time-series data, (2) expressions depending on the delivery time at which the text is written are used, and (3) numerical values obtained through arithmetic operations applied to the input data are often described. We developed approaches for generating comments that have these characteristics and showed the effectiveness of the proposed model.
In future work, we plan to apply our model to descriptions of time-series data in various domains such as weather forecasts and sports, which share the above writing-style characteristics. We also plan to use multiple time-series as input, such as the prices of multiple stocks.