Reading Documents for Bayesian Online Change Point Detection

Modeling non-stationary time-series data for making predictions is a challenging but important task. One of the key issues is to identify long-term changes accurately in time-varying data. Bayesian Online Change Point Detection (BO-CPD) algorithms efficiently detect long-term changes without assuming the Markov property, which is vulnerable to local signal noise. We propose a Document-based BO-CPD (DBO-CPD) model which automatically detects long-term temporal changes of continuous variables based on a novel dynamic Bayesian analysis which combines a non-parametric regression, the Gaussian Process (GP), with generative models of texts such as news articles and posts on social networks. Since texts often include important clues about signal changes, DBO-CPD enables accurate prediction of long-term changes. We show that our algorithm outperforms existing BO-CPDs on two real-world datasets: stock prices and movie revenues.


Introduction
Time-series data depend on a latent dependence structure that changes over time. Thus, stationary parametric models are not appropriate for representing such dynamic non-stationary processes. Change point analysis (Smith, 1975; Stephens, 1994; Chib, 1998; Barry and Hartigan, 1993) focuses on formal frameworks for determining whether a change has taken place without assuming the Markov property, which is vulnerable to local signal noise. When change points are identified, each part of the time series is approximated by specified parametric models under stationarity assumptions. Such change point detection models have been applied successfully to a variety of data, such as stock markets (Chen and Gupta, 1997; Hsu, 1977; Koop and Potter, 2007), analyzing bees' behavior (Xuan and Murphy, 2007), forecasting climates (Chu and Zhao, 2004; Zhao and Chu, 2010), and physics experiments (von Toussaint, 2011). However, offline change point analysis suffers from slow retrospective inference, which prevents real-time analysis.
Bayesian Online Change Point Detection (BO-CPD) (Adams and MacKay, 2007; Steyvers and Brown, 2005; Osborne, 2010; Gu et al., 2013) overcomes this restriction by exploiting efficient online inference algorithms. BO-CPD algorithms efficiently detect long-term changes by analyzing continuous target values with the Gaussian Process (GP), a non-parametric regression method. The GP-based CPD model is simple and flexible. However, it is not straightforward to utilize rich external data such as texts in news articles and posts on social networks.
In this paper, we propose a novel BO-CPD model that improves the detection of change points in continuous signals by incorporating the rich external information implicitly written in texts on top of the long-term change analysis of the GP. In particular, our model finds causes of signal changes in news articles, which are influential sources for the markets of interest.
Given a set of news articles extracted from the Google News service and a sequence of continuous target values, our new model, Document-based Bayesian Online Change Point Detection (DBO-CPD), learns a generative model representing the probability of a news article given the run length (the length of consecutive observations without a change). Using this new prior, DBO-CPD models a dynamic hazard rate (h), which determines the rate at which change points occur.
In experiments, we show that DBO-CPD can effectively distinguish whether an abrupt change is a change point or not in real-world datasets (see Section 3.1). Compared to previous BO-CPD models, which explain changes through manual human mappings, DBO-CPD automatically explains why a change point has occurred by connecting the numerical data sequence to textual features of news articles.

Bayesian Online Change Point Detection
This section reviews our research problem, change point detection (CPD) (Barry and Hartigan, 1993), and the Bayesian Online Change Point Detection (BO-CPD) model (Adams and MacKay, 2007), and then presents our model, Document-based Bayesian Online Change Point Detection (DBO-CPD).
Let $x_t \in \mathbb{R}$ be a data observation at time $t$. We assume that a sequence of data $(x_1, x_2, \ldots, x_t)$ is composed of several non-overlapping product partitions (Barry and Hartigan, 1992). The boundaries that separate the partitions are called change points. Let $r$ be the random variable denoting the run length, the number of time steps since the last change point was detected; $r_t$ is the current run length at time $t$, and $x_t^{(r_t)}$ denotes the most recent data corresponding to the run $r_t$.

Online Recursive Detection
To make an optimal prediction of the next datum $x_{t+1}$, one needs to consider all possible run lengths $r_t \in \mathbb{N}$ together with a probability distribution over them. Given a sequence of data up to time $t$, $x_{1:t} = (x_1, x_2, \ldots, x_t)$, the run-length prediction problem is formalized as computing the predictive distribution by marginalizing over the run length:
$$P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} P(x_{t+1} \mid r_t, x_t^{(r_t)}) \, P(r_t \mid x_{1:t}).$$
The predictive distribution $P(x_{t+1} \mid r_t, x_t^{(r_t)})$ depends only on the most recent $r_t$ observations $x_t^{(r_t)}$. The posterior distribution of the run length, $P(r_t \mid x_{1:t})$, can be computed recursively:
$$P(r_t \mid x_{1:t}) = \frac{P(r_t, x_{1:t})}{P(x_{1:t})}, \qquad \text{where } P(x_{1:t}) = \sum_{r_t} P(r_t, x_{1:t}).$$
The joint distribution over run length $r_t$ and data $x_{1:t}$ can be derived by summing $P(r_t, r_{t-1}, x_{1:t})$ over $r_{t-1}$:
$$P(r_t, x_{1:t}) = \sum_{r_{t-1}} P(r_t \mid r_{t-1}) \, P(x_t \mid r_{t-1}, x_t^{(r_{t-1})}) \, P(r_{t-1}, x_{1:t-1}).$$
This formulation updates the posterior distribution of the run length by combining the prior over $r_t$ given $r_{t-1}$ with the predictive distribution of the new datum.
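This recursion can be sketched in a few lines of code, assuming the simplest constant hazard rate $h = 1/\lambda$ (as in Adams and MacKay, 2007) and treating the per-run-length predictive probabilities as an input computed elsewhere (e.g., by a GP); the function and variable names below are illustrative, not from the paper:

```python
import numpy as np

def bocpd_step(log_joint, pred_logprob, hazard):
    """One recursive update of the run-length posterior.

    log_joint[r]    : log P(r_{t-1} = r, x_{1:t-1})
    pred_logprob[r] : log P(x_t | r_{t-1} = r, most recent r observations)
    hazard          : constant change-point probability h = 1/lambda
    Returns the new log joint over r_t and the normalized log posterior.
    """
    # Growth: no change point, so each run length increases by one.
    log_growth = log_joint + pred_logprob + np.log(1.0 - hazard)
    # Change point: run length resets to zero; sum mass over all runs.
    log_cp = np.logaddexp.reduce(log_joint + pred_logprob + np.log(hazard))
    new_log_joint = np.concatenate(([log_cp], log_growth))
    # The posterior P(r_t | x_{1:t}) is the joint divided by the evidence.
    log_evidence = np.logaddexp.reduce(new_log_joint)
    return new_log_joint, new_log_joint - log_evidence
```

Working in log space with `logaddexp` keeps the recursion numerically stable when run-length probabilities become very small over long sequences.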
However, the existing BO-CPD model (Adams and MacKay, 2007) specifies the conditional prior on the change point, $P(r_t \mid r_{t-1})$, in advance. This approach may lead to biased predictions because the update formula relies heavily on the predefined, fixed hazard rate (h). Furthermore, BO-CPD is incapable of incorporating external information that implicitly influences the observations and explains the reasons for the current change in the long-term trend.

Document-based Bayesian Online Change Point Detection
This section explains our DBO-CPD model. To represent text documents, we add a variable $D$ denoting a series of text documents related to the observed data, as shown in Figure 1. Let $D_t$ be a set of $N_t$ text documents $D_t^1, D_t^2, \ldots, D_t^{N_t}$ indexed by their publication time $t$, where $N_t$ is the number of documents observed at time $t$. We can then rewrite the joint probability over the run length as
$$P(r_t, x_{1:t}, D_{1:t}) = \sum_{r_{t-1}} P(r_t \mid r_{t-1}, D_t^{(r_t)}) \, P(x_t \mid r_{t-1}, x_t^{(r_{t-1})}) \, P(r_{t-1}, x_{1:t-1}, D_{1:t-1}),$$
where $D_t^{(r_t)}$ is the set of the $r_t$ most recent documents. Figure 2 illustrates the recursive updates of the posterior probability, where solid lines indicate that the probability mass is passed upwards and dotted lines indicate the probability that the current run length $r_t$ is set to zero.
Given documents $D_t^{(r_t)}$, the conditional prior is expressed in terms of $P_{\text{gap}}$, the distribution of intervals between consecutive change points. As in the BO-CPD model (Adams and MacKay, 2007), we assume the simplest case, in which the probability of a change point at every step is constant; the length of a segment is then modeled by a discrete exponential (geometric) distribution,
$$P_{\text{gap}}(g) = \frac{1}{\lambda}\left(1 - \frac{1}{\lambda}\right)^{g-1},$$
where $\lambda > 0$ is the rate parameter of the distribution. The update rule for the prior distribution on $r_t$ makes the computation of the joint distribution tractable, $\sum_{\gamma} P(r_{t-1}=\gamma, D_t \mid r_t=\gamma) \cdot P_{\text{gap}}(\gamma)$. Because $r_t$ can only be increased to $\gamma + 1$ or set to $0$, the conditional probability factorizes over the run-length parameter $r_t$. In this setting, the conditional probability of the words $P(D_t^{(\gamma)} \mid r_t)$ is represented by two generative models, $\phi_{wf}$ and $\phi_{wi}$, which model word frequency and word impact, respectively. The key intuition behind word frequency is that a word tends to be close to a change point if it has frequently appeared in articles published when there was a rapid change.
The key intuition behind word impact is how much information a word loses over time, which is discussed in the next section. In our paper, we use an unnormalized beta distribution over the word weights to represent this exponential decay. The probability $P(D_t^{(\gamma)} \mid r_t = \gamma + 1)$ can be represented recursively, as in Equation (9).
Here, $\phi_{wi}(d_t^{i,j} \mid \gamma)$ and $\phi_{wf}(d_t^{i,j} \mid \gamma)$ are empirical potentials that together represent $P(d_t^{i,j} \mid \gamma)$; $\phi_{wi}(\cdot)$ is explained in Section 2.3, and $\text{count}(E)$ denotes the number of times event $E$ appears in the dataset. In Equation (9), $\tau_t$ is the time gap (difference) between $t$ and the time when a document was generated, and $d^{i,j}$ represents a document without considering the time domain.
$T_D(t, \gamma \mid 0)$ is expressed in terms of the hazard function $H(\tau)$ (Forbes et al., 2011); Equation (9) shows how this quantity is calculated and how it determines whether a change occurs or not. Given the same data, BO-CPD always gives the same answer to the question of whether an abrupt change at time $t$ is a change point. DBO-CPD, in contrast, uses the documents $D_t^{(\gamma)}$ in its prediction, incorporating external information that cannot be inferred from the data alone.
When P gap is the discrete exponential distribution, the hazard function is constant at H(τ ) = 1/λ (Adams and MacKay, 2007).
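This constant-hazard property can be checked numerically in a few lines; the parameterization of $P_{\text{gap}}$ by its mean $\lambda$ below is an assumption chosen to be consistent with $H(\tau) = 1/\lambda$:

```python
def geometric_pmf(g, lam):
    """P_gap(g) = (1/lam) * (1 - 1/lam)**(g - 1): a discrete exponential
    (geometric) gap distribution whose mean segment length is lam."""
    return (1.0 / lam) * (1.0 - 1.0 / lam) ** (g - 1)

def hazard(g, lam):
    """H(g) = P(gap = g) / P(gap >= g): the probability that the current
    run ends at step g, given that it has survived g - 1 steps."""
    survival = (1.0 - 1.0 / lam) ** (g - 1)  # P(gap >= g)
    return geometric_pmf(g, lam) / survival
```

For any $g$, the ratio collapses to $1/\lambda$, which is why a geometric $P_{\text{gap}}$ (memoryless in discrete time) yields the constant hazard used above.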
As an illustrative example, suppose that we observed a rapid change in Google's stock three days ago. Today, at $t = 3$, we want to know how the articles are written and whether they will affect the change tomorrow ($t = 4$). As shown in Figure 3, we can calculate to what degree a word, for example rises or stays, is likely to appear in articles published from today on, which is $P(D_t^{(\gamma)} \mid r_t = \gamma + 1)$, and this probability lets us predict run lengths from the texts. Documents for each $\tau_t = 0, 1$, and $2$ are generated from the generative models with a given predicted run length through the recursive calculation of the Bayesian models, which enables online prediction as shown in Equation (9). This ability to infer change points accurately using the information in text documents is the main contribution of this paper.

Generative Models Trained from Regression
Let $D \in \mathbb{R}^{T \times N \times M}$ be $N$ documents of news articles over a vocabulary of size $M$ and a time domain $T$. $D_t^i \in \mathbb{R}^M$ is the $i$th document of the set of documents generated at time $t$, and define $r \in \mathbb{R}^N$ as the corresponding set of run lengths, where each run length is the time gap between when a document is generated and when the next change point occurs. Then, given a text document $D_t^i$, we seek to predict the run length $r$ by learning a parameterized function $f$, where $w \in \mathbb{R}^d$ are the weights of the text features $d_t^{i,1}, d_t^{i,2}, \ldots, d_t^{i,M}$ that compose document $D_t^i$. From a collection of $N$ documents, we use a linear regression model trained by solving the following optimization problem:
$$\min_{w} \; r(w) + C \sum_{i=1}^{N} \xi(w, D_t^i, r_t),$$
where $r(w)$ is the regularization term and $\xi(w, D_t^i, r_t)$ is the loss function. The parameter $C > 0$ is a user-specified constant balancing $r(w)$ against the sum of losses.
Let $h$ be a function mapping a document into a vector-space representation in $\mathbb{R}^d$. In linear regression, the function $f$ takes the form
$$f(D_t^i) = w^{\top} h(D_t^i) + \epsilon,$$
where $\epsilon$ is Gaussian noise. Figure 4 illustrates how we train a linear regression model on a sample article. One issue is that the run length cannot be trained directly. Suppose we train on $r_5 = 0$ directly; the weights $w$ of the model will be pushed toward $0$ even though the set of words contained in $D_5^j$, $\forall j \in \{1, \ldots, T\}$, is composed of salient words that can signal a possible future change point. To solve this interpretability problem, we train the weights in the inverse exponential domain of the predicted variable, predicting $e^{-r_t}$ instead of $r_t$. In this setting, the predicted run length takes the form
$$\hat{r}_t = -\log\!\left(w^{\top} h(D_t^i)\right).$$
With this method, the regression model can give a high weight to a word that often appears close to change points, and we can interpret highly weighted words as more closely related to an outbreak of changes than lower-weighted words.
With $w$, we can rewrite the probability of $d$ and $\tau_t$ given $w$, and the potential $\phi_{wi}$ can also be represented recursively, since for a given word $d$, $\tau_{t+1} = \tau_t + 1$ holds.
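The inverse-exponential regression idea above can be sketched as follows, assuming an L2-regularized least-squares (ridge) loss and raw word-count features; both are simplifying assumptions, since the paper leaves the exact choices of $r(w)$, $\xi$, and $h$ open:

```python
import numpy as np

def fit_word_weights(X, run_lengths, reg=1.0):
    """Ridge regression of y = exp(-r) on a bag-of-words count matrix X.

    Words occurring in documents published close to a change point
    (small r, so y near 1) receive large positive weights.
    X           : (n_docs, vocab_size) count matrix
    run_lengths : (n_docs,) time gaps to the next change point
    reg         : L2 regularization strength (plays the role of the C trade-off)
    """
    y = np.exp(-np.asarray(run_lengths, dtype=float))
    d = X.shape[1]
    # Closed-form ridge solution: w = (X^T X + reg * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

def predict_run_length(x, w):
    """Invert the transform: r_hat = -log(w . x), clipped for stability."""
    y_hat = np.clip(x @ w, 1e-6, 1.0)
    return -np.log(y_hat)
```

On toy data, words seen only in documents near change points end up with much larger weights than words seen only far from them, which is exactly the interpretability the transform is meant to buy.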

Experiments
We now describe experiments with DBO-CPD on two real-world datasets: stock prices and movie revenues. The first case is the historical end-of-day stock prices of five information technology corporations. In the second dataset, we examine daily film revenues averaged by the number of theaters.

Datasets
In the stock price dataset, we gather data for five different companies: Apple (AAPL), Google (GOOG), IBM (IBM), Microsoft (MSFT), and Facebook (FB). These companies were selected because they were the top 5 ranked in market value in 2015.
We chose these technology companies because announcements of new IT products and features tend to draw high interest from public media. The second dataset is a set of movie revenues averaged by the number of theaters for five months from each film's release date; we target five different films. News articles are collected from Google News, and we use Google search queries to extract articles related to each dataset in a specific time period. During online article crawling, we store not only the titles of the articles, the HTML documents, and the publication dates, but also the number of related articles. The number of articles is used to differentiate the weights of news articles during regression training. For the stock price data, we use two different queries to reduce noise: first, we search with the company name, such as 'Google'; then, we add the stock-specific query 'NASDAQ:' to make the content of the articles highly relevant to the stock market. For the movie data, we search with the movie title plus the additional word 'movie' to collect only articles related to the target movie.

Textual Feature Representation
After extracting text from the HTML documents, we tokenize the text into words and apply three preprocessing steps: downcasing the characters, removing punctuation, and removing English stop words. Table 1 shows the statistics of the corpora of collected news articles.
With these article corpora, we use a bag-of-words (BoW) representation in which the words from the articles are indexed and then weighted to turn each document into a vector. Using these vectors, we adopt three document representations that extend the BoW representation: TF, TFIDF, and LOG1P. TF and TFIDF (Sparck Jones, 1972) calculate the importance of a word to a set of documents based on term frequency, while LOG1P (Kogan et al., 2009) calculates the logarithm of the word frequencies.
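The three weightings can be illustrated on a toy tokenized corpus; the smoothing choices in the IDF term below are assumptions, since the paper does not specify them:

```python
import numpy as np
from collections import Counter

def build_counts(docs, vocab):
    """Raw term-count matrix for tokenized documents over a fixed vocabulary."""
    idx = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for d, doc in enumerate(docs):
        for word, c in Counter(doc).items():
            if word in idx:
                X[d, idx[word]] = c
    return X

def tf(X):
    """Term frequency: counts normalized by document length."""
    return X / np.maximum(X.sum(axis=1, keepdims=True), 1)

def tfidf(X):
    """TF weighted by a (smoothed) inverse document frequency."""
    df = (X > 0).sum(axis=0)
    idf = np.log(X.shape[0] / np.maximum(df, 1)) + 1.0
    return tf(X) * idf

def log1p(X):
    """LOG1P: logarithm of one plus the raw word frequency."""
    return np.log1p(X)
```

The effect is easy to see on two short documents: a term appearing in every document is downweighted by TFIDF relative to a rarer term with the same raw frequency.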

Training BO-CPD
As noted earlier, we use BO-CPD to train the regression model so that it learns high weights for words that are more related to changes. When choosing the parameters of the Gaussian Process for BO-CPD, we look for values that make the intervals between predicted change points around one to two weeks. This is because we assume that the information in an article has an immediate effect on the data right after it is published, so the external information in texts indicates short-term causes of a future change.
For a fair comparison of BO-CPD and DBO-CPD, we use the same Gaussian Process parameters in both models. After several experiments, we found that $a = 1$ and $b = 1$ for the Gaussian Process and $\lambda_{\text{gap}} = 250$ are appropriate for training BO-CPD on the stock and film datasets. We separate the training and testing examples for cross-validation at a ratio of 2:1 for each year and then train each model separately by year.

Learning the strength parameter w from Regression
The weight $w$ of the regression model gives us an indication of how important a word is in affecting the length of the current run. With the predicted run length calculated in Section 3.3, we map the run-length domain $r \in \mathbb{R}$ into $0 \le \hat{r} \le 1$ by predicting $e^{-r_t}$ rather than $r_t$, which also resolves the interpretability problem. Therefore, we can read a high weight $w_i$ as marking a powerful word that resets the current run length $r$ to $0$. To keep the weights on a common scale, we normalize them by rescaling their range into $w \in [-1, 1]$. With the word representations computed in Section 3.2, we train the regression model using the number of relevant articles as the importance weight of each training example.
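The rescaling step might be sketched as follows; scaling by the largest absolute weight is an assumed choice, since the paper does not state the exact normalization used:

```python
import numpy as np

def rescale_weights(w):
    """Rescale regression weights into [-1, 1] by the largest magnitude,
    preserving sign so that negatively weighted words stay negative."""
    m = np.max(np.abs(w))
    return w / m if m > 0 else w
```

This keeps the relative ordering of word importances intact while making weights from different yearly models comparable on a common scale.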

Results
We evaluate the performance of BO-CPD and DBO-CPD by comparing the negative log likelihood (NLL) (Turner et al., 2009) of the two models at time $t$: $-\log p(x_t \mid x_{1:t-1}, w)$.
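Given the per-step predictive log probabilities from either model, the marginal NLL used for comparison can be sketched as follows (a lower value indicates a better model):

```python
import numpy as np

def marginal_nll(stepwise_logprobs):
    """Sum of per-step negative log predictive likelihoods
    -log p(x_t | x_{1:t-1}) over a test sequence; lower is better."""
    return -np.sum(stepwise_logprobs)
```

A model that assigns higher predictive probability to the observed sequence at each step necessarily attains a smaller marginal NLL.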
We calculate the marginal NLL by year, and the results are described in Table 2 and Table 3. In the example shown, the NLL of DBO-CPD (19.1964) is smaller than that of BO-CPD (19.3438). The two zoomed intervals are the two longest intervals in which the negative log likelihood of DBO-CPD is smaller than that of BO-CPD. The accompanying table shows the sentences whose run lengths, as predicted by the regression model (described in Section 2.3), are the highest at the two zoomed points, which means these sentences are likely to appear near future change points. The boldfaced words are the top five most strongly weighted terms in the regression model.
The improvement of DBO-CPD over BO-CPD is statistically significant at 90% confidence for four of the stocks; Facebook is the exception. We also found that DBO-CPD II shows better results than DBO-CPD I and BO-CPD in most datasets, owing to the noise reduction in texts achieved by the additional search query 'NASDAQ:'. Out of 23 datasets, AAPL in 2010 and FB in 2012 are the only datasets where the NLL of BO-CPD is smaller (better) than the NLL of DBO-CPD.
One of the advantages of using a linear model is that we can investigate what the model discovers about different terms. As shown in Figure 5, negative semantic words such as vicious, whip, and desperately, and words representing the status of a company, such as propel, innovations, and grateful, are among the most strongly weighted terms in the regression model. We analyze and visualize some change points where the NLL of DBO-CPD is lower than the NLL of BO-CPD. The results are shown in Figure 6: the three sentences are the top three most heavily weighted sentences in the regression model for two changes, with the top five most strongly weighted terms, such as big, money, and steadily, shown in boldface. A particularly interesting case is the term earth, found between Jan. 25 and Feb. 13 in 2013. After investigating the articles containing this sentence, we found that Google announced a new tour-guide feature in Google Earth on Jan. 31, and after this announcement the stock price increased. We can also see that the word million is a positive term that can predict a new change in the near future.

Conclusions
In this paper, we propose a novel generative model for online inference that finds change points in non-stationary time-series data. Unlike previous approaches, our model can incorporate external information from texts, which may include the causes of signal changes. The main contribution of this paper is to combine a generative model for online change point detection with a regression model learned from the weights of words in documents. As a result, our model accurately infers the conditional prior over change points and automatically explains the reasons for a change by connecting the numerical data sequence to textual features of news articles.

Future work
Our DBO-CPD can be improved further by incorporating external information beyond documents. In principle, DBO-CPD can incorporate other features as long as they are vectorized into a matrix form. Our current implementation uses only simple bag-of-words models (TF, TFIDF, and LOG1P) to improve the baseline GP-based CPD models by bringing documents into change point detection. One direction for future work is to represent the rich information in texts more fully by extending the text features and language representations, for example with continuous bag-of-words (CBOW) models (Mikolov et al., 2013) or Global Vectors for word representation (GloVe) (Pennington et al., 2014).