Using Twitter Language to Predict the Real Estate Market

We explore whether social media can provide a window into community real estate (foreclosure rates and price changes) beyond that of traditional economic and demographic variables. We find that language use in Twitter not only predicts real estate outcomes as well as traditional variables across counties, but that including Twitter language in traditional models leads to a significant improvement (e.g. from Pearson r = .50 to r = .59 for price changes). We overcome the challenge of the relative sparsity and noise in Twitter language variables by showing that training on the residual error of the traditional models leads to more accurate overall assessments. Finally, we discover that it is Twitter language related to business (e.g. 'company', 'marketing') and technology (e.g. 'technology', 'internet'), among others, that yields predictive power over economics.


Introduction
The massive amount of text provided by users of social media like Facebook and Twitter gives researchers the opportunity to investigate topics that were not previously tangible. Specifically, the study of economic outcomes has been turning to social media data in order to capture non-traditional factors like consumer mood. For instance, researchers have attempted to predict the stock market by measuring mood from Twitter feeds (Bollen et al., 2011), used Twitter data to measure socio-economic indicators and financial markets (Mao, 2015), shown correlation of consumer confidence with sentiment word frequencies in Twitter messages over time (O'Connor et al., 2010), and predicted movie revenue using social media and text mining (Asur and Huberman, 2010; Joshi et al., 2010; Yu et al., 2012).
Here, we attempt to leverage social media to understand another economic phenomenon, real estate. Our goal is to determine whether language from Twitter can predict real-estate foreclosure rates and price changes, cross-sectionally across counties, beyond that of traditional economic variables. We suspect this is possible because a community's language in social media may capture economic-related community characteristics that are not otherwise easily available. However, the challenge is incorporating noisy high-dimensional language features in such a way that they can contribute beyond the robust low-dimensional traditional predictors (i.e. demographics, median income, education rates, unemployment rates).
The contributions of this paper follow. First, we show that county real estate market outcomes can be predicted from language in social media beyond traditional factors. Second, we address the challenge of effectively leveraging multi-modal feature types (i.e. socioeconomic variables, which are individually very predictive (Nguyen, 2016); and social media linguistic features, which are individually noisy) by demonstrating that a 2-step residualized control approach to learning a predictive model leads to more accuracy than jointly learning all feature parameters at once. This represents the first work to investigate the use of language in Twitter to predict real estate related outcomes: foreclosure and increased-price rates.

Related Work
Much of the research on prediction of housing markets has focused on economic conditions. For instance, others have found strong relationships between housing prices and the stock market (Gyourko and Keim, 1992; Case et al., 2005), credit and income (Ortalo-Magne and Rady, 2006), past market prices (Ghysels et al., Except Kaplanski et al. (2012), who looked at daylight hours, few have ventured beyond direct economic factors as predictors of real estate outcomes. Our belief is that language analyses in social media can offer predictive value beyond that of economics in that they capture aspects of people's daily life that are not traditionally available to economists.
While exploiting social media language has not been studied in the real estate domain, the use of language predictors has been increasing for other economic-related applications, such as measuring public health through analysis of social media messages (Paul and Dredze, 2011; Eichstaedt et al., 2015; Culotta, 2014), predicting the stock market from text in social media (Bollen et al., 2011; Zhang et al., 2011; Tsolacos, 2012), and predicting political behaviour from tweets (DiGrazia et al., 2013). Perhaps the most similar work to ours used manually selected keywords in Google searches to predict the overall US housing market (Wu and Brynjolfsson, 2013). Still, while Google has allowed researchers to tap into one aspect of the online world, search data is only available at specific scales, and relying on manually chosen keywords can restrict predictive performance (Schwartz et al., 2013). We leverage open-vocabulary features (i.e. not based on manual keyword lists) and attempt to predict real-estate outcomes at the level of US counties.

Language Model
We learn a model from the Twitter language of US counties to predict real estate outcomes. We extract community language features from tweets and then learn models for the cross-county prediction task, handling both traditional predictors and linguistic predictors. We focus on two outcomes per county, foreclosure and increased-price rates (zillow website, 2016), and consider a wide variety of traditional socioeconomic and demographic predictors for comparison. Specifically, socioeconomic variables include median income, unemployment rate, and percentage with bachelor's degrees, while demographic variables include median age; percentage female, black, hispanic, foreign born, and married; and population density. All variables were obtained from the US Census (census bureau, 2010), and we henceforth refer to them collectively as controls.

Features
We build feature vectors from the raw tweets by extracting 1-, 2-, and 3-grams as well as mentions of 2,000 LDA topics, based on posteriors we downloaded that were previously estimated from social media (Schwartz et al., 2013). Features were limited to those mentioned by at least 25% of counties, leaving us with 13,359 1- to 3-grams and all 2,000 topics.
Since there are only 1,347 counties to which we plan to apply the model (data described in the Evaluation section) but tens of thousands of predictors, we use feature selection and dimensionality reduction to avoid overfitting. We limit ourselves to features with at least a small linear relationship to the outcome, having a family-wise error alpha of 200 (Efron, 2012). Then, we perform randomized principal components analysis (RPCA), an approximate PCA based on stochastic re-sampling (Rokhlin et al., 2009).
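The n-gram extraction and county-coverage filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: whitespace tokenization, relative-frequency normalization, the function names, and the toy data are all assumptions made here.

```python
from collections import Counter

def extract_ngrams(tokens, max_n=3):
    """Return all 1- to max_n-grams of a token list as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def county_features(tweets_by_county, min_county_frac=0.25):
    """Build per-county n-gram relative frequencies, keeping only n-grams
    that appear in at least `min_county_frac` of the counties."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    counts = {
        county: Counter(g for tweet in tweets
                        for g in extract_ngrams(tweet.lower().split()))
        for county, tweets in tweets_by_county.items()
    }
    # Number of counties in which each n-gram occurs at least once.
    coverage = Counter(g for ctr in counts.values() for g in ctr)
    keep = {g for g, k in coverage.items() if k / len(counts) >= min_county_frac}
    features = {}
    for county, ctr in counts.items():
        total = sum(ctr.values()) or 1  # guard against counties with no tokens
        features[county] = {g: ctr[g] / total for g in keep if g in ctr}
    return features

# Hypothetical toy data, not taken from the paper.
toy = {
    "county_a": ["new tech company", "tech jobs"],
    "county_b": ["tech company hiring"],
    "county_c": ["rainy day"],
    "county_d": ["sunny day"],
}
feats = county_features(toy, min_county_frac=0.5)
print(sorted(feats["county_a"]))
```

With a threshold of 0.5 on four toy counties, only n-grams used in at least two counties survive; the paper's threshold of 25% corresponds to `min_county_frac=0.25`.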

Learning
We learn four different models: (1) a control model using the socioeconomic & demographic variables, (2) a language model using only tweet-derived features, (3) a combined model using both socioeconomics & demographics and language in a single model, and (4) a language over residualized control model fitting language to the residual error of the control model. With the control model as our baseline, we investigate whether language alone (model 2) or adding language to the control model (models 3 and 4) increases accuracy. All models except the 4th are learned via L2 penalized ("ridge") regression (Goeman et al., 2016). [2]
[2] For the control model, which has few features by comparison, the ridge penalty is essentially zero and standard multivariate linear regression produces comparable results.

Residualized Control Approach
In order to effectively exploit Twitter language in our model, we suspect that we need to treat the language features (which are numerous, noisy, more biased, and non-normal) differently than the control variables (which are few, mostly unbiased, and mostly normal). In other words, simply combining the two may lead to losing the importance of the controls amongst the numerous features. [3] As depicted in Figure 1, we build a language model over the residual error of the control model, allowing independent consideration of the two sets of features and different penalties. More specifically, the training phase consists of three steps: (1) train a model using the socioeconomics & demographics, which is the control model, as in Eq.1, (2) calculate the training errors and consider this error as our new label, described in Eq.2, and (3) train a language model over this new data, which is shown in Eq.3. In the end, our model is depicted in Eq.4. In these equations, α and γ are the coefficients of the control features and language features, and β and λ are the intercepts. For testing purposes, we feed each data point to both the control model and the language model, and then report the sum of their predictions as the final predicted label.
[3] In fact, our results show such a combined model performs only marginally better than a language-alone model.
$$\Rightarrow\; y = \alpha X_{\text{control}} + \gamma X_{\text{language}} + (\lambda + \beta) \tag{4}$$
The resulting model, a combination of the control model and language model, is still an affine model w.r.t. the language and control features. Thus, it is possible that ridge regression over all the features at once could give us the same result (i.e. hyperplane). However, since we suspect that each socioeconomic and demographic feature is more informative and less noisy than the Twitter features, we explore this two-stage learning procedure in order to bias our model toward favoring the role of socioeconomics & demographics over language features.
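One form of Eq. 1 to Eq. 3 that is consistent with Eq. 4 and with the symbol definitions given in the text can be sketched as follows. The residual symbol ε is an assumption introduced here for illustration, and the equalities in Eq. 1 and Eq. 3 should be read as fitted approximations rather than exact identities.

```latex
\begin{align}
y &= \alpha X_{\text{control}} + \beta + \epsilon \tag{1} \\
\epsilon &= y - (\alpha X_{\text{control}} + \beta) \tag{2} \\
\epsilon &= \gamma X_{\text{language}} + \lambda \tag{3}
\end{align}
```

Under this reading, substituting Eq. 3 into Eq. 1 gives the sum of the two fitted models, with the two intercepts collected into the single constant (λ + β) of Eq. 4.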

Evaluation
Here we evaluate the power of Twitter language to predict cross-county real-estate outcomes compared to demographic and socioeconomic factors.

Data Set
We use three sources of data: a language dataset from Twitter messages, a control dataset of socioeconomic and demographic variables, and an outcome dataset of housing-related data. Our language data was derived from Twitter's 1% random stream collected from 2011 to 2013 and included 131 million tweets that were mapped to 1,347 counties based on their self-reported location, following the procedure of Eichstaedt et al. (2015). Our control data included the previously mentioned socioeconomic and demographic variables, which were obtained from 2010 US Census data (census bureau, 2010). This dataset is only collected every 10 years, so the 2010 US Census is the most recent dataset for all of the socioeconomic and demographic variables at the county level. As outcomes, our real estate data, including the foreclosure rate (the number of homes, per 10,000 homes sold, that were foreclosed) and the increased-price rate (the percentage of homes with values that have increased in the past year), were downloaded from Zillow and cover 2011 to 2013 (zillow website, 2016). Considering all these data sets, we end up with 427 counties having foreclosure rate outcome data and 717 counties having increased-price rate data.
[Table caption fragment: ... combining the language and the control features into a single model. Bold indicates significant improvement (p < 0.05) over the combined model.]
Table 1 reports the effect of building a language model over the residual of socioeconomics, demographics, and socioeconomics & demographics by comparing them with the control models. All of the results were produced by 10-fold cross-validation. We see a significant improvement from exploiting language (p < 0.05 according to a paired t-test) above and beyond socioeconomic and demographic factors for both outcomes: foreclosures (from r = .37 to r = .42) and increased price (from r = .50 to r = .59). This suggests that language on Twitter does, in fact, capture information about a community that is not captured by the traditional predictors.
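The evaluation protocol mentioned above (10-fold cross-validation scored with Pearson r) can be sketched in a few lines. This is a hedged illustration only: the fold assignment rule, the pooling of out-of-fold predictions before computing r, and the toy one-feature model are assumptions made here, since the text does not specify those details.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def kfold_indices(n_items, n_folds=10):
    """Yield (train_idx, test_idx) pairs; item i is held out in fold i % n_folds."""
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test

def cross_validated_r(fit, predict, X, y, n_folds=10):
    """Collect out-of-fold predictions, then correlate them with the labels.
    Assumption: predictions are pooled across folds before computing r."""
    preds = [None] * len(y)
    for train, test in kfold_indices(len(y), n_folds):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = predict(model, X[i])
    return pearson_r(preds, y)

# Hypothetical one-feature least-squares model used only to exercise the loop.
def fit_line(X, y):
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    slope = (sum((x - mx) * (t - my) for x, t in zip(xs, y))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict_line(model, row):
    slope, intercept = model
    return slope * row[0] + intercept

X = [[float(i)] for i in range(20)]
y = [2.0 * i + 1.0 for i in range(20)]
r = cross_validated_r(fit_line, predict_line, X, y, n_folds=10)
print(round(r, 6))
```

Any model exposing a `fit` and `predict` pair can be passed in, so the same loop could score the control, language, combined, and residualized models on identical folds.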

Results
We next explored whether building a language model using the residualized control approach performs better than a model combining control and language features in a single learning step. Results are in Table 2, showing that building a language model over the residual performs significantly better than a combined model for both of the outcomes. In fact, the gap is .10 in Pearson r for increased price. Further, it also appears possible that the combined feature model could perform worse than the control model in some cases, presumably because the controls are lost when being fit with the language. In a sense, the residualized control approach utilizes a prior that each socioeconomic and demographic feature is more informative than a single word and should thus receive a different penalty parameter or be fit independently. It is worth noting that this method is applicable to many different learning algorithms (e.g. SVM, deep convolutional net).
As mentioned previously, one limitation of the traditional predictors is that many are only available every 10 years as part of the US Census. We primarily focused on Twitter data that was a couple of years removed from the last census, which may explain the improvement. Thus, we also ran an experiment using the Twitter data from Schwartz et al. (2013), which spans 2009 to 2010, and found similar results: the residualized control approach improved the Pearson r for 'increased price' from 0.36 to 0.44 and for 'foreclosure' from 0.65 to 0.69. Thus, the improvements provided by the residualized control approach do not appear to be due to the fact that Twitter data are newer than control data.
We have shown that Twitter language adds predictive information about the real estate market beyond that of traditional socioeconomic predictors. So, just what exactly are tweets capturing that socioeconomics are not? To address this, we ran a differential language analysis to identify the top 50 most predictive features (independently) of increased price, the outcome on which we performed best. Figure 2 shows the results controlled for socioeconomic and location features (US state indicator), limited to those passing a Benjamini-Hochberg false discovery rate alpha of 0.01 (Benjamini and Hochberg, 1995). We see that, although each displayed n-gram was predictive beyond socioeconomics, many of them suggest a more nuanced economic characterization of a community (e.g. 'technology', 'media', 'internet', and 'marketing'), suggesting avenues of future exploration for better understanding the housing market.
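The two-stage fit discussed in this paragraph can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names, the penalty values, and the toy data are hypothetical, and the ridge solver uses the textbook normal equations with an unpenalized intercept.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ridge(X, y, penalty):
    """Ridge regression with an unpenalized intercept; returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    yc = [t - y_mean for t in y]
    # Normal equations on centered data: (Xc^T Xc + penalty * I) w = Xc^T yc
    A = [[sum(r[j] * r[k] for r in Xc) + (penalty if j == k else 0.0)
          for k in range(d)] for j in range(d)]
    b = [sum(r[j] * t for r, t in zip(Xc, yc)) for j in range(d)]
    w = solve(A, b)
    return w, y_mean - sum(wj * mj for wj, mj in zip(w, means))

def predict(model, row):
    w, intercept = model
    return sum(wj * xj for wj, xj in zip(w, row)) + intercept

def fit_residualized(X_control, X_language, y,
                     control_penalty=0.0, language_penalty=1.0):
    """Stage 1: fit controls. Stage 2: fit language to the stage-1 residuals."""
    control = fit_ridge(X_control, y, control_penalty)
    residuals = [t - predict(control, row) for row, t in zip(X_control, y)]
    language = fit_ridge(X_language, residuals, language_penalty)
    return control, language

def predict_residualized(models, control_row, language_row):
    """Final prediction is the sum of the two models' predictions."""
    control, language = models
    return predict(control, control_row) + predict(language, language_row)

# Hypothetical toy data: y = 3 * control + 0.5 * language + 2.
Xc = [[0.0], [1.0], [2.0], [3.0], [0.0], [1.0], [2.0], [3.0]]
Xl = [[1.0], [1.0], [1.0], [1.0], [-1.0], [-1.0], [-1.0], [-1.0]]
y = [3 * c[0] + 0.5 * l[0] + 2 for c, l in zip(Xc, Xl)]
models = fit_residualized(Xc, Xl, y)
print(predict_residualized(models, [2.0], [1.0]))
```

Because each stage has its own penalty, the control coefficients are fit without competing against the language features, while the language weights are shrunk separately; a single joint ridge fit would apply one penalty to both groups.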
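The Benjamini-Hochberg filter used to select the displayed features can be sketched as follows. The function name and the example p-values are hypothetical; only the standard step-up procedure and the alpha of 0.01 come from the text.

```python
def benjamini_hochberg(p_values, alpha=0.01):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis whose sorted rank is at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for four n-gram features.
p = [0.001, 0.2, 0.004, 0.03]
print(benjamini_hochberg(p, alpha=0.01))
```

In a differential language analysis, each p-value would come from testing one feature's association with the outcome given the controls, and only features marked True would be displayed.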

Conclusion
While the real estate market of a community is believed to be affected by many factors, traditionally only coarse economic and demographic variables have been accessible at scale to market researchers and forecasters. Here, we explored the predictive power of language for the real estate market as compared to traditional predictors, showing that language in Twitter is predictive of foreclosure rates and price increases and that a residualized control approach to combining language features with traditional variables can lead to more accurate models. We believe this can open the door to a more nuanced and precise understanding of the real-estate market.