Automatic Domain Adaptation Outperforms Manual Domain Adaptation for Predicting Financial Outcomes

In this paper, we automatically create sentiment dictionaries for predicting financial outcomes. We compare three approaches: (i) manual adaptation of the domain-general dictionary H4N, (ii) automatic adaptation of H4N and (iii) a combination consisting of first manual, then automatic adaptation. In our experiments, we demonstrate that the automatically adapted sentiment dictionary outperforms the previous state of the art in predicting the financial outcomes excess return and volatility. In particular, automatic adaptation performs better than manual adaptation. In our analysis, we find that annotation based on an expert’s a priori belief about a word’s meaning can be incorrect – annotation should be performed based on the word’s contexts in the target domain instead.


Introduction
Since 1934, the U.S. Securities and Exchange Commission (SEC) has mandated that public companies disclose information in the form of public filings to ensure that adequate information is available to investors. One such filing is the 10-K, the company's annual report. It contains financial statements and information about business strategy, risk factors and legal issues. For this reason, 10-Ks are an important source of information in the field of finance and accounting.
A common method employed by finance and accounting researchers is to evaluate the "tone" of a text based on the Harvard Psychosociological Dictionary, specifically, on the Harvard-IV-4 TagNeg (H4N) word list (http://www.wjh.harvard.edu/~inquirer). However, as its name suggests, this dictionary is from a domain that is different from finance, so many words (e.g., "liability", "tax") that are labeled as negative in H4N are in fact not negative in finance. In a pioneering study, Loughran and Mcdonald (2011) manually reclassified the words in H4N for the financial domain. They applied the resulting dictionaries to 10-Ks and predicted financial variables such as excess return and volatility. We will refer to the sentiment dictionaries created by Loughran and Mcdonald (2011) as L&M.
In this work, we also create sentiment dictionaries for the finance domain, but we adapt them from the domain-general H4N dictionary automatically. We first learn word embeddings from a corpus of 10-Ks and then reclassify the H4N words, using SVMs trained on H4N labels, as negative vs. non-negative. We refer to the resulting domain-adapted dictionary as H4N RE.
In our experiments, we demonstrate that the automatically adapted financial sentiment dictionary H4N RE performs better at predicting excess return and volatility than the dictionaries of Loughran and Mcdonald (2011) and Theil et al. (2018).
We make the following contributions.
(i) We demonstrate that automatic domain adaptation performs better at predicting financial outcomes than previous work based on manual domain adaptation. (ii) We perform an analysis of the differences between the classifications of L&M and those of our sentiment dictionary H4N RE that sheds light on the superior performance of H4N RE. For example, H4N RE is much smaller than L&M, consisting mostly of frequent words, suggesting H4N RE is more robust and less prone to overfitting. (iii) In a further detailed analysis, we investigate words classified by L&M as negative, litigious and uncertain that our embedding classifier classifies otherwise; and common (i.e., non-negative) words from H4N that L&M did not include in the categories negative, litigious and uncertain, but that our embedding classifier classifies as belonging to these classes. Our analysis suggests that manual adaptation of dictionaries is error-prone if annotators are not given access to corpus contexts.
Our paper primarily addresses a finance application. In empirical finance, a correct sentiment classification decision is not sufficient: the decision must also be interpretable and statistically sound. That is why we use ordinary least squares (OLS), an established method in empirical finance, and sentiment dictionaries. Models based on sentiment dictionaries are transparent and interpretable: by looking at the dictionary words occurring in a document we can trace the classification decision back to the original data and, e.g., understand the cause of a classification error. OLS is a well-understood statistical method that allows the analysis of significance, effect size and dependence between predictor variables, inter alia.
While we focus on finance here, three important lessons of our work also apply to many other domains.
(1) An increasing number of applications require interpretable analysis; e.g., the European Union mandates that systems used for sensitive applications provide explanations of decisions. Decisions based on a solid statistical foundation are more likely to be trusted than those by black boxes. (2) Many NLP applications are domain-specific and require domain-specific resources including lexicons. Should such lexicons be built manually from scratch or adapted from generic lexicons? We provide evidence that automatic adaptation works better. (3) Words often have specific meanings in a domain and this increases the risk that a word is misjudged if only the generic meaning is present to the annotator. This seems to be the primary reason for the problems of manual lexicons in our experiments. Thus, if manual lexicon creation is the only option, then it is important to present words in context, not in isolation, so that the domain-specific sense can be recognized.

Related Work
In empirical finance, researchers have exploited various text resources, e.g., news (Kazemian et al., 2016), microblogs (Cortis et al., 2017), Twitter (Zamani and Schwartz, 2017) and company disclosures (Nopp and Hanbury, 2015; Kogan et al., 2009). Deep learning has been used for learning document representations (Ding et al., 2015; Akhtar et al., 2017). However, the methodology of empirical finance requires interpretable results. Thus, a common approach is to define features for statistical models like Ordinary Least Squares (Lee et al., 2014; Rekabsaz et al., 2017). Frequently, lexicons like H4N TagNeg (Tetlock et al., 2007) are used. H4N includes a total of 85,221 words, 4188 of which are labeled negative. The remaining words are labeled "common", i.e., non-negative. Loughran and Mcdonald (2011) argue that many words from H4N have a specialized meaning when appearing in an annual report. For instance, domain-general negative words such as "tax", "cost", "liability" and "depreciation", which predominate in 10-Ks, do not typically have negative sentiment in 10-Ks. So Loughran and Mcdonald (2011) constructed subjective financial dictionaries manually, by examining all words that appear in at least 5% of 10-Ks and classifying them based on their assessment of most likely usage. More recently, other finance-specific lexicons were created (Wang et al., 2013). Building on L&M, Tsai and Wang (2014) and Theil et al. (2018) show that the L&M dictionaries can be further improved by adding most similar neighbors to words manually labeled by L&M.
Seed-based methods generalize a set of seeds based on corpus (e.g., distributional) evidence.
Supervised methods start with a larger training set, not just a few seeds (Mohammad et al., 2013). Distributed word representations (Tang et al., 2014; Amir et al., 2015; Vo and Zhang, 2016; Rothe et al., 2016) are beneficial in this approach. For instance, Tang et al. (2014) incorporate in word embeddings a document-level sentiment signal. Wang and Xia (2017) also integrate document and word levels. Hamilton et al. (2016) learn domain-specific word embeddings and derive word lists specific to domains, including the finance domain.
Dictionary-based approaches (Takamura et al., 2005; Baccianella et al., 2010; Vicente et al., 2014) use hand-curated lexical resources, often WordNet (Fellbaum, 1998), for constructing lexicons. Hamilton et al. (2016) argue that dictionary-based approaches generate better results due to the quality of hand-curated resources. We compare two ways of using a hand-curated resource in this work (a general-domain resource that is automatically adapted to the specific domain vs. a resource that is manually created for the specific domain) and show that automatic domain adaptation performs better.
Apart from domain adaptation work on dictionaries, many other approaches to generic domain adaptation have been proposed. Most of this work adopts the classical domain adaptation scenario: there is a large labeled training set available in the source domain and an amount of labeled target data that is insufficient for training a high-performing model on its own (Blitzer et al., 2006; Chelba and Acero, 2006; Daumé III, 2009; Pan et al., 2010; Glorot et al., 2011; Chen et al., 2012). More recently, the idea of domain-adversarial training was introduced for the same scenario (Ganin et al., 2016). In contrast to this work, we do not transfer any parameters or model structures from source to target. Instead, we use labels from the source domain and train new models from scratch based on these labels: first embedding vectors, then a classifier that is trained on source domain labels and finally a regression model that is trained on the classification decisions of the classifier. This approach is feasible in our problem setting because the divergence between source and target sentiment labels is relatively minor, so that training target embeddings with source labels gives good results.
The motivation for this different setup is that our work primarily addresses a finance application where explainability is of high importance. For this reason, we use a model based on sentiment dictionaries that allows us to provide explanations of the model's decisions and predictions.

Empirical finance methodology
In this paper, we adopt Ordinary Least Squares (OLS), a common research method in empirical finance: a dependent variable of interest (e.g., excess return, volatility) is predicted based on a linear combination of a set of explanatory variables.
The main focus of this paper is to investigate text-based explanatory variables: we would like to know to what extent a text variable such as occurrence of negative words in a 10-K can predict a financial variable like volatility. Identifying the economic drivers of such a financial outcome is of central interest in the field of finance. Some of these determinants may be correlated with sentiment. To understand the role of sentiment in explaining financial variables we therefore need to isolate the complementary information of our text variables. This is achieved by including in our regressions, as control variables, a standard set of financial explanatory variables such as firm size and book-to-market ratio. These control variables are added as additional explanatory variables in the regression specification besides the textual sentiment variables. This experimental setup allows us to assess the added benefit of text-based variables in a realistic empirical finance scenario.
The approach is motivated by previous studies in the finance literature (e.g., Loughran and Mcdonald (2011)), which show that characteristics of financial firms can explain variation in excess returns and volatility. By including these control variables in the regression we are able to determine whether sentiment factors have incremental explanatory power beyond the already established financial factors. Since the inclusion of these control variables is not primarily driven by the assumption that firms with different characteristics use different language, our approach differs from other NLP studies, such as Hovy (2015), who accounts for nontextual characteristics by training group-specific embeddings.
Each text variable we use is based on a dictionary. Its value for a 10-K is the proportion of tokens in the 10-K that are members of the dictionary. For example, if the 10-K is 5000 tokens long and 50 of those tokens are contained in the L&M uncertainty dictionary, then the value of the L&M uncertainty text variable for this 10-K is 0.01.
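The definition above can be illustrated with a minimal sketch. The function name `text_variable` and the toy dictionary are hypothetical and chosen only for illustration; the token counts mirror the example in the text.

```python
def text_variable(tokens, dictionary):
    """Proportion of tokens in a document that are members of a dictionary."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


# Toy document: 5000 tokens, 50 of which are in a hypothetical uncertainty dictionary.
uncertainty = {"may", "approximately", "uncertain"}
tokens = ["may"] * 50 + ["revenue"] * 4950
value = text_variable(tokens, uncertainty)
print(value)
```

Note that the value is computed over tokens, not word types, so a dictionary word that occurs many times in a 10-K contributes each time.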
In the type of analysis of stock market data we conduct, there are two general forms of dependence in the residuals of a regression, which arise from the panel structure of our data set where a single firm is repeatedly observed over time and multiple firms are observed at the same point in time. Firm effect: Time-series dependence assumes that the residuals of a given firm are correlated across years. Time effect: Cross-sectional dependence assumes that the residuals of a given year are correlated across different firms. These properties violate the i.i.d. assumption of residuals in standard OLS. We therefore model data with both firm and time effects and run a two-way robust cluster regression, i.e., an OLS regression with standard errors that are clustered on two dimensions (Gelbach et al., 2009), the dimensions of firm and time. We apply this regression-based methodology to test the explanatory power of financial dictionaries with regard to two dependent variables: excess return and volatility. This approach allows us to compare the explanatory power of different sentiment dictionaries and in the process test the hypothesis that negative sentiment is associated with subsequently lower stock returns and higher volatility. We now introduce the regression specifications for these tests.
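The setup described above can be written compactly. The following is a sketch in assumed notation (the symbols are not taken from the original): $y_{it}$ is the dependent variable for firm $i$ in year $t$, $s_{it}$ a text variable, and $x_{it}$ the vector of financial control variables. The second line is the commonly used two-way clustering decomposition, in which the variance estimates clustered by firm and by time are added and the estimate clustered on their intersection is subtracted; we assume, but cannot confirm from the text, that this is the form used here.

```latex
y_{it} = \alpha + \beta\, s_{it} + \gamma^{\top} x_{it} + \varepsilon_{it}

\widehat{V}_{\text{two-way}}
  = \widehat{V}_{\text{firm}}
  + \widehat{V}_{\text{time}}
  - \widehat{V}_{\text{firm} \cap \text{time}}
```

Under this formulation the coefficient estimates are those of ordinary OLS; only the standard errors, and hence the significance tests, change.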

Excess return
The dependent variable excess return is defined as the firm's buy-and-hold stock return minus the value-weighted buy-and-hold market index return during the 4-day event window starting on the 10-K filing date, computed from prices by the Center for Research in Security Prices (CRSP) (both expressed as a percentage). In addition to the independent text variables (see §4 for details), we include the following financial control variables. (i) Firm size: the log of the book value of total assets. (ii) Alpha of a Fama-French regression (Fama and French, 1993) calculated from days [-252, -6]; this represents the "abnormal" return of the asset, i.e., the part of the return not due to common risk factors like market and firm size. (iii) Book-to-market ratio: the log of the book value of equity divided by the market value of equity. (iv) Share turnover: the volume of shares traded in days [-252, -6]
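The definition of excess return given above can be sketched as follows. This is an illustration under assumptions: the inputs are simple daily returns expressed as fractions, the function names are hypothetical, and the toy numbers are invented.

```python
def buy_and_hold_return(daily_returns):
    """Compound simple daily returns (fractions) into one buy-and-hold return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def excess_return_pct(firm_daily, market_daily):
    """Firm buy-and-hold return minus market buy-and-hold return, in percent."""
    return 100.0 * (buy_and_hold_return(firm_daily) - buy_and_hold_return(market_daily))


# Invented returns for a 4-day event window starting on the filing date.
firm = [0.01, -0.02, 0.005, 0.0]
market = [0.002, 0.001, -0.001, 0.0]
print(excess_return_pct(firm, market))
```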

NLP methodology
There are two main questions we want to answer: Q1. Is a manually domain-adapted or an automatically domain-adapted dictionary a more effective predictor of financial outcomes?
Q2. L&M adapted H4N for the financial domain and showed that this manually adapted dictionary is more effective than H4N for prediction. Can we further improve L&M's manual adaptation by automatic domain adaptation?
The general methodology we employ for domain adaptation is based on word embeddings. We train CBOW word2vec (Mikolov et al., 2013) word embeddings on a corpus of 10-Ks for all words of H4N that occur in the corpus (see §4 for details). We consider two adaptations: ADD and RE. ADD is only used to answer question Q2.
ADD. For adapting the L&M dictionary, we train an SVM on an L&M dictionary in which words are labeled +1 if they are marked for the category by L&M and labeled -1 otherwise (where the category is negative, uncertain or litigious). Each word is represented as its embedding. We then run the SVM on all H4N words that are not contained in the L&M dictionary. We also ignore H4N words that we do not have embeddings for because their frequency is below the word2vec frequency threshold. Thus, we obtain an ADD dictionary which is not a superset of the L&M lexicon because it includes only new additional words that are not part of the original dictionary. SVM scores are converted into probabilities via logistic regression. We define a confidence threshold θ: we only want to include words in the ADD dictionary that are reliable indicators of the category of interest. A word is added to the dictionary if its converted SVM score is greater than θ.
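The selection step of ADD can be sketched as below. This is a minimal illustration, not the original implementation: the logistic parameters `a` and `b` are placeholders that would normally be fit by logistic regression on SVM scores, the raw scores are invented, and the function names are hypothetical.

```python
import math


def converted_score(svm_score, a=-1.0, b=0.0):
    """Map a raw SVM decision score to a probability with a logistic function.

    a and b are placeholder parameters (assumption); in practice they
    would be estimated by logistic regression on SVM scores.
    """
    return 1.0 / (1.0 + math.exp(a * svm_score + b))


def add_dictionary(svm_scores, base_dictionary, theta=0.8):
    """New words (not in base_dictionary) whose converted score exceeds theta."""
    return {
        word
        for word, score in svm_scores.items()
        if word not in base_dictionary and converted_score(score) > theta
    }


# Invented raw SVM scores for a handful of words.
scores = {"missing": 2.5, "revenue": -1.0, "loss": 3.0, "reevaluate": 0.5}
base_negative = {"loss"}  # stands in for words already in the manual dictionary
new_words = add_dictionary(scores, base_negative)
print(new_words)
```

Because words already in the base dictionary are skipped, the result contains only additions, which is why an ADD dictionary is not a superset of the dictionary it extends.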
RE. We train SVMs as for ADD, but this time in a five-fold cross validation setup. Again, SVM scores are converted into probabilities via logistic regression. A word w becomes a member of the adapted dictionary if the converted score assigned to w by the SVM that was not trained on the fold containing w is greater than θ.
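The held-out scoring logic of RE can be sketched as follows. To keep the example self-contained, a simple midpoint classifier on one-dimensional toy "embeddings" stands in for the SVM and its logistic conversion; this substitution, the vectors, and all function names are assumptions made for illustration only.

```python
import math


def make_trainer(vectors, labels):
    """Stand-in for SVM training (assumption): a midpoint rule on 1-D vectors."""
    def train_and_score(training_words):
        pos = [vectors[w] for w in training_words if labels[w] == 1]
        neg = [vectors[w] for w in training_words if labels[w] == -1]
        midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
        # Return a scorer mapping a word to a probability-like value.
        return lambda w: 1.0 / (1.0 + math.exp(-(vectors[w] - midpoint)))
    return train_and_score


def out_of_fold_scores(words, train_and_score, k=5):
    """Score each word with the model that was not trained on its fold."""
    scores = {}
    for i in range(k):
        held_out = words[i::k]
        held = set(held_out)
        scorer = train_and_score([w for w in words if w not in held])
        for w in held_out:
            scores[w] = scorer(w)
    return scores


# Source-domain labels: +1 = negative in the general dictionary, -1 = common.
labels = {"loss": 1, "decline": 1, "adverse": 1, "impairment": 1, "tax": 1,
          "revenue": -1, "company": -1, "product": -1, "year": -1, "compromise": -1}
# Invented 1-D vectors: "tax" sits with the common words, "compromise" with the negative ones.
vectors = {"loss": 5.0, "decline": 4.0, "adverse": 6.0, "impairment": 5.0, "tax": -3.0,
           "revenue": -5.0, "company": -4.0, "product": -6.0, "year": -5.0, "compromise": 4.5}

words = list(labels)
scores = out_of_fold_scores(words, make_trainer(vectors, labels), k=5)
adapted = {w for w, p in scores.items() if p > 0.8}
print(sorted(adapted))
```

In this toy setup the source label of a word never influences its own score, so a word whose domain usage disagrees with its source label can be relabeled in either direction.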
To answer our first question Q1: "Is automatic or manual adaptation better?", we apply adaptation method RE to H4N and compare the results to the L&M dictionaries.
To answer our second question Q2: "Can manual adaptation be further improved by automatic adaptation?", we apply adaptation methods RE and ADD to the three dictionaries compiled by L&M and compare results for original and adapted L&M dictionaries: (i) negative (abbreviated as "neg"), (ii) uncertain (abbreviated as "unc"), (iii) litigious (abbreviated as "lit"). Our goals here are to improve the in-domain L&M dictionaries by relabeling them using adaptation method RE and to find new additional words using adaptation method ADD.

Experiments and results
We downloaded 206,790 10-Ks for years 1994 to 2013 from the SEC's database EDGAR. Tables of contents, page numbers, links and numeric tables are removed in preprocessing and only the main body of the text is retained. Documents are split into sections. Sections that are not useful for textual analysis (e.g., boilerplate) are deleted.
To construct the final sample, we apply the filters defined by L&M (Loughran and Mcdonald, 2011): we require a match with CRSP's permanent identifier PERMNO, the stock to be common equity, a stock pre-filing price of greater than $3, a positive book-to-market, as well as CRSP's market capitalization and stock return data available at least 60 trading days before and after the filing date. We only keep firms traded on Nasdaq, NYSE or AMEX and whose filings contain at least 2000 words. This procedure results in a corpus of 60,432 10-Ks. We tokenize (using NLTK) and lowercase this corpus and remove punctuation.
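The sample filters listed above can be expressed as a single predicate. This is a sketch only: the `Filing` record and its field names are hypothetical, and how each quantity is derived from CRSP and EDGAR data is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Filing:
    # All field names are illustrative assumptions, not an actual data schema.
    permno: Optional[int]        # matched CRSP permanent identifier, if any
    is_common_equity: bool
    pre_filing_price: float      # in dollars
    book_to_market: float
    trading_days_before: int     # days with CRSP data before the filing date
    trading_days_after: int      # days with CRSP data after the filing date
    exchange: str
    word_count: int


def keep_filing(f: Filing) -> bool:
    """Apply the sample filters described in the text to one 10-K filing."""
    return (
        f.permno is not None
        and f.is_common_equity
        and f.pre_filing_price > 3.0
        and f.book_to_market > 0.0
        and f.trading_days_before >= 60
        and f.trading_days_after >= 60
        and f.exchange in {"NASDAQ", "NYSE", "AMEX"}
        and f.word_count >= 2000
    )


ok = Filing(10107, True, 25.0, 0.4, 250, 250, "NASDAQ", 12000)
penny = Filing(10108, True, 2.5, 0.4, 250, 250, "NYSE", 12000)
print(keep_filing(ok), keep_filing(penny))
```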
We use word2vec CBOW with hierarchical softmax to learn word embeddings from the corpus. We set the size of word vectors to 400 and run one training iteration; otherwise we use word2vec's default hyperparameters. SVMs are trained on word embeddings as described in §3.2. We set the threshold θ to 0.8, so only words with converted SVM scores greater than 0.8 will be added to dictionaries.
As described in §3, we compare manually adapted and automatically adapted dictionaries (Q1) and investigate whether automatic adaptation of manually adapted dictionaries further improves performance (Q2). Our experimental setup is Ordinary Least Squares (OLS), more specifically, a two-way robust cluster regression for the time and firm effects. The dependent financial variable is excess return or volatility. We include several independent financial variables in the regression as well as one or more text variables. The value of the text variable for a category is the proportion of tokens in a 10-K that belong to the dictionary for that category.
To assess the utility of a text variable for predicting a financial outcome, we look at significance and the standardized regression coefficient (the product of the regression coefficient and the standard deviation of the independent variable). If a result is significant, then it is unlikely that the result is due to chance. The standardized coefficient measures the effect size, normalized for different value ranges of variables. It can be interpreted as the expected change in the dependent variable if the independent variable increases by one standard deviation. The standardized coefficient allows a fair comparison between a text variable that, on average, has high values (many tokens per document) with one that, on average, has low values (few tokens per document).
Table 2 gives regression results for excess return, comparing H4N RE (our automatic adaptation of the general Harvard dictionary) with the three manually adapted L&M dictionaries. As expected, the coefficients are negatively signed: 10-Ks containing a high percentage of pessimistic words are associated with negative excess returns.
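The standardized coefficient described above can be computed as in the following sketch. We assume that the standard deviation is that of the independent (text) variable and that the sample standard deviation is used; the function name and the toy numbers are invented for illustration.

```python
import statistics


def standardized_coefficient(coefficient, values):
    """Regression coefficient times the standard deviation of the variable.

    Assumption: the sample standard deviation of the independent variable.
    """
    return coefficient * statistics.stdev(values)


# Invented example: raw coefficient -8.0 on a text variable whose values
# across documents have standard deviation 0.01.
text_values = [0.01, 0.02, 0.03]
effect = standardized_coefficient(-8.0, text_values)
print(effect)
```

Read this way, the result is the expected change in the dependent variable when the text variable increases by one standard deviation, which makes variables with very different value ranges comparable.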

Excess Return
L&M designed the dictionary neg lm specifically for measuring negative information in a 10-K that may have a negative effect on outcomes like excess return. So it is not surprising that neg lm is the best performing dictionary of the three L&M dictionaries: it has the largest standard coefficient in absolute value (-0.080) and the highest significance (t = -2.56). unc lm performs slightly worse, but is also significant. However, when comparing the three L&M dictionaries with H4N RE, the automatically adapted Harvard dictionary, we see that H4N RE performs clearly better: it is highly significant and its standard coefficient is larger by a factor of more than 2 compared to neg lm. This evidence suggests that the automatically created H4N RE dictionary has a higher explanatory power for excess returns than the manually created L&M dictionaries. This provides an initial answer to question Q1: in this case, automatic adaptation beats manual adaptation.
Table 3 shows manual plus automatic experiments with multiple text variables in one regression, in particular, the combination of H4N RE with each of the L&M dictionaries. We see that the explanatory power of L&M variables is lost after we additionally include H4N RE in a regression: none of the three L&M variables is significant. In contrast, H4N RE continues to be significant in all experiments, with large standard coefficients. More manual plus automatic experiments can be found in the appendix. These experiments further confirm that automatic is better than manual adaptation.
Table 4 shows results for automatically adapting the L&M dictionaries. The subscript "RE+ADD" refers to a dictionary that merges RE and ADD; e.g., neg RE+ADD is the union of neg RE and neg ADD.
We see that for each category (neg, lit and unc), the automatically adapted dictionary performs better than the original manually adapted dictionary; e.g., the standard coefficient of neg RE is -0.111, [...] (-2.96) and unc RE (-2.77). We also evaluate neg spec, the negative word list of Hamilton et al. (2016). neg spec does not perform well: it is not significant.
These results provide a partial answer to question Q2: for excess return, automatic adaptation of L&M's manually adapted dictionaries further improves their performance.
Table 5 compares H4N RE and L&M regression results for volatility. Except for litigious, the coefficients are positive, so the greater the number of pessimistic words, the greater the volatility.

Volatility
Results for neg lm, unc lm and H4N RE are statistically significant. The best L&M dictionary is again neg lm with standard coefficient 0.0472 and t = 3.30. However, H4N RE has the highest explanatory value for volatility. Its standard coefficient (0.173) is more than three times as large as that of neg lm.
The higher effect size demonstrates that H4N RE explains volatility better than the L&M dictionaries. Again, this indicates, in answer to question Q1, that automatic outperforms manual adaptation.
Table 7: Volatility regression results for L&M, RE and ADD dictionaries
[...] suggesting that they are not indicative of the true relationship between volatility and negative tone in 10-Ks in this regression setup. Our results of additional manual plus automatic experiments support this observation as well. See the appendix for an illustration.
Table 7 gives results for automatically adapting the L&M dictionaries. For neg, the standard coefficient of neg RE is 0.0657, better by about 40% than neg lm's standard coefficient of 0.0472. neg spec does not provide significant results and has the negative sign, i.e., an increase of negative words decreases volatility. The lit dictionaries are not significant (neither L&M nor adapted dictionaries). For unc, unc RE performs worse than unc lm, but only slightly: 0.0344 vs. 0.0356 for the standard coefficients. The overall best result is neg RE (standard coefficient 0.0657). Even though L&M designed the unc lm dictionary specifically for volatility, our results indicate that neg dictionaries perform better than unc dictionaries, both for L&M dictionaries (neg lm) and their automatic adaptations (e.g., neg RE). Table 7 also evaluates unc spec, the uncertainty dictionary of Theil et al. (2018). unc spec does not perform well: it is not significant and the coefficient has the "wrong" sign. The main finding supported by Table 7 is that the best automatic adaptation of an L&M dictionary gives rise to more explanatory power than the best L&M dictionary, i.e., neg RE performs better than neg lm. This again confirms our answer to Q2: we can further improve manual adaptation by automatic domain adaptation.

Qualitative Analysis
Our dictionaries outperform L&M. In this section, we perform a qualitative analysis to determine the reasons for this discrepancy in performance. Table 8 shows words from automatically adapted dictionaries. Recall that the ADD method adds words that L&M classified as nonrelevant for a category. So words like "missing" (neg), "reevaluate" (unc) and "assignors" (lit) were classified as relevant terms and seem to connote negativity, uncertainty and litigiousness, respectively, in financial contexts.
In L&M's classification scheme, a word can be part of several different categories. For instance, L&M label "unlawful", "convicted" and "breach" both as litigious and as negative. When applying our RE method, these words were only classified as negative, not as litigious. Similarly, L&M label "confusion" as negative and uncertain, but automatic RE adaptation labels it only negative. This indicates that there is strong distributional evidence in the corpus for the category negativity, but weaker distributional evidence for litigious and uncertain. For our application, only "negative" litigious/uncertain words are of interest: "acquittal" (positive litigious) and "suspense" (positive uncertain) are examples of positive words that may not help in predicting financial variables. This could explain why the negative category fares better in our adaptation than the other two.
An interesting case study for RE is "abeyance". L&M classify it as uncertain, automatic adaptation as litigious. Even though "abeyance" has a domain-general uncertain sense ("something that is waiting to be acted upon"), it is mostly used in legal contexts in 10-Ks: "held in abeyance", "appeal in abeyance". The nearest neighbors of "abeyance" in embedding space are also litigious words: "stayed", "hearings", "mediation".
H4N RE contains 74 words that are "common" in H4N. Examples include "compromise", "serious" and "god". The nearest neighbors of "compromise" in the 10-K embedding space are the negative terms "misappropriate", "breaches", "jeopardize". In a general-domain embedding space, the nearest neighbors of "compromise" include "negotiated settlement", "accord" and "modus vivendi". This example suggests that "compromise" is used in 10-Ks in negative contexts and in the general domain in positive contexts. This also illustrates the importance of domain-specific word embeddings that capture domain-specific information.
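A nearest-neighbor lookup of the kind used in this analysis can be sketched as follows. We assume cosine similarity as the measure, which the text does not state; the two-dimensional vectors are invented and are not actual 10-K embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(word, embeddings, n=3):
    """The n words whose vectors are most similar to the vector of `word`."""
    target = embeddings[word]
    ranked = sorted(
        ((cosine(target, vec), other) for other, vec in embeddings.items() if other != word),
        reverse=True,
    )
    return [other for _, other in ranked[:n]]


# Invented toy vectors placing "compromise" near negative terms.
toy = {
    "compromise": [0.9, 0.1],
    "breaches": [0.8, 0.2],
    "jeopardize": [1.0, 0.0],
    "accord": [0.1, 0.9],
}
neighbors = nearest_neighbors("compromise", toy, n=2)
print(neighbors)
```

Running the same lookup against embeddings trained on a different corpus is what reveals a domain-specific sense: the neighbors change while the word stays the same.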
Another interesting example is the word "god"; it is frequently used in 10-Ks in the phrase "act of God". Its nearest neighbors in the 10-K embedding space are "terrorism" and "war". This example clearly demonstrates that annotators are likely to make mistakes when they annotate words for sentiment without seeing their contexts. Most annotators would annotate "god" as positive, but when presented with the typical context in 10-Ks ("act of God"), they would be able to correctly classify it.
We conclude that manual annotation of words without context based on the prior belief an annotator has about word meanings is error-prone. Our automatic adaptation is performed based on the word's contexts in the target domain and therefore not susceptible to this type of error.
Table 9 presents a quantitative analysis of the distribution of words over dictionaries. For a row dictionary d_r and a column dictionary d_c, a cell gives |d_r ∩ d_c| / |d_r| as a percentage. (Diagonal entries are all equal to 100% and are omitted for space reasons.) For example, 49% of the words in neg lm are also members of neg RE (row "neg lm", column "neg RE"). This analysis allows us to obtain insights into the relationship between different dictionaries and into the relationship between the categories negative, litigious and uncertain.
Looking at rows neg lm, lit lm and unc lm first, we see how L&M constructed their dictionaries. neg lm words come from H4N neg and H4N cmn in about equal proportions; i.e., many words that are "common" in ordinary usage were classified as negative by L&M for financial text. Relatively few lit lm and unc lm words are taken from H4N neg; most are from H4N cmn. Only 12% of neg lm words were automatically classified as negative in domain adaptation and assigned to H4N RE. This is a surprisingly low number. Given that H4N RE performs better than neg lm in our experiments, this statistic casts serious doubt on the ability of human annotators to correctly classify words for the type of sentiment analysis that is performed in empirical finance if the actual corpus contexts of the words are not considered. We see two types of failures in the human annotation. First, as discussed in §5.1, words like "god" are misclassified because the prevalent context in 10-Ks ("act of God") is not obvious to the annotator. Second, the utility of a word is not only a function of its sentiment, but also of the strength of this sentiment.
Many words in neg lm that were deemed neutral in automatic adaptation are probably words that may be slightly negative, but that do not contribute to explaining financial variables like excess return. The strength of sentiment of a word is difficult for human annotators to judge.
Looking at the row H4N RE, we see that most of its words are taken from neg lm (79%) and a few from lit lm and unc lm (2% each). We can interpret this statistic as indicating that L&M had high recall (they found most of the reliable indicators), but low precision (see the previous paragraph: only 12% of their negative words survive in H4N RE). The distribution of H4N RE words over H4N neg and H4N cmn is 78:22. This confirms the need for domain adaptation: many general-domain common words are negative in the financial domain.

Quantitative Analysis
We finally look at how dictionaries for negative, litigious and uncertain overlap, separately for the L&M, ADD and RE dictionaries. lit lm and unc lm have considerable overlap with neg lm (17% and 14%), but they do not overlap with each other. The three ADD dictionaries (neg ADD, lit ADD and unc ADD) do not overlap at all. As for RE, 10% of the words of unc RE are also in neg RE; otherwise there is no overlap between RE dictionaries. Comparing the original L&M dictionaries and the automatically adapted ADD and RE dictionaries, we see that the three categories (negative, litigious and uncertain) are more clearly distinguished after adaptation. L&M dictionaries overlap more, ADD and RE dictionaries overlap less.
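The overlap statistic used in this analysis, the share of a row dictionary's words that also occur in a column dictionary, can be sketched as follows. The function name and the toy dictionaries are invented for illustration and do not reproduce the actual word lists.

```python
def overlap_pct(row_dict, col_dict):
    """Percentage of words in row_dict that are also in col_dict."""
    if not row_dict:
        return 0.0
    return 100.0 * len(row_dict & col_dict) / len(row_dict)


# Invented toy dictionaries.
neg = {"loss", "breach", "adverse", "decline"}
lit = {"breach", "plaintiff"}
unc = {"may", "approximately"}

print(overlap_pct(lit, neg))  # share of lit words that are also in neg
print(overlap_pct(neg, lit))  # share of neg words that are also in lit
print(overlap_pct(lit, unc))
```

The measure is asymmetric because it is normalized by the size of the row dictionary, so each pair of dictionaries yields two different cell values.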

Conclusion
In this paper, we automatically created sentiment dictionaries for predicting financial outcomes. In our experiments, we demonstrated that the automatically adapted sentiment dictionary H4N RE outperforms the previous state of the art in predicting the financial outcomes excess return and volatility. In particular, automatic adaptation performs better than manual adaptation. Our quantitative and qualitative study provided insight into the semantics of the dictionaries. We found that annotation based on an expert's a priori belief about a word's meaning can be incorrect; annotation should be performed based on the word's contexts in the target domain instead. In the future, we plan to investigate whether there are changes over time that significantly impact the linguistic characteristics of the data, in the simplest case changes in the meaning of a word. Another interesting topic for future research is the comparison of domain adaptation based on our domain-specific word embeddings vs. based on word embeddings trained on much larger corpora.