Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.


Introduction
Language models are an important component of natural language generation tasks, such as machine translation and summarization. They use context (a sequence of words) to estimate a probability distribution of the upcoming word. For several years now, neural language models (NLMs) (Graves, 2013; Jozefowicz et al., 2016; Grave et al., 2017a; Dauphin et al., 2017; Melis et al., 2018; Yang et al., 2018) have consistently outperformed classical n-gram models, an improvement often attributed to their ability to model long-range dependencies in faraway context. Yet, how these NLMs use the context is largely unexplained.
Recent studies have begun to shed light on the information encoded by Long Short-Term Memory (LSTM) networks. They can remember sentence lengths, word identity, and word order (Adi et al., 2017), can capture some syntactic structures such as subject-verb agreement (Linzen et al., 2016), and can model certain kinds of semantic compositionality such as negation and intensification (Li et al., 2016).
However, all of the previous work studies LSTMs at the sentence level, even though they can potentially encode longer context. Our goal is to complement the prior work to provide a richer understanding of the role of context, in particular, long-range context beyond a sentence. We aim to answer the following questions: (i) How much context is used by NLMs, in terms of the number of tokens? (ii) Within this range, are nearby and long-range contexts represented differently? (iii) How do copy mechanisms help the model use different regions of context?
We investigate these questions via ablation studies on a standard LSTM language model (Merity et al., 2018) on two benchmark language modeling datasets: Penn Treebank and WikiText-2. Given a pretrained language model, we perturb the prior context in various ways at test time, to study how much the perturbed information affects model performance. Specifically, we alter the context length to study how many tokens are used, permute tokens to see if LSTMs care about word order in both local and global contexts, and drop and replace target words to test the copying abilities of LSTMs with and without an external copy mechanism, such as the neural cache (Grave et al., 2017b). The cache operates by first recording target words and their context representations seen in the history, and then encouraging the model to copy a word from the past when the current context representation matches that word's recorded context vector.
We find that the LSTM is capable of using about 200 tokens of context on average, with no observable differences from changing the hyperparameter settings. Within this context range, word order is only relevant within the 20 most recent tokens or about a sentence. In the long-range context, order has almost no effect on performance, suggesting that the model maintains a high-level, rough semantic representation of faraway words. Finally, we find that LSTMs can regenerate some words seen in the nearby context, but heavily rely on the cache to help them copy words from the long-range context.

Language Modeling
Language models assign probabilities to sequences of words. In practice, the probability can be factorized using the chain rule, and language models compute the conditional probability of a target word $w_t$ given its preceding context, $w_1, \ldots, w_{t-1}$. Language models are trained to minimize the negative log likelihood of the training corpus: $\text{NLL} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, \ldots, w_1)$, and the model's performance is usually evaluated by perplexity (PP) on a held-out set: $\text{PP} = \exp(\text{NLL})$. When testing the effect of ablations, we focus on comparing differences in the language model's losses (NLL) on the dev set, which is equivalent to relative improvements in perplexity.
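The link between a difference in NLL and a relative change in perplexity can be checked numerically. The following is a minimal sketch, assuming the NLL is averaged over tokens and measured in nats; the function names and the per-token probabilities are hypothetical and chosen only for illustration, not taken from our experiments.

```python
import math

def nll(token_probs):
    """Average negative log likelihood (in nats) of the target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the average NLL."""
    return math.exp(nll(token_probs))

# Hypothetical probabilities assigned to four target words.
full_context = [0.25, 0.5, 0.125, 0.25]
# Hypothetical probabilities for the same words under a perturbed context.
perturbed_context = [0.2, 0.4, 0.1, 0.2]

delta_nll = nll(perturbed_context) - nll(full_context)
pp_ratio = perplexity(perturbed_context) / perplexity(full_context)

# A difference in NLL corresponds to a ratio of perplexities:
# pp_ratio equals exp(delta_nll) up to floating-point error.
```

Because $\text{PP} = \exp(\text{NLL})$, an additive increase in loss translates into a multiplicative increase in perplexity, which is why loss differences can be read as relative perplexity changes.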

Approach
Our goal is to investigate the effect of contextual features such as the length of context, word order and more, on LSTM performance. Thus, we use ablation analysis, during evaluation, to measure changes in model performance in the absence of certain contextual information. Typically, when testing the language model on a held-out sequence of words, all tokens prior to the target word are fed to the model; we call this the infinite-context setting. In this study, we observe the change in perplexity or NLL when the model is fed a perturbed context $\delta(w_{t-1}, \ldots, w_1)$ at test time. Here $\delta$ refers to the perturbation function, and we experiment with perturbations such as dropping tokens, shuffling/reversing tokens, and replacing tokens with other words from the vocabulary.¹ It is important to note that we do not train the model with these perturbations. This is because the aim is to start with an LSTM that has been trained in the standard fashion, and discover how much context it uses and which features in nearby vs. long-range context are important. Hence, the mismatch in training and test is a necessary part of experiment design, and all measured losses are upper bounds which would likely be lower, were the model also trained to handle such perturbations.
We use a standard LSTM language model, trained and finetuned using the Averaging SGD optimizer (Merity et al., 2018).² We also augment the model with a cache only for Section 6.2, in order to investigate why an external copy mechanism is helpful. A short description of the architecture and a detailed list of hyperparameters are provided in Appendix A, and we refer the reader to the original paper for additional details.
We analyze two datasets commonly used for language modeling, Penn Treebank (PTB) (Marcus et al., 1993; Mikolov et al., 2010) and WikiText-2 (Wiki) (Merity et al., 2017). PTB consists of Wall Street Journal news articles with 0.9M tokens for training and a 10K vocabulary. Wiki is a larger and more diverse dataset, containing Wikipedia articles across many topics with 2.1M tokens for training and a 33K vocabulary. Additional dataset statistics are provided in Table 1.
¹ Code for our experiments available at https://github.com/urvashik/lm-context-analysis
² Public release of their code at https://github.com/salesforce/awd-lstm-lm
In this paper, we present results only on the dev sets, in order to avoid revealing details about the test sets. However, we have confirmed that all results are consistent with those on the test sets. In addition, for all experiments we report averaged results from three models trained with different random seeds. Some of the figures provided contain trends from only one of the two datasets and the corresponding figures for the other dataset are provided in Appendix B.

4 How much context is used?
LSTMs are designed to capture long-range dependencies in sequences (Hochreiter and Schmidhuber, 1997). In practice, LSTM language models are provided an infinite amount of prior context, which is as long as the test sequence goes. However, it is unclear how much of this history has a direct impact on model performance. In this section, we investigate how many tokens of context achieve a similar loss (or 1-2% difference in model perplexity) to providing the model infinite context. We consider this the effective context size.
LSTM language models have an effective context size of about 200 tokens on average. We determine the effective context size by varying the number of tokens fed to the model. In particular, at test time, we feed the model the most recent $n$ tokens: $\delta_{\text{truncate}}(w_{t-1}, \ldots, w_1) = (w_{t-1}, \ldots, w_{t-n})$, (1) where $n > 0$ and all tokens farther away from the target $w_t$ are dropped.³ We compare the dev loss (NLL) from truncated context, to that of the infinite-context setting where all previous words are fed to the model. The resulting increase in loss indicates how important the dropped tokens are for the model. Figure 1a shows that the difference in dev loss, between truncated- and infinite-context variants of the test setting, gradually diminishes as we increase $n$ from 5 tokens to 1000 tokens. In particular, we only see a 1% increase in perplexity as we move beyond a context of 150 tokens on PTB and 250 tokens on Wiki. Hence, we provide empirical evidence to show that LSTM language models do, in fact, model long-range dependencies, without help from extra context vectors or caches.
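The truncation in Equation 1 and the notion of effective context size can be sketched as follows. This is an illustrative sketch only, assuming the context is stored as a Python list with the oldest token first; the function names, the 1% tolerance default, and the toy loss values are hypothetical and are not measurements from our experiments.

```python
import math

def truncate(context, n):
    """Keep only the n most recent tokens; `context` is ordered oldest first."""
    if n <= 0:
        raise ValueError("n must be positive")
    return context[-n:]

def effective_context_size(loss_by_n, loss_infinite, tolerance=0.01):
    """Smallest n whose perplexity is within `tolerance` (relative)
    of the infinite-context perplexity, or None if no n qualifies."""
    for n in sorted(loss_by_n):
        relative_pp_increase = math.exp(loss_by_n[n] - loss_infinite) - 1.0
        if relative_pp_increase <= tolerance:
            return n
    return None

# Toy usage with made-up losses (not results from the paper).
history = ["a", "b", "c", "d"]
recent = truncate(history, 2)          # the two most recent tokens
toy_losses = {1: 2.0, 2: 1.5, 4: 1.005, 8: 1.0}
toy_size = effective_context_size(toy_losses, loss_infinite=1.0)
```

In practice each entry of the loss table would come from evaluating the trained model on the dev set with contexts truncated to the corresponding length.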
Changing hyperparameters does not change the effective context size. NLM performance has been shown to be sensitive to hyperparameters such as the dropout rate and model size (Melis et al., 2018). To investigate if these hyperparameters affect the effective context size as well, we train separate models by varying the following hyperparameters one at a time: (1) number of timesteps for truncated back-propagation, (2) dropout rate, (3) model size (hidden state size, number of layers, and word embedding size). In Figure 1b, we show that while different hyperparameter settings result in different perplexities in the infinite-context setting, the trend of how perplexity changes as we reduce the context size remains the same.

Do different types of words need different amounts of context?
The effective context size determined in the previous section is aggregated over the entire corpus, which ignores the type of the upcoming word. Boyd-Graber and Blei (2009) have previously investigated the differences in context used by different types of words and found that function words rely on less context than content words. We investigate whether the effective context size varies across different types of words, by categorizing them based on either frequency or parts-of-speech. Specifically, we vary the number of context tokens in the same way as the previous section, and aggregate loss over words within each class separately.
Infrequent words need more context than frequent words. We categorize words that appear at least 800 times in the training set as frequent, and the rest as infrequent. Figure 1c shows that the loss of frequent words is insensitive to missing context beyond the 50 most recent tokens, which holds across the two datasets. Infrequent words, on the other hand, require more than 200 tokens.
Content words need more context than function words. Given the parts-of-speech of each word, we define content words as nouns, verbs and adjectives, and function words as prepositions and determiners.⁴ Figure 1d shows that the loss of nouns and verbs is affected by distant context, whereas when the target word is a determiner, the model only relies on words within the last 10 tokens.
Discussion. Overall, we find that the model's effective context size is dynamic. It depends on the target word, which is consistent with what we know about language, e.g., determiners require less context than nouns (Boyd-Graber and Blei, 2009). In addition, these findings are consistent with those previously reported for different language models and datasets (Hill et al., 2016; Wang and Cho, 2016).

Nearby vs. long-range context
An effective context size of 200 tokens allows for representing linguistic information at many levels of abstraction, such as words, sentences, topics, etc. In this section, we investigate the importance of contextual information such as word order and word identity. Unlike prior work that studies LSTM embeddings at the sentence level, we look at both nearby and faraway context, and analyze how the language model treats contextual information presented in different regions of the context.

Does word order matter?
Adi et al. (2017) have shown that LSTMs are aware of word order within a sentence. We investigate whether LSTM language models are sensitive to word order within a larger context window. To determine the range in which word order affects model performance, we permute substrings in the context to observe their effect on dev loss compared to the unperturbed baseline. In particular, we perturb the context as follows: $\delta_{\text{permute}}(w_{t-1}, \ldots, w_{t-n}) = (w_{t-1}, \ldots, \rho(w_{t-s_1-1}, \ldots, w_{t-s_2}), \ldots, w_{t-n})$, where $\rho \in \{\text{shuffle}, \text{reverse}\}$ and $(s_1, s_2]$ denotes the range of the substring to be permuted. We refer to this substring as the permutable span. For the following analysis, we distinguish local word order, within 20-token permutable spans which are the length of an average sentence, from global word order, which extends beyond local spans to include all the farthest tokens in the history. We consider selecting permutable spans within a context of $n = 300$ tokens, which is greater than the effective context size.
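The permutation of a span at a given distance from the target can be sketched as below. This is a minimal illustration, assuming the context is a list ordered oldest first, so that the token at distance $d$ from the target sits at index `len(context) - d`; the function name, the `mode` argument, and the fixed seed are hypothetical choices made for this sketch.

```python
import random

def permute(context, s1, s2, mode="shuffle", seed=0):
    """Permute the tokens at distances s1+1 .. s2 from the target.

    `context` is ordered oldest first; the most recent token is context[-1].
    Tokens outside the permutable span (s1, s2] are left untouched.
    """
    n = len(context)
    if not 0 <= s1 < s2 <= n:
        raise ValueError("need 0 <= s1 < s2 <= len(context)")
    lo, hi = n - s2, n - s1          # slice holding the permutable span
    span = context[lo:hi]            # slicing copies, so the input is not mutated
    if mode == "shuffle":
        random.Random(seed).shuffle(span)
    elif mode == "reverse":
        span = span[::-1]
    else:
        raise ValueError("mode must be 'shuffle' or 'reverse'")
    return context[:lo] + span + context[hi:]

# Toy usage: reverse the span between 2 and 6 tokens away from the target.
toy_context = list("abcdefgh")
reversed_span = permute(toy_context, s1=2, s2=6, mode="reverse")
shuffled_span = permute(toy_context, s1=2, s2=6, mode="shuffle")
```

Setting `s2 = s1 + 20` corresponds to a local permutable span, while `s2 = len(context)` permutes everything beyond the `s1` most recent tokens.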
Local word order only matters for the most recent 20 tokens. We can locate the region of context beyond which the local word order has no relevance, by permuting word order locally at various points within the context. We accomplish this by varying $s_1$ and setting $s_2 = s_1 + 20$. Figure 2a shows that local word order matters very much within the most recent 20 tokens, and far less beyond that.
Global order of words only matters for the most recent 50 tokens. Similar to the local word order experiment, we locate the point beyond which the general location of words within the context is irrelevant, by permuting global word order. We achieve this by varying $s_1$ and fixing $s_2 = n$. Figure 2b demonstrates that after 50 tokens, shuffling or reversing the remaining words in the context has no effect on the model performance.
In order to determine whether this is due to insensitivity to word order or whether the language model is simply not sensitive to any changes in the long-range context, we further replace words in the permutable span with a randomly sampled sequence of the same length from the training set. The gap between the permutation and replacement curves in Figure 2b illustrates that the identity of words in the far away context is still relevant, and only the order of the words is not.
Discussion. These results suggest that word order matters only within the most recent sentence, beyond which the order of sentences matters for 2-3 sentences (determined by our experiments on global word order). After 50 tokens, word order has almost no effect, but the identity of those words is still relevant, suggesting a high-level, rough semantic representation for these faraway words. In light of these observations, we define 50 tokens as the boundary between nearby and long-range context, for the rest of this study. Next, we investigate the importance of different word types in the different regions of context.

Types of words and the region of context
Open-class or content words such as nouns, verbs, adjectives and adverbs, contribute more to the semantic context of natural language than function words such as determiners and prepositions. Given our observation that the language model represents long-range context as a rough semantic representation, a natural question to ask is how important are function words in the long-range context? Below, we study the effect of these two classes of words on the model's performance. Function words are defined as all words that are not nouns, verbs, adjectives or adverbs.
Figure 3: Effect of dropping content and function words from 300 tokens of context relative to an unperturbed baseline, on PTB. Error bars represent 95% confidence intervals. Dropping both content and function words 5 tokens away from the target results in a nontrivial increase in loss, whereas beyond 20 tokens, only content words are relevant.
Content words matter more than function words. To study the effect of content and function words on model perplexity, we drop them from different regions of the context and compare the resulting change in loss. Specifically, we perturb the context as follows: $\delta_{\text{drop}}(w_{t-1}, \ldots, w_{t-n}) = (w_{t-1}, \ldots, w_{t-s_1}, f_{\text{pos}}(y, (w_{t-s_1-1}, \ldots, w_{t-n})))$, where $f_{\text{pos}}(y, \text{span})$ is a function that drops all words with POS tag $y$ in a given span, and $s_1$ denotes the starting offset of the perturbed subsequence. For these experiments, we set $s_1 \in \{5, 20, 100\}$. On average, there are slightly more content words than function words in any given text. As shown in Section 4, dropping more words results in higher loss. To eliminate the effect of dropping different fractions of words, for each experiment where we drop a specific word type, we add a control experiment where the same number of tokens are sampled randomly from the context, and dropped. Figure 3 shows that dropping content words as close as 5 tokens from the target word increases model perplexity by about 65%, whereas dropping the same proportion of tokens at random results in a much smaller 17% increase. Dropping all function words, on the other hand, is not very different from dropping the same proportion of words at random, but still increases loss by about 15%. This suggests that within the most recent sentence, content words are extremely important but function words are also relevant since they help maintain grammaticality and syntactic structure. On the other hand, beyond a sentence, only content words have a sizeable influence on model performance.
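The drop perturbation and its random control can be sketched as follows. This is an illustrative sketch under stated assumptions: the context and its POS tags are parallel lists ordered oldest first, the tag names are placeholders rather than the tag set used in our experiments, and the function names and fixed seed are hypothetical.

```python
import random

def drop_pos(context, tags, drop_tags, s1):
    """Drop tokens whose POS tag is in `drop_tags`, but only among tokens
    more than s1 positions away from the target (oldest-first ordering)."""
    cutoff = len(context) - s1       # indices >= cutoff are the s1 most recent tokens
    return [w for i, (w, tag) in enumerate(zip(context, tags))
            if i >= cutoff or tag not in drop_tags]

def drop_random(context, k, s1, seed=0):
    """Control: drop k randomly chosen tokens from the same region."""
    cutoff = max(len(context) - s1, 0)
    dropped = set(random.Random(seed).sample(range(cutoff), k))
    return [w for i, w in enumerate(context) if i not in dropped]

# Toy usage with placeholder tags.
toy_context = ["the", "cat", "sat", "on", "the", "mat", "and", "slept"]
toy_tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "CCONJ", "VERB"]
content_tags = {"NOUN", "VERB", "ADJ", "ADV"}

without_content = drop_pos(toy_context, toy_tags, content_tags, s1=3)
num_dropped = len(toy_context) - len(without_content)
control = drop_random(toy_context, num_dropped, s1=3)
```

Matching the number of dropped tokens between the two conditions is what lets the comparison isolate the effect of word type from the effect of simply shortening the context.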

6 To cache or not to cache?
As shown in Section 5.1, LSTM language models use a high-level, rough semantic representation for long-range context, suggesting that they might not be using information from any specific words located far away. Adi et al. (2017) have also shown that while LSTMs are aware of which words appear in their context, this awareness degrades with increasing length of the sequence. However, the success of copy mechanisms such as attention and caching (Bahdanau et al., 2015; Hill et al., 2016; Merity et al., 2017; Grave et al., 2017a,b) suggests that information in the distant context is very useful. Given this fact, can LSTMs copy any words from context without relying on external copy mechanisms? Do they copy words from nearby and long-range context equally? How does the caching model help? In this section, we investigate these questions by studying how LSTMs copy words from different regions of context. More specifically, we look at two regions of context, nearby (within 50 most recent tokens) and long-range (beyond 50 tokens), and study three categories of target words: those that can be copied from nearby context ($C_{\text{near}}$), those that can only be copied from long-range context ($C_{\text{far}}$), and those that cannot be copied at all given a limited context ($C_{\text{none}}$).
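The three categories of target words can be made concrete with a small sketch. It assumes the history is a list ordered oldest first, a nearby boundary of 50 tokens, and a limited context of 300 tokens; the function name and the returned labels are hypothetical and used only for illustration.

```python
def copy_category(history, target, boundary=50, n=300):
    """Classify a target word by where it could be copied from.

    Returns "near" if the target occurs within the `boundary` most recent
    tokens, "far" if it occurs only beyond them (within the last n tokens),
    and "none" if it does not occur in the limited context at all.
    """
    if boundary <= 0 or n <= 0:
        raise ValueError("boundary and n must be positive")
    window = history[-n:]            # limited context of at most n tokens
    nearby = window[-boundary:]
    distant = window[:-boundary]
    if target in nearby:
        return "near"
    if target in distant:
        return "far"
    return "none"

# Toy usage with a tiny window so the three cases are easy to see.
toy_history = list("abcdefg")
labels = [copy_category(toy_history, w, boundary=2, n=5) for w in "gdaz"]
```

In the toy example, `g` lies in the nearby region, `d` only in the distant region, `a` falls outside the limited window, and `z` never occurs.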

Can LSTMs copy words without caches?
Even without a cache, LSTMs often regenerate words that have already appeared in prior context. We investigate how much the model relies on the previous occurrences of the upcoming target word, by analyzing the change in loss after dropping and replacing this target word in the context.
LSTMs can regenerate words seen in nearby context. In order to demonstrate the usefulness of target word occurrences in context, we experiment with dropping all the distant context versus dropping only occurrences of the target word from the context. In particular, we compare removing all tokens after the 50 most recent tokens (Equation 1 with $n = 50$), versus removing only the target word, in context of size $n = 300$: $\delta_{\text{drop}}(w_{t-1}, \ldots, w_{t-n}) = f_{\text{word}}(w_t, (w_{t-1}, \ldots, w_{t-n}))$, (4) where $f_{\text{word}}(w, \text{span})$ drops words equal to $w$ in a given span. We compare applying both perturbations to a baseline model with unperturbed context restricted to $n = 300$. We also include the target words that never appear in the context ($C_{\text{none}}$) as a control set for this experiment.
Figure 4: Effects of perturbing the target word in the context compared to dropping long-range context altogether, on PTB. Error bars represent 95% confidence intervals. (a) Dropping tokens: words that can only be copied from long-range context are more sensitive to dropping all the distant words than to dropping the target. For words that can be copied from nearby context, dropping only the target has a much larger effect on loss compared to dropping the long-range context. (b) Perturbing occurrences of target word in context: replacing the target word with other tokens from vocabulary hurts more than dropping it from the context, for words that can be copied from nearby context, but has no effect on words that can only be copied from far away.
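Dropping and replacing occurrences of the target word in the context can be sketched in a few lines. This is a minimal sketch, assuming the context is a list of token strings; the function names are hypothetical, and the default replacement token `<unk>` is only one of the possible replacement choices.

```python
def drop_word(context, target):
    """Remove every occurrence of `target` from the context."""
    return [w for w in context if w != target]

def replace_word(context, target, replacement="<unk>"):
    """Replace every occurrence of `target` with `replacement`,
    keeping the context length unchanged."""
    return [replacement if w == target else w for w in context]

# Toy usage.
toy_context = ["the", "bank", "said", "the", "bank", "will", "lend"]
dropped = drop_word(toy_context, "bank")
replaced = replace_word(toy_context, "bank")
```

Replacement keeps the number of context tokens fixed, so a comparison between the two variants separates the effect of losing the word's identity from the effect of shortening the context.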
The results show that LSTMs rely on the rough semantic representation of the faraway context to generate $C_{\text{far}}$, but directly copy $C_{\text{near}}$ from the nearby context. In Figure 4a, the long-range context bars show that for words that can only be copied from long-range context ($C_{\text{far}}$), removing all distant context is far more disruptive than removing only occurrences of the target word (12% and 2% increase in perplexity, respectively). This suggests that the model relies more on the rough semantic representation of faraway context to predict these $C_{\text{far}}$ tokens, rather than directly copying them from the distant context. On the other hand, for words that can be copied from nearby context ($C_{\text{near}}$), removing all long-range context has a smaller effect (about 3.5% increase in perplexity) as seen in Figure 4a, compared to removing the target word, which increases perplexity by almost 9%. This suggests that these $C_{\text{near}}$ tokens are more often copied from nearby context than inferred from information found in the rough semantic representation of long-range context. However, is it possible that dropping the target tokens altogether hurts the model too much by adversely affecting the grammaticality of the context? We test this theory by replacing target words in the context with other words from the vocabulary. This perturbation is similar to Equation 4, except instead of dropping the token, we replace it with a different one. In particular, we experiment with replacing the target with <unk>, to see if having the generic word is better than not having any word. We also replace it with a word that has the same part-of-speech tag and a similar frequency in the dataset, to observe how much this change confuses the model. Figure 4b shows that replacing the target with other words results in up to a 14% increase in perplexity for $C_{\text{near}}$, which suggests that the replacement token confuses the model far more than when the token is simply dropped.
However, the words that rely on the long-range context, $C_{\text{far}}$, are largely unaffected by these changes, which confirms our conclusion from dropping the target tokens: $C_{\text{far}}$ words are predicted from the rough representation of faraway context instead of specific occurrences of certain words.
Figure 5: Success of neural cache on PTB. Brightly shaded region shows peaky distribution.

How does the cache help?
If LSTMs can already regenerate words from nearby context, how are copy mechanisms helping the model? We answer this question by analyzing how the neural cache model (Grave et al., 2017b) helps with improving model performance.
The cache records the hidden state $h_t$ at each timestep $t$, and computes a cache distribution over the words in the history as follows:
$$P_{cache}(w_{t+1} \mid w_1, \ldots, w_t; h_1, \ldots, h_t) \propto \sum_{i=1}^{t-1} \mathbb{1}[w_{i+1} = w_{t+1}] \exp(\theta \, h_i^{\top} h_t)$$
where $\theta$ controls the flatness of the distribution. This cache distribution is then interpolated with the model's output distribution over the vocabulary. Consequently, certain words from the history are upweighted, encouraging the model to copy them.
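The cache computation described above can be illustrated with a minimal sketch in plain Python. It assumes hidden states are given as lists of floats, that the interpolation is linear with a weight `lam`, and that distributions are stored as dictionaries; the function names and the dictionary representation are illustrative choices, not the implementation used in the experiments.

```python
import math

def cache_distribution(hidden_states, next_words, query, theta):
    """Cache distribution over words seen in the history.

    hidden_states[i] is a stored state h_i and next_words[i] is the word
    that followed it (w_{i+1}). query is the current state h_t.
    Each stored word receives mass proportional to exp(theta * h_i . h_t).
    """
    logits = [theta * sum(a * b for a, b in zip(h, query))
              for h in hidden_states]
    max_logit = max(logits)  # subtract the max for numerical stability
    scores = {}
    for logit, word in zip(logits, next_words):
        scores[word] = scores.get(word, 0.0) + math.exp(logit - max_logit)
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}

def interpolate(p_model, p_cache, lam):
    """Linear interpolation (an assumption): (1 - lam) * model + lam * cache."""
    vocab = set(p_model) | set(p_cache)
    return {w: (1.0 - lam) * p_model.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in vocab}

# Toy usage: three stored states, two of which were followed by "stock".
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
followers = ["stock", "bank", "stock"]
p_cache = cache_distribution(history, followers, query=[1.0, 0.0], theta=2.0)
p_final = interpolate({"stock": 0.1, "bank": 0.1, "the": 0.8}, p_cache, lam=0.2)
```

With `theta` set to 0 every stored position receives equal weight, so the cache reduces to a unigram count over the history; larger values concentrate the mass on positions whose hidden states resemble the current one.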
Caches help words that can be copied from long-range context the most. In order to study the effectiveness of the cache for the three classes of words ($C_{near}$, $C_{far}$, $C_{none}$), we evaluate an LSTM language model with and without a cache, and measure the difference in perplexity for these words. In both settings, the model is provided all prior context (not just 300 tokens) in order to replicate the Grave et al. (2017b) setup. The amount of history recorded, known as the cache size, is a hyperparameter set to 500 past timesteps for PTB and 3,875 for Wiki, both values very similar to the average document lengths in the respective datasets. We find that the cache helps words that can only be copied from long-range context ($C_{far}$) more than words that can be copied from nearby ($C_{near}$). This is illustrated by Figure 7: without caching, $C_{near}$ words see a 22% increase in perplexity for PTB and a 32% increase for Wiki, whereas $C_{far}$ words see a 28% increase for PTB and a 53% increase for Wiki. Thus, the cache is, in a sense, complementary to the standard model, since it especially helps regenerate words from the long-range context where the latter falls short.
Figure 7: Model performance relative to using a cache. Error bars represent 95% confidence intervals. Words that can only be copied from the distant context benefit the most from using a cache.
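The per-class comparison can be sketched as follows. This is a hedged illustration, assuming that each target word has already been assigned a class label and that per-word natural-log probabilities are available from both models; the function names and input layout are hypothetical.

```python
import math
from collections import defaultdict

def perplexity(log_probs):
    """Perplexity from a list of per-word natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def relative_increase_by_class(classes, logp_with_cache, logp_without_cache):
    """Percent increase in perplexity when the cache is removed, per word class.

    classes[i] is the class label of target word i (for example "near",
    "far" or "none"); the two log-probability lists are aligned with it.
    """
    grouped = defaultdict(lambda: ([], []))
    for label, lp_with, lp_without in zip(classes, logp_with_cache,
                                          logp_without_cache):
        grouped[label][0].append(lp_with)
        grouped[label][1].append(lp_without)
    result = {}
    for label, (with_cache, without_cache) in grouped.items():
        ppl_with = perplexity(with_cache)
        ppl_without = perplexity(without_cache)
        result[label] = 100.0 * (ppl_without - ppl_with) / ppl_with
    return result
```

A positive value for a class means those words are harder to predict without the cache, while a negative value means the cache hurts that class.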
However, the cache also hurts the words that cannot be copied from context ($C_{none}$), which make up about 36% of the words in PTB and 20% in Wiki, as illustrated by the bars for "none" in Figure 7. We also provide some case studies showing success (Fig. 5) and failure (Fig. 6) modes for the cache. In the successful case, the cache distribution is concentrated on the single word that it wants to copy. However, when the target is not present in the history, the cache distribution is flatter, illustrating the model's confusion, as shown in Figure 6. This suggests that the neural cache model might benefit from having the option to ignore the cache when it cannot make a confident choice.
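One hypothetical way to give the model the option of ignoring the cache is to gate the interpolation on the flatness of the cache distribution. The sketch below is not part of the study: the normalized-entropy measure, the `max_entropy` threshold, and the linear interpolation weight `lam` are all assumptions chosen for illustration.

```python
import math

def normalized_entropy(dist):
    """Entropy of a distribution divided by its maximum, giving a value in [0, 1]."""
    if len(dist) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0.0)
    return entropy / math.log(len(dist))

def gated_interpolation(p_model, p_cache, lam, max_entropy=0.8):
    """Fall back to the model alone when the cache distribution is too flat.

    max_entropy is a hypothetical threshold, not a value from the paper.
    """
    if not p_cache or normalized_entropy(p_cache) > max_entropy:
        return dict(p_model)
    vocab = set(p_model) | set(p_cache)
    return {w: (1.0 - lam) * p_model.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in vocab}
```

A peaked cache distribution, as in the success case, passes the gate and is interpolated as usual, whereas a nearly uniform one, as in the failure case, is ignored.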

Discussion
The findings presented in this paper provide insight into how LSTMs model context, which can be useful for improving language models. For instance, the discovery that some word types are more important than others can help refine word dropout strategies by making them adaptive to the different word types. Results on the cache also show that we can further improve performance by allowing the model to ignore the cache distribution when it is extremely uncertain, such as in Figure 6. Differences in nearby vs. long-range context suggest that memory models, which feed explicit context representations to the LSTM (Ghosh et al., 2016; Lau et al., 2017), could benefit from representations that specifically capture information orthogonal to that modeled by the LSTM.
In addition, the empirical methods used in this study are model-agnostic and can generalize to models other than the standard LSTM. This opens the path to generating a stronger understanding of model classes beyond test set perplexities, by comparing them across additional axes of information such as how much context they use on average, or how robust they are to shuffled contexts.
Given the empirical nature of this study and the fact that the model and data are tightly coupled, separating model behavior from language characteristics has proved challenging. More specifically, a number of confounding factors, such as vocabulary size and dataset size, make this separation difficult. In an attempt to address this, we have chosen PTB and Wiki, two standard language modeling datasets which are diverse in content (news vs. factual articles) and writing style, and are structured differently (e.g., Wiki articles are 4-6x longer on average and contain extra information such as titles and paragraph/section markers). Making the data sources diverse in nature has provided the opportunity to somewhat isolate effects of the model, while ensuring consistency in results. An interesting extension to further study this separation would lie in experimenting with different model classes and even different languages.
Recently, Chelba et al. (2017), in proposing a new model, showed that on PTB, an LSTM language model with 13 tokens of context performs similarly to the infinite-context LSTM, with close to an 8% increase in perplexity. In our setup, by contrast, restricting the context to 13 tokens yields a 25% increase. We believe this difference is attributable to the fact that their model was trained with restricted context and a different error propagation scheme, while ours was not. Further investigation would be an interesting direction for future work.

Conclusion
In this analytic study, we have empirically shown that a standard LSTM language model can effectively use about 200 tokens of context on two benchmark datasets, regardless of hyperparameter settings such as model size. It is sensitive to word order in the nearby context, but less so in the long-range context. In addition, the model is able to regenerate words from nearby context, but heavily relies on caches to copy words from far away. These findings not only help us better understand these models but also suggest ways for improving them, as discussed in Section 7. While observations in this paper are reported at the token level, deeper understanding of sentence-level interactions warrants further investigation, which we leave to future work.