Learning to Skim Text

Recurrent Neural Networks are showing much promise in many sub-areas of natural language processing, ranging from document classification to machine translation to automatic question answering. Despite their promise, many recurrent models have to read the whole text word by word, which makes them slow on long documents. For example, it is difficult to use a recurrent network to read a book and answer questions about it. In this paper, we present an approach to reading text that skips irrelevant information when needed. The underlying model is a recurrent network that learns how far to jump after reading a few words of the input text. We employ a standard policy gradient method to train the model to make discrete jumping decisions. In our benchmarks on four different tasks, including number prediction, sentiment analysis, news article classification and automatic Q&A, our proposed model, a modified LSTM with jumping, is up to 6 times faster than the standard sequential LSTM, while maintaining the same or even better accuracy.


Introduction
The last few years have seen much success in applying neural networks to many important applications in natural language processing, e.g., part-of-speech tagging, chunking, named entity recognition (Collobert et al., 2011), sentiment analysis (Socher et al., 2011, 2013), document classification (Kim, 2014; Le and Mikolov, 2014; Zhang et al., 2015; Dai and Le, 2015), machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014; Sennrich et al., 2015; Wu et al., 2016), conversational/dialogue modeling (Sordoni et al., 2015; Shang et al., 2015), document summarization (Nallapati et al., 2016), parsing (Andor et al., 2016) and automatic question answering (Q&A) (Hermann et al., 2015; Wang and Jiang, 2016; Trischler et al., 2016; Lee et al., 2016; Seo et al., 2016; Xiong et al., 2016). An important characteristic of all these models is that they read all the text available to them. While this is essential for certain applications, such as machine translation, it also makes these models slow in scenarios with long input text, such as document classification or automatic Q&A. However, the fact that texts are usually written with redundancy inspires us to think about the possibility of reading selectively.
* Most of the work was done when AWY was with Google.
In this paper, we consider the problem of understanding documents with partial reading, and propose a modification to the basic neural architectures that allows them to read input text with skipping. The main benefit of this approach is faster inference because it skips irrelevant information. An unexpected benefit of this approach is that it also helps the models generalize better.
In our approach, the model is a recurrent network, which learns to predict the number of jumping steps after it reads one or several input tokens. Such a discrete model is therefore not fully differentiable, but it can be trained by a standard policy gradient algorithm, where the reward can be the accuracy or its proxy during training.
In our experiments, we use the basic LSTM recurrent networks (Hochreiter and Schmidhuber, 1997) as the base model and benchmark the proposed algorithm on a range of document classification and reading comprehension tasks, using various datasets such as Rotten Tomatoes (Pang and Lee, 2005), IMDB (Maas et al., 2011), AG News (Zhang et al., 2015) and Children's Book Test (Hill et al., 2015). We find that the proposed approach of selective reading speeds up the base model by two to six times. Surprisingly, we also observe that our model beats the standard LSTM in terms of accuracy.
Figure 1: A synthetic example of the proposed model processing a text document. In this example, the maximum size of jump K is 5, the number of tokens read before a jump R is 2 and the number of jumps allowed N is 10. The green softmaxes are for jumping predictions. The processing stops if a) the jumping softmax predicts a 0, or b) the number of jumps exceeds N, or c) the network has processed the last token. We only show case a) in this figure.
In summary, the main contribution of our work is to design an architecture that learns to skim text and to show that it is both faster and more accurate in practical applications of text processing. Our model is simple and flexible enough that we anticipate it can be incorporated into recurrent networks with more sophisticated structures to achieve even better performance in the future.

Methodology
In this section, we introduce the proposed model named LSTM-Jump. We first describe its main structure, followed by the difficulty of estimating part of the model parameters because of nondifferentiability. To address this issue, we appeal to a reinforcement learning formulation and adopt a policy gradient method.

Model Overview
The main architecture of the proposed model, which is based on an LSTM recurrent neural network, is shown in Figure 1. Before training, the number of jumps allowed N, the number of tokens read between every two jumps R and the maximum size of jumping K are chosen ahead of time. While K is a fixed parameter of the model, N and R are hyperparameters that can vary between training and testing. Throughout the paper, we use $d_{1:p}$ to denote the sequence $d_1, d_2, \ldots, d_p$.
In the following, we describe in detail how the model operates when processing text. Given a training example $x_{1:T}$, the recurrent network reads the embeddings of the first R tokens $x_{1:R}$ and outputs the hidden state. This state is then used to compute the jumping softmax, which determines a distribution over the jumping steps between 1 and K. The model samples a jumping step from this distribution, which decides the next token to be read into the model. Let κ be the sampled value; then the next starting token is $x_{R+\kappa}$. This process continues until either a) the jump softmax samples a 0; or b) the number of jumps exceeds N; or c) the model reaches the last token $x_T$. After stopping, the latest hidden state is used as the output for predicting the desired targets. How to leverage the hidden state depends on the specifics of the task at hand. For example, for the classification problems in Sections 3.1, 3.2 and 3.3, it is directly used to produce a softmax for classification, while in the automatic Q&A problem of Section 3.4, it is used to compute the correlation with the candidate answers in order to select the best one. Figure 1 gives an example with K = 5, R = 2 and N = 10 terminating on condition a).
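The reading procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `skim_positions` and `sample_jump` are hypothetical names, `sample_jump` stands in for the jump softmax, the value 0 is assumed to be an extra softmax output reserved for stopping, and a jump of κ is assumed to move the reader from the last token read to the token κ positions later.

```python
def skim_positions(num_tokens, R, K, N, sample_jump):
    """Return the 0-based indices of the tokens that are actually read.

    sample_jump(read_so_far) is a stand-in for the jump softmax and must
    return an integer in {0, 1, ..., K}; 0 is treated as "stop".
    """
    read = []
    pos = 0      # index of the next token to read
    jumps = 0    # number of jumps taken so far
    while pos < num_tokens:
        block_end = min(pos + R, num_tokens)
        read.extend(range(pos, block_end))   # read up to R tokens
        if block_end == num_tokens:          # c) last token processed
            break
        if jumps == N:                       # b) jump budget used up
            break
        kappa = sample_jump(read)
        if not 0 <= kappa <= K:
            raise ValueError("jump size must lie in [0, K]")
        if kappa == 0:                       # a) softmax asks to stop
            break
        jumps += 1
        pos = (block_end - 1) + kappa        # x_{R+kappa} in 1-based terms
    return read


# Always jumping 3 tokens ahead with R = 2 reads tokens 0-1, 4-5 and 8-9.
print(skim_positions(10, R=2, K=5, N=10, sample_jump=lambda read: 3))
```

With κ = 1 at every step the sketch degenerates to ordinary sequential reading, which is one way to see that the standard LSTM is a special case of the jumping reader.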

Training with REINFORCE
Our goal for training is to estimate the parameters of the LSTM and possibly the word embeddings, which are denoted as $\theta_m$, together with the jumping action parameters $\theta_a$. Once obtained, they can be used for inference.
The estimation of $\theta_m$ is straightforward in tasks that can be reduced to classification problems (which is essentially what our experiments cover): the cross entropy objective $J_1(\theta_m)$ is differentiable with respect to $\theta_m$, so we can directly apply backpropagation to minimize it.
However, the discrete jumping decisions made at every step make it difficult to estimate $\theta_a$, as the cross entropy is no longer differentiable with respect to $\theta_a$. Therefore, we formulate this as a reinforcement learning problem and apply a policy gradient method to train the model. Specifically, we need to maximize a reward function over $\theta_a$, which can be constructed as follows.
Let $j_{1:N}$ be the jumping action sequence during training with an example $x_{1:T}$. Suppose $h_i$ is the hidden state of the LSTM right before the $i$-th jump $j_i$ (note that the $i$-th jumping step is usually not $x_i$); then it is a function of $j_{1:i-1}$ and can thus be denoted as $h_i(j_{1:i-1})$. The jump is obtained by sampling from the multinomial distribution $p(j_i \mid h_i(j_{1:i-1}); \theta_a)$, which is determined by the jump softmax. We receive a reward $R$ after processing $x_{1:T}$ under the current jumping strategy. (In the general case, one may receive (discounted) intermediate rewards after each jump. In our case, we only consider the final reward, which is equivalent to the special case in which all intermediate rewards are identical and undiscounted.) The reward should be positive if the output is favorable and non-positive otherwise. In our experiments, we choose

$$R = \begin{cases} 1 & \text{if the prediction is correct,} \\ -1 & \text{otherwise.} \end{cases}$$

Then the objective function of $\theta_a$ that we want to maximize is the expected reward under the distribution defined by the current jumping policy, i.e.,

$$J_2(\theta_a) = \mathbb{E}_{p(j_{1:N}; \theta_a)}[R], \tag{1}$$

where $p(j_{1:N}; \theta_a) = \prod_i p(j_{1:i} \mid h_i(j_{1:i-1}); \theta_a)$.
Optimizing this objective numerically requires computing its gradient, whose exact value is intractable to obtain as the expectation is over high-dimensional interaction sequences. By running S examples, an approximated gradient can be computed by the following REINFORCE algorithm (Williams, 1992):

$$\nabla_{\theta_a} J_2(\theta_a) \approx \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} \left[ \nabla_{\theta_a} \log p(j^s_{1:i} \mid h^s_i; \theta_a) \, R^s \right],$$

where the superscript $s$ denotes a quantity belonging to the $s$-th example. The term $\nabla_{\theta_a} \log p(j_{1:i} \mid h_i; \theta_a)$ can be computed by standard backpropagation.
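Although the above estimate of $\nabla_{\theta_a} J_2(\theta_a)$ is unbiased, it may have very high variance. One widely used remedy to reduce the variance is to subtract a baseline value $b^s_i$ from the reward $R^s$, such that the approximated gradient becomes

$$\nabla_{\theta_a} J_2(\theta_a) \approx \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} \left[ \nabla_{\theta_a} \log p(j^s_{1:i} \mid h^s_i; \theta_a) \, (R^s - b^s_i) \right].$$

It is shown (Williams, 1992; Zaremba and Sutskever, 2015) that any number $b^s_i$ will yield an unbiased estimate. Here, we adopt the strategy of a linear baseline $b^s_i = w_b h^s_i + c_b$, whose parameters $\theta_b = \{w_b, c_b\}$ are learned by minimizing $(R^s - b^s_i)^2$. Hence our final objective to minimize is

$$J(\theta_m, \theta_a, \theta_b) = J_1(\theta_m) - J_2(\theta_a) + \sum_{s=1}^{S} \sum_{i=1}^{N} (R^s - b^s_i)^2,$$

which is fully differentiable and can be solved by standard backpropagation.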
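The baseline-corrected REINFORCE estimate described above can be illustrated for a single jump decision. The following is a minimal sketch, not the authors' code: it assumes the jump policy is a softmax over logits and differentiates with respect to those logits directly (in the real model the gradient would flow further back into $\theta_a$), and the names `reinforce_grad` and `batch_estimate` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits, action, reward, baseline=0.0):
    """Gradient of (reward - baseline) * log p(action) w.r.t. the logits.

    For a softmax policy, d log p(a) / d z_k = 1[k == a] - p_k.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [advantage * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]


def batch_estimate(episodes):
    """Average the per-episode gradients over S sampled episodes.

    Each episode is a tuple (logits, action, reward, baseline) for one jump;
    all episodes are assumed to share the same number of logits.
    """
    S = len(episodes)
    grads = [reinforce_grad(*ep) for ep in episodes]
    return [sum(g[k] for g in grads) / S for k in range(len(grads[0]))]
```

When the baseline equals the reward, the gradient vanishes: the update is driven only by how much better or worse the outcome was than expected, which is the intuition behind the variance reduction.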

Inference
During inference, we can either sample from the jump softmax or evaluate greedily by selecting the most probable jumping step and following that path. In our experiments, we adopt the sampling scheme.
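The two inference schemes differ only in how a jump is picked from the softmax output. A small sketch, assuming `probs` is the list of jump probabilities produced by the jump softmax (the function name `choose_jump` is hypothetical):

```python
import random


def choose_jump(probs, greedy=False, rng=None):
    """Pick a jump size (an index into probs) from the jump softmax output."""
    if greedy:
        # Most probable jump; ties are broken in favour of the smaller jump.
        return max(range(len(probs)), key=lambda k: probs[k])
    rng = rng or random.Random()
    # Sample an index with probability proportional to its weight.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance makes the sampled path reproducible, which can be convenient when comparing timings across runs.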

Experimental Results
In this section, we present our empirical studies to understand the efficiency of the proposed model in reading text. The tasks under experimentation are: synthetic number prediction, sentiment analysis, news topic classification and automatic question answering. All except the first are representative tasks in text reading, involving different sizes of datasets and various levels of text processing, from character to word to sentence. Table 1 summarizes the statistics of the datasets in our experiments.
To exclude the potential impact of advanced models, we restrict our comparison to the vanilla LSTM (Hochreiter and Schmidhuber, 1997) and our model, which is referred to as LSTM-Jump. In a nutshell, we show that, while achieving the same or even better testing accuracy, our model is up to 6 times and 66 times faster than the baseline LSTM model on real and synthetic datasets, respectively, as we are able to selectively skip a large fraction of the text.
In fact, the proposed model can be readily extended to other recurrent neural networks with sophisticated mechanisms such as attention and/or hierarchical structure to achieve higher accuracy than reported below. However, this is orthogonal to the main focus of this work and is left as future work.
General Experiment Settings. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 in all experiments. We also apply gradient clipping to all the trainable variables with a threshold of 1.0. The dropout rate between the LSTM layers is 0.2 and the embedding dropout rate is 0.1. We repeat the notations N, K, R defined previously in Table 2, so that readers can easily refer to them when looking at Tables 4, 5, 6 and 7. While K is fixed during both training and testing, we fix R and N during training but vary their values at test time to see the impact of parameter changes. Note that N is essentially a constraint which can be relaxed. Yet we prefer to enforce this constraint here to let the model learn to read fewer tokens. Finally, the reported test time is measured by running one pass over the whole test set instance by instance, and the speedup is relative to the base LSTM model. The code is written with TensorFlow.
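For reference, the stated settings can be collected in one place, together with a sketch of gradient clipping. The text does not say whether clipping is applied by value or by norm; the sketch below assumes clipping by global L2 norm, and both the dictionary keys and the function name are hypothetical.

```python
import math

# Values stated in the text; the key names are hypothetical.
SETTINGS = {
    "optimizer": "adam",
    "learning_rate": 0.001,
    "clip_threshold": 1.0,
    "lstm_dropout": 0.2,
    "embedding_dropout": 0.1,
}


def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient vectors so their joint L2 norm <= threshold."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= threshold:
        return [list(vec) for vec in grads]   # already small enough
    scale = threshold / norm
    return [[g * scale for g in vec] for vec in grads]
```

Clipping by global norm preserves the direction of the overall update, whereas clipping each value independently would not; either choice is compatible with the description above.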

Notation   Meaning
N          number of jumps allowed
K          maximum size of jumping
R          number of tokens read before a jump

Table 2: Notations referred to in experiments.

Number Prediction with a Synthetic Dataset
We first test whether LSTM-Jump is indeed able to learn how to jump if a very clear jumping signal is present.
The results of LSTM and our method, LSTM-Jump, are shown in Table 3. The first observation is that LSTM-Jump is faster than LSTM; the longer the sequence is, the larger the speedup LSTM-Jump gains. This is because the well-trained LSTM-Jump is aware of the jumping signal at the first token and hence can jump directly to the output position to make the prediction, while LSTM is agnostic to the signal and has to read the whole sequence. As a result, the reading speed of LSTM-Jump is hardly affected by the length of the sequence, whereas the reading time of LSTM grows linearly with the length. Besides, LSTM-Jump also outperforms LSTM in terms of test accuracy in all cases. This is not surprising either, as LSTM has to read a large number of tokens that are potentially not helpful and could interfere with the prediction. In summary, the results indicate that LSTM-Jump is able to learn to jump if the signal is clear.

Word Level Sentiment Analysis with Rotten Tomatoes and IMDB datasets
As LSTM-Jump has shown great speedups on the synthetic dataset, we would like to understand whether it carries this benefit over to real-world data, where the "jumping" signal is not explicit. So in this section, we conduct sentiment analysis on two movie review datasets, both containing equal numbers of positive and negative reviews. The first dataset is Rotten Tomatoes, which contains 10,662 documents. Since there is no standard split, we randomly select around 80% for training, 10% for validation, and 10% for testing. The average and maximum lengths of the reviews are 22 and 56 words respectively, and we pad each of them to 60. We use the pre-trained word2vec embeddings (Mikolov et al., 2013) as a fixed word embedding matrix, which is not updated during training. Both LSTM-Jump and LSTM contain 2 layers and 256 hidden units, and the batch size is 100. As the amount of training data is small, we slightly augment the data by sampling a continuous 50-word sequence from each padded review as one training sample. During training, we force LSTM-Jump to read 8 tokens before a jump (R = 8), the maximum number of tokens skipped per jump is 10 (K = 10), and the number of jumps allowed is 3 (N = 3).
The testing result is reported in Table 4. In a nutshell, LSTM-Jump is always faster than LSTM under different combinations of R and N. At the same time, the accuracy is on par with that of LSTM. In particular, the combination (R, N) = (7, 4) even achieves slightly better accuracy than LSTM while having a 1.5x speedup.
Table 4: Testing time and accuracy on the Rotten Tomatoes review classification dataset. The maximum size of jumping K is set to 10 for all the settings. The jumping level is word.
The second dataset is IMDB (Maas et al., 2011), which contains 25,000 training and 25,000 testing movie reviews, where the average length of text is 240 words, much longer than that of Rotten Tomatoes. We randomly set aside about 15% of the training data as a validation set. Both LSTM-Jump and LSTM have one layer and 128 hidden units, and the batch size is 50. Again, we use pretrained word2vec embeddings as initialization, but they are updated during training. We either pad a short sequence to 400 words or randomly select a 400-word segment from a long sequence as a training example. During training, we set R = 20, K = 40 and N = 5.
As Table 5 shows, the result exhibits a trend similar to that found on Rotten Tomatoes: LSTM-Jump is uniformly faster than LSTM under many settings. The various (R, N) combinations again demonstrate the trade-off between efficiency and accuracy. If one cares more about accuracy, then allowing LSTM-Jump to read and jump more times is a good choice. Otherwise, shrinking either one brings a significant speedup, though at the price of losing some accuracy. Nevertheless, the configuration with the highest accuracy still enjoys a 1.6x speedup compared to LSTM. With a slight loss of accuracy, LSTM-Jump can be 2.5x faster.
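The pad-or-crop construction of training examples can be sketched as follows. This is an illustration only: the padding token, the choice to pad on the right, and the function name `pad_or_crop` are assumptions that the text does not specify.

```python
import random


def pad_or_crop(tokens, length=400, pad_token="<pad>", rng=None):
    """Return exactly `length` tokens: pad short inputs, randomly crop long ones."""
    if len(tokens) <= length:
        # Short sequence: append padding tokens up to the target length.
        return list(tokens) + [pad_token] * (length - len(tokens))
    rng = rng or random.Random()
    # Long sequence: pick a random contiguous window (randint is inclusive).
    start = rng.randint(0, len(tokens) - length)
    return list(tokens[start:start + length])
```

Because a fresh window is drawn each time a long review is visited, the random crop also acts as a mild form of data augmentation during training.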

Character Level News Article Classification with AG dataset
We now present results on testing character-level jumping with a news article classification problem. The dataset contains four classes of topics (World, Sports, Business, Sci/Tech) from the AG's news corpus, a collection of more than 1 million news articles. The data we use is the subset constructed by Zhang et al. (2015) for classification with character-level convolutional networks. There are 30,000 training and 1,900 testing examples for each class, and 15% of the training data is set aside for validation. The non-space alphabet under use are:
The result is summarized in Table 6. It is interesting to see that even with skipping, LSTM-Jump is not always faster than LSTM. This is mainly because the embedding size and hidden layer are both much smaller than those used previously, and accordingly the processing of a token is much faster. In that case, other computational overhead, such as calculating and sampling from the jump softmax, might become the dominating factor in efficiency. From this cross-task comparison, we can see that the larger the hidden unit size of the recurrent neural network and the embedding are, the more speedup LSTM-Jump can gain, which is also confirmed by the task below.

Sentence Level Automatic Question Answering with Children's Book Test dataset
The last task is automatic question answering, in which we aim to test the sentence-level skimming of LSTM-Jump. We benchmark on the Children's Book Test (CBT) dataset (Hill et al., 2015). In each document, there are 20 contiguous sentences (context) extracted from a children's book followed by a query sentence. A word of the query is deleted and the task is to select the best fit for this position from 10 candidates. Originally, there are four types of tasks according to the part of speech of the missing word, from which we choose the most difficult two, i.e., the named entity (NE) and common noun (CN), as our focus, since simple language models can already achieve human-level performance on the other two types. The models, LSTM or LSTM-Jump, first read the whole query, then the context sentences, and finally output the predicted word. While LSTM reads everything, our jumping model decides how many context sentences to skip after reading one sentence. Whenever a model finishes reading, the context and query are encoded in its hidden state $h_o$, and the best answer is the candidate word whose index maximizes the following:

$$\mathrm{softmax}(C W h_o) \in \mathbb{R}^{10},$$

where $C \in \mathbb{R}^{10 \times d}$ is the word embedding matrix of the 10 candidates and $W \in \mathbb{R}^{d \times \text{hidden size}}$ is a trainable weight variable. Using such a bilinear form to select the answer follows prior work, as it has been shown to perform well. The task is thus reduced to a classification problem with 10 classes.
We either truncate or pad each context sentence, such that they all have length 20. The same preprocessing is applied to the query sentences, except that the length is set to 30. For both models, the number of layers is 2, the number of hidden units is 256 and the batch size is 32. Pretrained word2vec embeddings are again used and they are not adjusted during training. The maximum number of context sentences LSTM-Jump can skip per jump is K = 5, while the total number of jumps is limited to N = 5. We let the model jump after reading every sentence, so R = 1 (20 words).
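The bilinear answer selection described above can be sketched with plain lists. This is a minimal illustration under assumptions, not the authors' code: the helper names are hypothetical, and the final hidden state is passed in as a plain vector `h`.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def best_candidate(C, W, h):
    """Return the index of the candidate with the highest score (C W h)_k.

    C: 10 x d candidate embeddings, W: d x hidden_size, h: final hidden state.
    Softmax is monotonic, so the argmax of the raw scores equals the argmax
    of the softmax over those scores.
    """
    scores = matvec(C, matvec(W, h))
    return max(range(len(scores)), key=lambda k: scores[k])
```

During training the scores would be passed through a softmax and a cross entropy loss over the 10 candidates; at test time only the argmax is needed, so the softmax can be skipped.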
The result is reported in Table 7. The performance of LSTM-Jump is superior to LSTM in terms of both accuracy and efficiency under all settings in our experiments. In particular, the fastest LSTM-Jump configuration achieves a remarkable 6x speedup over LSTM, while also having respectively 1.4% and 4.4% higher accuracy on Children's Book Test - Named Entity and Children's Book Test - Common Noun.
Table 7: Testing time and accuracy on the Children's Book Test dataset. The maximum size of jumping K is set to 5 for all the settings. The jumping level is sentence.
The dominant performance of LSTM-Jump over LSTM might be interpreted as follows. After reading the query, both LSTM and LSTM-Jump know what the question is. However, LSTM still has to process the remaining 20 sentences, and thus at the very end of the last sentence the long dependency between the question and the output might become so weak that the prediction is hampered. In contrast, the question can guide LSTM-Jump on how to read selectively and stop early when the answer is clear. Therefore, when it comes to the output stage, the "memory" is both fresh and uncluttered, so that a more accurate answer is likely to be picked.
In the following, we show two examples of how the model reads the context given a query (bold face sentences are those read by our model, in increasing order). XXXXX is the missing word we want to fill. Note that due to truncation, a few sentences might look incomplete.
Example 1 In the first example, the exact answer appears in the context multiple times, which makes the task relatively easy, as long as the reader has captured their occurrences.
(a) Query: 'XXXXX!
(b) Context:
1. said Big Klaus, and he ran off at once to Little Klaus.
2. 'Where did you get so much money from?'
3. 'Oh, that was from my horse-skin.
4. I sold it yesterday evening.'
5. 'That 's certainly a good price!'
6. said Big Klaus; and running home in great haste, he took an axe, knocked all his four
7. 'Skins!
8. skins!
9. Who will buy skins?'
10. he cried through the streets.
11. All the shoemakers and tanners came running to ask him what he wanted for them.'
12. A bushel of money for each,' said Big Klaus.
13. 'Are you mad?'
14. they all exclaimed.
15. 'Do you think we have money by the bushel?'
16. 'Skins!
17. skins!
18. Who will buy skins?'
19. he cried again, and to all who asked him what they cost, he answered,' A bushel
20. 'He is making game of us,' they said; and the shoemakers seized their yard measures and
(c) Candidates: Klaus | Skins | game | haste | head | home | horses | money | price | streets
(d) Answer: Skins
The reading behavior might be interpreted as follows. The model tries to search for clues, and after reading sentence 8, it realizes that the most plausible answer is "Klaus" or "Skins", as they both appear twice. "Skins" is more likely to be the answer as it is followed by a "!". The model searches further to see if "Klaus!" is mentioned somewhere, but it only finds "Klaus" without "!" for the third time. After the last attempt at sentence 14, it is confident about the answer and stops to output "Skins".
Example 2 In this example, the answer is described by the word "nuisance", which does not show up in the context at all. Hence, to answer the query, the model has to understand the meaning of both the query and the context and locate a synonym of "nuisance". This goes beyond verbatim matching and is thus much harder than the previous example. Nevertheless, our model is still able to make the right choice while reading far fewer sentences.
(a) Query: Yes, I call XXXXX a nuisance.
(b) Context:
1. But to you and me it would have looked just as it did to Cousin Myra -a very discontented
2. "I'm awfully glad to see you, Cousin Myra, "explained Frank carefully, "and your
3. But Christmas is just a bore -a regular bore."
4. That was what Uncle Edgar called things that didn't interest him, so that Frank felt pretty sure of
5. Nevertheless, he wondered uncomfortably what made Cousin Myra smile so queerly.
6. "Why, how dreadful!"
7. she said brightly.
8. "I thought all boys and girls looked upon Christmas as the very best time in the year."
9. "We don't, "said Frank gloomily.
10. "It's just the same old thing year in and year out.
11. We know just exactly what is going to happen.
12. We even know pretty well what presents we are going to get.
13. And Christmas Day itself is always the same.
14. We'll get up in the morning , and our stockings will be full of things, and half of
15. Then there 's dinner.
16. It 's always so poky.
17. And all the uncles and aunts come to dinner -just the same old crowd, every year, and
18. Aunt Desda always says, 'Why, Frankie, how you have grown!'
19. She knows I hate to be called Frankie.
20. And after dinner they'll sit round and talk the rest of the day, and that's all.
(c) Candidates:
The reading behavior can be interpreted as follows. After reading the query, our model realizes that the answer should be something like a nuisance. Then it starts to process the text. Once it hits sentence 3, it may begin to consider "Christmas" as the answer, since "bore" is a synonym of "nuisance". Yet the model is not 100% sure, so it continues to read very conservatively: it does not jump for the next three sentences. After that, the model gains more confidence in the answer "Christmas" and makes a large jump to see if there is something that could overturn the current hypothesis. It turns out that the last-read sentence is still talking about Christmas in a negative voice. Therefore, the model stops and takes "Christmas" as the output.

Related Work
Closely related to our work is the idea of learning visual attention with neural networks (Sermanet et al., 2014), where a recurrent model is used to combine visual evidence at multiple fixations processed by a convolutional neural network. Similar to our approach, the model is trained end-to-end using the REINFORCE algorithm (Williams, 1992). However, a major difference between those works and ours is that we have to sample from a discrete jumping distribution, while they can sample from a continuous distribution such as a Gaussian. The difference is mainly due to the inherent characteristics of text and images. In fact, it has been pointed out that it was difficult to learn policies over more than 25 possible discrete locations.
This idea has recently been explored in the context of natural language processing applications, where the main goal is to filter irrelevant content using a small network (Choi et al., 2016). Perhaps the most closely related to our work is the concurrent work on learning to reason with reinforcement learning (Shen et al., 2016). The key difference between our work and Shen et al. (2016) is that they focus on early stopping after multiple passes over the data to ensure accuracy, whereas our method focuses on selective reading with a single pass to enable fast processing.
The concept of "hard" attention has also been used successfully in the context of making neural network predictions more interpretable (Lei et al., 2016). The key difference between our work and the method of Lei et al. (2016) is that our method optimizes for faster inference and is more dynamic in its jumping. The same difference holds between our approach and the "soft" attention approach of Bahdanau et al. (2014). Recently, Hahn and Keller (2016) investigated how a machine can fixate and skip words, focusing on the comparison between the behavior of machine and human, while our goal is to make reading faster. They model the probability that each single word should be read in an unsupervised way, while ours directly models the probability of how many words should be skipped with supervised learning.
Our method belongs to adaptive computation for neural networks, an idea recently explored by Graves (2016) and Jernite et al. (2016), where different amounts of computation are allocated dynamically per time step. The main difference between our method and those of Graves (2016) and Jernite et al. (2016) is that our method can set the amount of computation to be exactly zero for many steps, thereby achieving faster scanning over texts. Even though our method requires policy gradient methods to train, which is a disadvantage compared to (Graves, 2016; Jernite et al., 2016), we do not find training with policy gradient methods problematic in our experiments.
At a high level, our model can be viewed as a simplified trainable Turing machine, where the controller can move on the input tape. It is therefore related to the prior work on Neural Turing Machines and especially its RL version (Zaremba and Sutskever, 2015). Compared to (Zaremba and Sutskever, 2015), the output tape in our method is simpler and the reward signals in our problems are less sparse, which explains why our model is easy to train. It is worth noting that Zaremba and Sutskever report difficulty in using policy gradients to train their model. Our method, by skipping irrelevant content, shortens the length of recurrent networks, thereby addressing the vanishing or exploding gradients in them (Hochreiter et al., 2001). The baseline method itself, Long Short Term Memory (Hochreiter and Schmidhuber, 1997), belongs to the same category of methods. In this category, there are several recent methods that try to achieve the same goal, such as recurrent networks that operate at different frequencies (Koutnik et al., 2014) or are organized in a hierarchical fashion (Chan et al., 2015; Chung et al., 2016).
Lastly, we should point out that our work is among the recent efforts that bring reinforcement learning to the field of natural language processing, some of which have achieved encouraging results in areas such as neural symbolic machines (Liang et al., 2017), machine reasoning (Shen et al., 2016) and sequence generation (Ranzato et al., 2015).

Conclusions
In this paper, we focus on learning how to skim text for fast reading. In particular, we propose a "jumping" model that, after reading every few tokens, decides how many tokens to skip by sampling from a softmax. Such jumping behavior is modeled as a discrete decision making process, which can be trained by a reinforcement learning algorithm such as REINFORCE. In four different tasks with six datasets (one synthetic and five real), we test the efficiency of the proposed method on various levels of text jumping, from character to word and then to sentence. The results indicate that our model is several times faster than the baseline LSTM model, while its accuracy is on par with it.