Why and when should you pool? Analyzing Pooling in Recurrent Architectures

Pooling-based recurrent neural architectures consistently outperform their counterparts without pooling on sequence classification tasks. However, the reasons for their enhanced performance are largely unexamined. In this work, we examine three commonly used pooling techniques (mean-pooling, max-pooling, and attention), and propose *max-attention*, a novel variant that captures interactions among predictive tokens in a sentence. Using novel experiments, we demonstrate that pooling architectures substantially differ from their non-pooling equivalents in their learning ability and positional biases: (i) pooling facilitates better gradient flow than BiLSTMs in initial training epochs, and (ii) BiLSTMs are biased towards tokens at the beginning and end of the input, whereas pooling alleviates this bias. Consequently, we find that pooling yields large gains in low-resource scenarios, and in instances when salient words lie towards the middle of the input. Across several text classification tasks, we find max-attention to frequently outperform other pooling techniques.


Introduction
Pooling mechanisms are ubiquitous components in Recurrent Neural Networks (RNNs) used for natural language tasks. Pooling operations consolidate hidden representations from RNNs into a single sentence representation. Various pooling techniques, like mean-pooling, max-pooling, and attention, have been shown to improve the performance of RNNs on text classification tasks (Lai et al., 2015; Conneau et al., 2017). Despite widespread adoption, precisely how and when pooling benefits the models is largely under-explored.
In this work, we perform an in-depth analysis comparing popular pooling methods (and our proposed max-attention) with standard BiLSTMs for several text classification tasks. We identify two key factors that explain the benefits of pooling techniques: learnability and positional invariance.
First, we analyze the flow of gradients for different classification tasks to assess the learning ability of BiLSTMs (§5). We observe that the gradients corresponding to hidden representations in the middle of the sequence vanish during the initial epochs. With further training, these gradients slowly recover, suggesting that the gates of standard BiLSTMs require many examples to learn. In contrast, we find the gradient norms in pooling-based architectures to be free from this problem. Pooling enables a fraction of the gradients to directly reach any hidden state instead of having to backpropagate through a long series of recurrent cells. Thus we hypothesize, and subsequently confirm, that pooling is particularly beneficial for tasks with long input sequences.
Second, we explore the positional biases of BiLSTMs, with and without pooling (§6). Across several classification tasks and various novel experimental setups, we show that BiLSTMs are less responsive to tokens towards the middle of the sequence than to tokens at the beginning or the end of the sequence. However, we find that this bias is largely absent in pooling-based architectures, indicating their ability to respond to salient tokens regardless of their position.
Third, we propose max-attention, a novel pooling technique, which combines the advantages of max-pooling and attention (§3.2). Max-attention uses the max-pooled representation as its query vector to compute the attention weights for each hidden state. Max-pooled representations are extensively used in the literature to capture prominent tokens (or objects) in a sentence (or an image) (Zhang and Wallace, 2015; Boureau et al., 2010b). Therefore, using them as a query vector effectively captures interactions among salient portions of the input. Max-attention is simple to use, and yields performance gains over other pooling methods on several classification setups.

Related Work
Pooling: A wide body of work compares the performance of different pooling techniques in object recognition tasks (Boureau et al., 2010a,b, 2011) and finds max-pooling to generally outperform mean-pooling. However, pooling in natural language tasks is relatively understudied. For some text classification tasks, pooled recurrent architectures (Lai et al., 2015; Zhang and Wallace, 2015; Johnson and Zhang, 2016; Jacovi et al., 2018; Yang et al., 2016a) outperform CNNs and BiLSTMs. Additionally, for textual entailment tasks, Conneau et al. (2017) find that max-pooled representations better capture salient words in a sentence. Our work extends this analysis and examines several pooling techniques, including attention, for BiLSTMs applied to natural language tasks. While past approaches assess the ability of pooling to capture linguistic phenomena, to the best of our knowledge, we are the first to systematically study the training advantages of various pooling techniques.
Attention: First proposed as a way to align target tokens to the source tokens in translation (Bahdanau et al., 2014), the core idea behind attention (learning a weighted sum of the hidden states) has been widely adopted. As attention aggregates hidden representations, we consider it under the umbrella of pooling. Recently, Pruthi et al. (2020) conjecture that attention offers benefits during training; our work explains, and provides empirical evidence to support, this speculation.
Gradient Propagation: Vanilla RNNs are known to suffer from the problem of vanishing and exploding gradients (Hochreiter, 1991; Bengio et al., 1994). In response, Hochreiter and Schmidhuber (1997) introduced LSTMs, which provide a direct connection through all the cells, allowing the network to remember new inputs without forgetting prior history. However, recent work suggests that LSTMs do not solve this problem completely (Arjovsky et al., 2015; Chandar et al., 2019). Our work quantitatively investigates this phenomenon, exposing scenarios where the effect is pronounced, and demonstrating how pooling techniques mitigate the problem, leading to better sample efficiency and generalization.

Background and Notation
Let $s = \{x_1, x_2, \ldots, x_n\}$ be an input sentence, where $x_t$ is a representation of the input word at position $t$. A recurrent neural network such as an LSTM produces a hidden state $h_t$ and a cell state $c_t$ for each input word $x_t$, where $h_t, c_t = \phi(h_{t-1}, c_{t-1}, x_t)$. Standard BiLSTMs concatenate the last hidden state of the forward LSTM and the first hidden state of the backward LSTM for the final sentence representation:

$$s_{\text{emb}} = [\overrightarrow{h}_n \, ; \, \overleftarrow{h}_1].$$

The sentence embedding $s_{\text{emb}}$ is further fed to a downstream text classifier. For training BiLSTMs, multiple works have emphasized the importance of initializing the bias for forget gates to a high value (between 1 and 2) to prevent the model from forgetting information before it learns what to forget (Gers et al., 2000; van der Westhuizen and Lasenby, 2018). Hence, in our analysis, we experiment with both a high and a low value of the forget-gate bias. For the non-pooled BiLSTM, we initialize the forget-gate bias to 1, unless specified. For brevity, from here on we use $h_t$ to mean the concatenation $[\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t]$. Below, we formally discuss popular pooling techniques.

Max-pooling: For a max-pooled BiLSTM (MAXPOOL), the sentence embedding $s_{\text{emb}}$ is:

$$s_{\text{emb}}^i = \max_{t} h_t^i,$$

where $h_t^i$ represents the $i$-th dimension of the hidden state corresponding to the word at position $t$. This implies that while backpropagating the loss, we find a direct pathway to the $t$-th hidden state:

$$\frac{\partial s_{\text{emb}}^i}{\partial h_t^i} = \mathbb{1}\!\left[t = \arg\max_{t'} h_{t'}^i\right].$$

Similarly, in mean-pooling (MEANPOOL), $s_{\text{emb}}$ is an average over all the hidden states.

Attention: Attention (ATT) works by calculating a non-negative weight for each hidden state such that the weights together sum to 1. Hidden representations are then multiplied with these weights and summed, resulting in a fixed-length vector (Bahdanau et al., 2014; Luong et al., 2015):

$$\alpha_t = \frac{\exp(h_t^\top q)}{\sum_{t'} \exp(h_{t'}^\top q)}, \qquad s_{\text{emb}} = \sum_{t} \alpha_t h_t,$$

where $q$ is a learnable query vector.

[Figure 1: A pictorial overview of the pooling techniques. Left: element-wise mean and max pooling operations aggregate hidden representations. Right: attention scores ($\alpha$) are computed using the similarity between hidden representations ($h$) and a query vector ($q$), which are subsequently used to weight hidden representations. Our proposed max-attention uses the sentence embedding from max-pooling as a query to attend over hidden states.]
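To make these operations concrete, below is a minimal PyTorch sketch of the three standard pooling operations, together with a helper for the forget-gate bias initialization discussed above. The function names, mask handling, and tensor shapes are our own illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden, mask):
    # hidden: (batch, seq_len, dim); mask: (batch, seq_len), 1 for real tokens.
    m = mask.unsqueeze(-1).float()
    return (hidden * m).sum(dim=1) / m.sum(dim=1)

def max_pool(hidden, mask):
    # Mask out padding with -inf so it never wins the element-wise max.
    h = hidden.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return h.max(dim=1).values

def attention_pool(hidden, mask, query):
    # query: (dim,) -- a single learnable vector shared across the corpus.
    scores = hidden @ query                            # (batch, seq_len)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    alpha = F.softmax(scores, dim=1)                   # weights sum to 1
    return (alpha.unsqueeze(-1) * hidden).sum(dim=1)

def init_forget_bias(lstm, value=1.0):
    # PyTorch packs LSTM gate biases as [input | forget | cell | output];
    # fill the forget-gate slice of each bias vector with `value`.
    for name, p in lstm.named_parameters():
        if "bias" in name:
            n = p.size(0) // 4
            p.data[n:2 * n].fill_(value)
```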

Max-attention
We introduce a novel pooling variant called max-attention (MAXATT) to capture inter-word dependencies. It uses the max-pooled hidden representation as the query vector for attention. Formally:

$$q_{\max}^i = \max_t h_t^i, \qquad \alpha_t = \frac{\exp(h_t^\top q_{\max})}{\sum_{t'} \exp(h_{t'}^\top q_{\max})}, \qquad s_{\text{emb}} = \sum_t \alpha_t h_t.$$

It is worth noting that the learnable query vector in Luong attention is the same for the entire corpus, whereas in max-attention each sentence has a unique, locally-informed query. Previous literature extensively uses max-pooling to capture the prominent tokens (or objects) in a sentence (or image). Hence, using the max-pooled representation as a query for attention allows for a second round of aggregation among important hidden representations.
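A sketch of max-attention, reusing `max_pool` and the imports from the previous sketch; as before, the function signature is our own illustrative choice.

```python
def max_attention_pool(hidden, mask):
    # The max-pooled vector serves as a per-sentence query for attention.
    query = max_pool(hidden, mask)                               # (batch, dim)
    scores = torch.bmm(hidden, query.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    alpha = F.softmax(scores, dim=1)
    return (alpha.unsqueeze(-1) * hidden).sum(dim=1)
```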

Transformers
We briefly experiment with transformer architectures (Vaswani et al., 2017; Devlin et al., 2018), and observe that purely attention-based architectures perform poorly on text classification without significant pre-training. Further, the memory footprint for transformers is $O(n^2)$, versus $O(n)$ for LSTMs. Thus, for the long examples used in some of our experiments (∼4000 words), XLNet (Yang et al., 2019) runs out of memory even with a batch size of 1 on a 32GB GPU. We observe that CLS-based text classification with pretrained transformers (such as RoBERTa (Liu et al., 2019)) results in near state-of-the-art performance. Alternate classification techniques using pooled feature representations result in a marginal difference in performance (∼0.2% on IMDb sentiment analysis). Pooling does not benefit transformers, as they do not suffer from the vanishing gradients and positional biases which pooling helps to mitigate in LSTMs (§5, §6). Therefore, we limit the scope of this work to recurrent architectures.

Datasets & Experimental Setup
We experiment with four different text classification tasks: (1) the IMDb dataset (Maas et al., 2011) contains movie reviews and their associated sentiment labels; (2) the Yahoo! Answers dataset (Zhang et al., 2015) comprises 1.4 million question-and-answer pairs, spread across 10 topics, where the task is to predict the topic of the answer using the answer text; (3) Amazon reviews (Ni et al., 2019) contain product reviews from the Amazon website, filtered by their category; we construct a 20-class classification task using these reviews; (4) Yelp (Zhang et al., 2015) is another sentiment polarity classification task.

For these datasets, we only use the text and labels, ignoring any auxiliary information (like title or location). We select subsets of the datasets with sequences having more than 100 words, to better understand the impact of vanishing gradients and positional bias in recurrent architectures. A summary of statistics is presented in Table 1.

In all the experiments, we use a single-layered BiLSTM with a hidden dimension of 256 and an embedding dimension of 100 (initialized with GloVe vectors (Pennington et al., 2014) trained on a 6-billion-word corpus). The sentence embeddings generated by the BiLSTM are passed to a final classification layer to obtain per-class probability distributions. We train our models using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $2 \times 10^{-3}$. The batch size is set to 32 for all the experiments. We train for 20 epochs and select the model with the best validation accuracy. All experiments are repeated over 5 random seeds using a single GPU (Tesla K40); further details to aid reproducibility are in Appendix B.2.
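A minimal sketch of the classifier described above, wiring a BiLSTM to one of the pooling functions from §3. Hyperparameter values follow the text; the module structure is our own, and we read the hidden dimension of 256 as the concatenated bidirectional output (the paper may intend per-direction).

```python
import torch
import torch.nn as nn

class PooledBiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, pool_fn, emb_dim=100, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # init from GloVe in practice
        self.lstm = nn.LSTM(emb_dim, hid // 2, batch_first=True, bidirectional=True)
        self.pool_fn = pool_fn                          # e.g., max_attention_pool
        self.fc = nn.Linear(hid, num_classes)

    def forward(self, tokens, mask):
        hidden, _ = self.lstm(self.embed(tokens))       # (batch, seq_len, hid)
        return self.fc(self.pool_fn(hidden, mask))

# Usage, matching the training setup in the text:
# model = PooledBiLSTMClassifier(25000, 2, max_attention_pool)
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
```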

Gradient Propagation
In this section, we study the flow of gradients in different architectures and training regimes. Pooling techniques used in conjunction with BiLSTMs provide a direct gradient pathway to intermediate hidden states. However, for BiLSTMs without pooling, it is crucial that the parameters for the input, output, and forget gates are appropriately learned so that the loss backpropagates across long input sequences without the gradients vanishing.
Experimental Setup: In order to quantify the extent to which the gradients vanish across different word positions, we compute the gradient of the loss function w.r.t. the hidden state at every word position $t$, and study its $\ell_2$ norm $\lVert \frac{\partial L}{\partial h_t} \rVert$. To aggregate the gradients across multiple training examples (of different lengths), we linearly interpolate the distribution of gradient values for each example to a fixed length between 1 and 100. The gradient values at each (normalized) position are averaged across all the training examples. We plot these values (on a log scale) after training on the first 500 IMDb reviews to study the effect of gradient vanishing at the beginning of training (Figure 2a).
To understand how the distribution of gradients (across word positions) changes with the number of training batches, we compute the ratio of the gradient norm at the middle word to that at the last word: $\lVert \frac{\partial L}{\partial h_{\text{mid}}} \rVert / \lVert \frac{\partial L}{\partial h_{\text{end}}} \rVert$. We call this the vanishing ratio and use it as a measure to quantify the extent of vanishing (lower values indicate more severe vanishing). We plot the vanishing ratio against the number of training batches in Figure 2b.

Results: It is evident from Figure 2a that the gradients vanish significantly for BiLSTM, with $\lVert \frac{\partial L}{\partial h_t} \rVert$ falling to the order of $10^{-6}$ as we approach the middle positions in the sequence. This effect is even more pronounced for BiLSTM$_{\text{LowF}}$, which uses the Xavier initialization (Glorot and Bengio, 2010) for the forget-gate bias. The plot suggests that initializing the gates with best practices (such as setting the forget-gate bias to a high value) reduces the extent of the issue, but the problem still persists. In contrast, none of the pooling techniques face this issue, resulting in an almost straight line.
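The measurement above can be sketched as follows; the interpolation to 100 shared positions follows the text, while the code itself is our own reconstruction.

```python
import numpy as np

def position_grad_norms(loss, hidden):
    # `hidden` is the (batch, seq_len, dim) BiLSTM output. Since it is a
    # non-leaf tensor, call hidden.retain_grad() before backward() so that
    # hidden.grad gets populated.
    loss.backward(retain_graph=True)
    return hidden.grad.norm(dim=-1).cpu().numpy()  # (batch, seq_len) L2 norms

def interpolate_to_100(norms):
    # Map one example's per-position norms onto 100 normalized positions,
    # so examples of different lengths can be averaged.
    src = np.linspace(0.0, 1.0, num=len(norms))
    dst = np.linspace(0.0, 1.0, num=100)
    return np.interp(dst, src, norms)

def vanishing_ratio(norms_100):
    # ||dL/dh_mid|| / ||dL/dh_end||: lower values mean more severe vanishing.
    return norms_100[50] / norms_100[-1]
```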
Additionally, from Figure 2b we note that the problem of vanishing gradients is most pronounced at the beginning of training, when the gates are still untrained. The problem continues to persist, albeit to a lesser degree, until later in the training process. This particularly limits the performance of BiLSTM in resource-constrained settings with fewer training examples. For instance, in the 1K training data setting, BiLSTM has an extremely low vanishing ratio (∼$10^{-3}$) at the 200th training batch (denoted by the red vertical line in the plot), when it achieves nearly 100% accuracy on the training data. Consequently, the BiLSTM model (prematurely) achieves high training accuracy based solely on the first and last few words, well before the gates can learn to allow the gradients to pass through (and mitigate the vanishing gradients problem). Further reduction in the vanishing ratio is unable to improve validation accuracy, due to saturation in training. To examine this more closely, we tabulate the vanishing ratios at the point where the model reaches 95% accuracy on the training data in Table 2. A low value at this point indicates that the gradients are still skewed towards the ends, even as the model begins to overfit on the training data. The vanishing ratio is low for BiLSTM, especially in low-data settings. This results in a 13-14% lower test accuracy in the 1K data setting, compared to other pooling techniques. We conclude that the phenomenon of vanishing gradients results in poorer performance of BiLSTMs. Encouragingly, pooling methods do not exhibit low vanishing ratios, right from the beginning of training, leading to the performance gains demonstrated in the next section.

Positional Biases
Analyzing the gradient propagation in BiLSTMs suggests that standard recurrent networks are biased towards the end tokens, as the overall contribution of distant hidden states to the gradient of the loss is extremely low. This implies that the weights of the various parameters in an LSTM cell (all cells of an LSTM have tied weights) are hardly influenced by the middle words of the sentence. In this light, we aim to evaluate the positional biases of recurrent architectures with different pooling techniques.

Evaluating Natural Positional Biases
Can organically trained recurrent models skip over unimportant words on either end of the sentence?
Experimental Setup: We append randomly chosen Wikipedia sentences to the input examples of two text classification tasks, based on IMDb and Amazon Reviews, only at test time, keeping the training datasets unchanged. Wikipedia sentences are declarative statements of fact, and should not influence the sentiment of movie reviews; given the diverse nature of the Wikipedia sentences, it is also unlikely that they would interfere with the few categories (i.e. the labels) of Amazon product reviews. Therefore, it is not unreasonable to expect the models to be robust to such random noise, even though they were not trained for it. We perform this experiment in three configurations, such that the original input is preserved on the (a) left, (b) middle, and (c) right of the modified input. For these configurations, we vary the length of the added Wikipedia text in proportion to the length of the original sentence. Figure 4 illustrates the setup when 66% of the total words come from Wikipedia.

Results: The effect of adding random words can be seen in Figure 3. We draw two conclusions: (1) Adding random sentences on both ends is more detrimental to the performance of BiLSTM than appending to only one end. 8 This corroborates our previous finding that these models largely rely on information at the ends for their predictions.
(2) We speculate that paying equal importance to all hidden states prevents MEANPOOL from effectively distilling out important information, making it more susceptible to random noise. On the contrary, max-pooling and attention-based architectures (MAXPOOL, ATT, and MAXATT) are significantly more robust in all settings. This indicates that max-pooling and attention help the model account for salient words and ignore unrelated ones, regardless of their position. Lastly, we provide concurring results on the Amazon dataset, and examine the robustness of different models given less training data, in Appendix D.
8 One practical implication of this finding is that adversaries can easily attack middle portions of the input text.
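The perturbation used in this experiment (and its training-time variant in the next subsection) can be sketched as follows. Here `wiki_sentences`, a pool of tokenized Wikipedia sentences, and the function name are illustrative assumptions.

```python
import random

def append_wiki(tokens, wiki_sentences, keep="mid", wiki_frac=0.66):
    # Number of noise tokens so that `wiki_frac` of the *total* words come
    # from Wikipedia (e.g., 0.66 as in Figure 4).
    n_wiki = int(len(tokens) * wiki_frac / (1 - wiki_frac))
    noise = []
    while len(noise) < n_wiki:
        noise.extend(random.choice(wiki_sentences))
    noise = noise[:n_wiki]
    if keep == "left":                  # original text first, noise after
        return tokens + noise
    if keep == "right":                 # noise first, original text last
        return noise + tokens
    half = n_wiki // 2                  # noise on both sides, original in middle
    return noise[:half] + tokens + noise[half:]
```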

Training to Skip Unimportant Words
How well can different models be trained to skip unrelated words?
Experimental setup: We create new training datasets by appending random Wikipedia sentences to the original input examples of the datasets described in §4, such that 66% of the text of each new training example comes from Wikipedia sentences (see Figure 4). We experiment with a varying number of training examples; however, the test set remains the same for fair comparison.

Results
The results are presented in Table 3. First, we note that BiLSTM suffers severely when random sentences are appended at both ends. In fact, the accuracy of BiLSTM in the mid setting drops to 50%, 12%, 5%, and 50% on the IMDb, Yahoo, Amazon, and Yelp datasets respectively, which is equal to the majority-class baseline. However, the performance drop (while large) is not as drastic when sentences are added to only one end of the text. We speculate that this is because a BiLSTM is composed of a forward and a backward LSTM: when random sentences are appended to the left, the backward LSTM can still capture information about the original sentence on the right, and vice versa.
Second, while the accuracies of all pooling techniques begin to converge given sufficient data, the differences in the low-training-data regime are substantial. Further, the poor performance of BiLSTM re-validates the findings of §5, where we hypothesize that the model's training saturates before the gradients can learn to reach the middle tokens.

Fine-grained Positional Biases
How does the position of a word affect its contribution to a model's prediction?

Experimental Setup: We aim to achieve a fine-grained understanding of model biases w.r.t. each word position, as opposed to evaluating them at a coarse level (between left, mid, and right) as in the previous experiment (§6.2). To this end, we define Normalized Word Importance (NWI), a metric to determine the per-position importance of words as attributed by the model. It measures the importance of a particular word (or a set of words) on a model's prediction by calculating the change in the model's confidence in the prediction after replacing the word with UNK (Figure 5). The evaluation is further extended by removing a sequence of k consecutive words to obtain a smoother metric. The metric is adapted, with some differences, from past efforts to assign word importance (Khandelwal et al., 2018; Verwimp et al., 2018; Jain and Wallace, 2019). We provide a complete description of the algorithm to compute NWI in Appendix F, along with further evaluation on the IMDb and Amazon datasets.

Results:
The results from this experiment on the Yahoo dataset are presented in Figure 6. The NWI for architectures with pooling indicates no bias w.r.t. word position; for BiLSTM, however, there exists a clear bias towards the extreme words on either end (cf. Figure 6a). The word importance plots in Figures 6b and 6c demonstrate that pooling learns to identify the words that are important for sentence classification significantly better than BiLSTM: there is a clear peak in the middle in the 'mid' setting, and on the left in the 'left' setting, for all the pooling architectures. BiLSTM is unable to respond to middle words in Figure 6c. However, it shows reasonably high importance for the left tokens in Figure 6b, which is consistent with its good performance in the 'left' experimental setting in Table 3. Results for the NWI evaluation on all datasets and modified settings (left, mid, and right) are available in Appendix F, and are consistent with the representative graphs in Figure 6. We also perform this analysis on models trained on datasets with shorter sentences. Interestingly, the NWI analysis for the Yahoo short dataset in Figure 6d shows that while BiLSTM can better respond to middle words for shorter sentences, it still remains heavily biased towards the ends. We detail these findings in Appendix F.1.

Discussion & Conclusion
Through detailed analysis, we identify why and when pooled representations are beneficial in RNNs. While some of the results pertaining to gradient propagation in pooling-based RNNs may be obvious in hindsight, we note that this is the first work to systematically and explicitly analyze the phenomenon.
1. We attribute the performance benefits of pooling techniques to their learning ability (pooling mitigates the problem of vanishing gradients) and positional invariance (pooling eliminates positional biases). Our findings suggest that pooling offers large gains when training examples are few and input sequences are long, or when salient words lie in the middle of the sequence.
2. In §5, we observe that gradients in BiLSTM vanish in the initial iterations and recover only slowly during further training. We link this observation with training saturation to explain why BiLSTMs fail in low-resource setups while pooled architectures do not.
3. We show that BiLSTMs suffer from positional biases even when sentence lengths are as short as 30 words (Figure 6d).
4. We note that pooling makes models significantly more robust to insertions of random words on either end of the input regardless of the amount of training data (Figures 3, 8, 9).
5. Lastly, we introduce a novel pooling technique (max-attention) that combines the benefits of max-pooling and attention and achieves superior performance on 80% of our tasks.

Most of our insights are derived for sequence classification tasks using RNNs. While our proposed pooling method and analyses are broadly applicable, evaluating their impact on other tasks and architectures remains future work.

A Equations for Recurrent Networks
In this section, we provide a mathematical formulation of the equations governing LSTMs and basic RNNs.

A.1 Basic RNN
Recurrent Neural Networks take an input sequence $x_t$ and pass it sequentially through a network of hidden states, where each hidden state leads to the next. Mathematically, this is given by:

$$h_t = \sigma(U h_{t-1} + W x_t), \qquad y_t = V h_t,$$

where $x_t$ refers to the input at time step $t$; $W$, $U$, and $V$ are the weights of the RNN cell; and $\sigma$ is a non-linearity of choice.

A.2 LSTM
The forward propagation of information in a basic LSTM is governed by the following equations:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t * c_{t-1} + i_t * g_t$$
$$h_t = o_t * \tanh(c_t)$$

where at time $t$, $h_t$ is the hidden state, $c_t$ is the cell state, $x_t$ is the input, and $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates, respectively. $\sigma$ is the sigmoid function, and $*$ is the Hadamard product.

A.3 MEANPOOL
For a mean-pooled LSTM, while the forward propagation remains the same as for the BiLSTM, the output embedding is given by:

$$s_{\text{emb}}^i = \frac{1}{n} \sum_{t=1}^{n} h_t^i,$$

where $h_t^i$ represents the $i$-th dimension of the hidden state at time step $t$, and $s_{\text{emb}}$ represents the final output embedding returned by the recurrent structure. This implies that during backpropagation, each hidden state has a direct influence on the loss:

$$\frac{\partial s_{\text{emb}}^i}{\partial h_t^i} = \frac{1}{n}.$$

B.1 Datasets

Amazon Reviews The Amazon Reviews Dataset (Ni et al., 2019) includes reviews (ratings, text, helpfulness votes) and product metadata (descriptions, category, etc.) pertaining to products on the Amazon website. We extract the product category and review text corresponding to 2500 reviews for each of 20 classes. In the standard setting, we ensure that all reviews have lengths between 100 and 500 words.
IMDb The IMDb Movie Reviews Dataset (Maas et al., 2011) is a popular binary sentiment classification task. We take a subset of 20000 reviews that have length greater than 100 words for the purposes of experimentation in this paper.
Yahoo Yahoo! Answers (Zhang et al., 2015) has over 1,400,000 question-and-answer pairs spread across 10 classes. We do not use information such as the question title, date, and location for the purpose of classification. As in the case of Amazon reviews, in the standard setting we ensure that all answers have lengths between 100 and 1000 words, while in the short-sentence setting, the maximum answer length in the filtered dataset is 100 words.

Yelp Reviews Yelp Reviews (Zhang et al., 2015) is a sentiment analysis task with 5 rating classes. For the purposes of experimentation, we create a subset filtered to contain sentences in the range of 100 to 1000 tokens. Further, all reviews with a score of 4 or 5 are marked positive, while those with a score of 1 or 2 are marked negative for the binary classification task.
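The Yelp filtering and labeling step described above can be sketched as follows, assuming reviews are available as (text, score) pairs; the field layout and function name are illustrative.

```python
def filter_and_label_yelp(reviews, min_len=100, max_len=1000):
    out = []
    for text, score in reviews:
        n_words = len(text.split())
        if not (min_len <= n_words <= max_len):
            continue                     # keep only 100-1000 token reviews
        if score in (4, 5):
            out.append((text, 1))        # positive
        elif score in (1, 2):
            out.append((text, 0))        # negative; a score of 3 is dropped
    return out
```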

B.2 Reproducibility
Computing Infrastructure For all the experiments described in the paper, we use Tesla K40 GPUs with a maximum of 10GB of GPU memory; all experiments can be performed on a single GPU. The brief experimentation on transformer models was performed using Tesla V100s, which support 32 GB of GPU memory.

Run Time
The average run time per epoch varies linearly with the amount of training data and average sentence length. For the model with 25K training examples in the standard setting (sentences with more than 100 words, and no Wikipedia words), the average training time for one epoch is under 2 minutes. Further, across all pooling techniques, the run time varies only marginally.

Number of Parameters
The number of parameters in the model varies with the vocabulary size. We cap the maximum vocabulary size at 25,000 words; however, in the 1K training data setting, the actual vocabulary size is smaller (depending on the training data). The majority of the model's parameters are accounted for by the embedding matrix, of size (vocabulary size) × (embedding size). The number of parameters for the main LSTM model is around 70,000, with the ATT model having a few more parameters than other methods due to its learnable query vector.

Validation Scores
We provide validation results in Table 2 for the standard setting. However, in the interest of brevity, we only report test scores in all subsequent tables. Note that we always select the model with the best validation accuracy during the training process (among all epochs).

Evaluation Metric
The evaluation metric is the model's accuracy on the test set, reported as an average over 5 different seeds. All the classes are nearly balanced in the chosen datasets; hence, accuracy serves as a reliable indicator of performance.
Hyperparameter Search An explicit hyperparameter search is not performed for each model in each training setting over all seeds, since the purpose of the paper is not to beat the state of the art, but rather to analyze the effect of pooling in recurrent architectures. We do note that, in a manual search over learning rates of $\{1 \times 10^{-3}, 2 \times 10^{-3}, 5 \times 10^{-3}\}$ on the IMDb and Yahoo datasets, we find that for all the pooling and non-pooling methods discussed, models trained with a learning rate of $2 \times 10^{-3}$ show the best validation accuracy. Thus, we use this value for all the reported results. However, we do perform a hyperparameter search for the best regularization parameters, as described in Appendix E.3. We keep the embedding dimension and hidden dimension fixed for all experiments.

C Gradient Propagation
The plots of the change in vanishing ratios for ATT, MAXPOOL, and MEANPOOL are shown in Figure 7. This completes the representative analysis for BiLSTM and MAXATT shown in Figure 2. It can be seen that for all the different pooling types discussed in this paper, the vanishing ratios are small right from the beginning of training. This motivates future research to formally analyze and discover other learning advantages (apart from vanishing ratios) that distinguish the performance of one pooling technique from another.

D Evaluating Natural Positional Biases
In line with our results in §6.1, we further evaluate models trained on the Amazon dataset in the same settings to re-validate our results. The effect of appending random Wikipedia sentences to input examples for models trained on the Amazon dataset can be seen in Figure 8. We use the model trained on 10K examples for this experiment. The graphs show findings similar to Figure 3, and further support the hypothesis that BiLSTM places strong emphasis on the extreme words when trained on standard datasets, which is why its performance deteriorates significantly when random Wikipedia sentences are appended on both ends.
In the 1K and 5K settings, BiLSTM is unresponsive to any appended tokens as long as the 'left' text is preserved, though this bias dilutes with more training samples; given sufficient data (more than 10K unique examples), appending random words on both ends is more detrimental than appending at only one end. This indicates a learning bias, where the BiLSTM places greater emphasis on the outputs of one chain of the bidirectional LSTM. It is interesting to note that on reducing the training data, this bias increases significantly in the case of the IMDb dataset as well.
We hypothesize that this phenomenon may be an artifact of the training process itself: the model is able to find 'easily identifiable' sentiment signals at the beginning of reviews during training (speculatively, due to the added effects of padding to the right). Therefore, given less training data, BiLSTMs prematurely learn to use features from only one of the two LSTM chains, and (in this case) the left-to-right chain dominates the final prediction. We confirm from Figure 9 that with a decrease in training data (such as in the 1K IMDb setting), the bias towards one end substantially increases; that is, BiLSTM is extremely insensitive to random sentence addition as long as the left end is preserved.
Practical Implications We observe that MEANPOOL and BiLSTM can be susceptible to changes in the test-time data distribution, which calls into question the use of such models in real-world settings. We speculate that paying equal importance to all hidden states prevents MEANPOOL from distilling out important information effectively, while the preceding discussion on the effect of training data size highlights the possible cause of this behavior in BiLSTM. We observe that other pooling methods like MAXATT are able to circumvent this issue, as they are only mildly affected by the added Wikipedia sentences.
E Training to Skip Unimportant Words

In this appendix, we (a) provide a full evaluation of the experiment in §6.2 across all classification and dataset size settings (including those which were skipped in the main paper for brevity); and (b) evaluate the same experiment in a setting where input examples are shorter in length.

E.1 Full Evaluation
For completeness, we perform the evaluation in §6.2 on each of the {1K, 2K, 5K, 10K, 25K} dataset size settings, and also report results when Wikipedia words are appended on the right, preserving the original input on the left. We report results for the Yahoo and Amazon datasets in Table 5, and for the IMDb and Yelp Reviews datasets in Table 6. Note that the advantages of MAXATT over other pooling and non-pooling techniques increase significantly in the three Wikipedia settings in each of the tables. This suggests that MAXATT performs better in more challenging scenarios where the important signals are hidden within the input data. Further, the performance advantages of MAXATT are larger when less training data is available.

E.2 Evaluation on Shorter Sequences

For shorter sequences, we reuse two of our text classification tasks: (1) Yahoo! Answers; and (2) Amazon Reviews. As in the long-sentence setting in the main paper, we use only the text and labels, ignoring any auxiliary information (like title or location). We select subsets of the datasets with sequences having a length (number of space-separated words) of less than 100. A summary of statistics with respect to sentence length and corpus size is given in Table 7.
In this setting, we observe that BiLSTM performs significantly better on shorter sequences than on long sequences. For instance, in the case of the Amazon dataset (Mid), under the 25K data setting, the classification accuracy increases from 7.8% in Table 5 to 51.5% in Table 8, a significant improvement over merely matching the majority-class baseline. We note that most of the learning issues of BiLSTM in the long-sentence setting are largely absent when sentence lengths are short, with BiLSTM even emerging as the best-performing model in a few cases. This corroborates the effect of gradients vanishing over longer time steps.

E.3 On using regularization
For the experiments in this work, we do not regularize the trained LSTMs. This has two analytical advantages: (1) we can examine the benefits of pooling without having to account for the effect of regularization; and (2) training to 100% accuracy acts as an indicator of training the models adequately. However, for validation, we also performed our experiments on the IMDb dataset with two different regularization schemes, following best practices from previous work (Merity et al., 2017). We use DropConnect (Wan et al., 2013) and weight decay for regularization of all the models. We observe that regularization consistently improves the final accuracies by 1-2% across the board. However, even after sustained training (up to 50 epochs), BiLSTM still suffers from the learning issues outlined in the paper. The goal of this paper is not to study the effect of various regularization schemes, but merely to understand the effect of pooling in improving the performance of BiLSTM.

F Fine-grained Positional Biases
We detail the method for calculating the Normalized Word Importance (NWI) score in Algorithm 1.
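Algorithm 1 is not reproduced here; below is a hedged sketch of the computation as we understand it from §6.3: replace a window of k tokens with UNK, record the drop in the predicted class's confidence at each position, and normalize across positions. The function name and the exact normalization are our assumptions.

```python
import numpy as np
import torch

@torch.no_grad()
def word_importance(model, tokens, mask, unk_id, k=1):
    # Confidence of the original prediction.
    probs = torch.softmax(model(tokens, mask), dim=-1)
    pred = probs.argmax(dim=-1)
    base = probs.gather(-1, pred.unsqueeze(-1)).squeeze(-1)

    scores = []
    for t in range(tokens.size(1) - k + 1):
        corrupted = tokens.clone()
        corrupted[:, t:t + k] = unk_id                       # UNK out a window
        p = torch.softmax(model(corrupted, mask), dim=-1)
        conf = p.gather(-1, pred.unsqueeze(-1)).squeeze(-1)
        scores.append((base - conf).mean().item())           # confidence drop

    scores = np.array(scores)
    return scores / (np.abs(scores).sum() + 1e-9)            # normalized importance
```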