MODE-LSTM: A Parameter-efficient Recurrent Network with Multi-Scale for Sentence Classification

The central problem of sentence classification is to extract multi-scale n-gram features for understanding the semantic meaning of sentences. Most existing models tackle this problem by stacking CNN and RNN models, which easily leads to feature redundancy and over-fitting because of the relatively limited datasets. In this paper, we propose a simple yet effective model called Multi-scale Orthogonal inDependEnt LSTM (MODE-LSTM), which not only uses its parameters effectively and generalizes well, but also considers multi-scale n-gram features. We disentangle the hidden state of the LSTM into several independently updated small hidden states and apply an orthogonal constraint on their recurrent matrices. We then equip this structure with sliding windows of different sizes for extracting multi-scale n-gram features. Extensive experiments demonstrate that our model achieves better or competitive performance against state-of-the-art baselines on eight benchmark datasets. We also combine our model with BERT to further boost the generalization performance.


Introduction
Sentence classification (SC) is a fundamental and traditional task in natural language processing (NLP), which is widely used in many subareas, such as sentiment analysis (Wang et al., 2016a) and question classification (Shi et al., 2016). The central problem of SC is to understand the semantic meaning of a sentence via key phrases located at different positions (Wang et al., 2015).
CNNs excel at extracting n-gram features of sentences through a convolution operation followed by non-linear and pooling layers and have achieved impressive results in sentence classification (Kalchbrenner et al., 2014; Kim, 2014). However, the convolution operation itself is linear, which may not be sufficient to model the non-consecutive dependencies of phrases (Lei et al., 2015) and may lose sequential information (Madasu and Anvesh Rao, 2019). As shown in Figure 1, the weighted sum of the phrase "not almost as bad" does not capture the non-consecutive dependency of "not bad" very well and ignores the sequential information. On the other hand, LSTMs (Hochreiter and Schmidhuber, 1997) are suitable for encoding structure-dependent semantics by storing previous word representations and preserving sequential information. However, LSTMs are still biased toward later words and ignore earlier ones (Yin et al., 2017), so some current methods (Lai et al., 2015; Wang et al., 2016b; Song et al., 2018) combine a CNN and an LSTM by stacking. However, merely stacking multiple layers can easily lead to feature redundancy and overfitting, because only relatively small training sets are available for SC tasks (Yin and Schütze, 2015; Guo et al., 2019). Hence, some researchers (Zhao et al., 2018a; Madasu and Anvesh Rao, 2019) additionally attach an over-parameterized attention mechanism to enhance salient features and remove redundancy, but overfitting still occurs due to the increase in parameters on limited datasets.
A flexible combination method is to model non-linear mappings and non-consecutive dependencies by replacing the convolution operation with a tensor product (Lei et al., 2015) or an RNN unit (Shi et al., 2016; Wang, 2018). However, these methods only consider fixed-size n-gram features. This has an apparent drawback: a sentence may contain variable-size phrases (n-grams), and, as shown in Figure 1, we need to extract variable-size n-gram features to form a better sentence representation.
The above observation motivates us to explore a better structure for sentence classification, balancing capability and complexity. In this paper, we propose a lightweight model called Multi-scale Orthogonal inDependEnt LSTM (MODE-LSTM), which uses few parameters effectively, generalizes well, and considers n-gram features of different scales. First, inspired by (Kuchaiev and Ginsburg, 2017), we disentangle the hidden state of the LSTM into several independently updated small hidden states, which reduces the number of parameters. Furthermore, an orthogonal constraint is applied to the recurrent transition matrices of the small hidden states to improve the diversity of features. We call this structure Orthogonal InDependEnt LSTM (ODE-LSTM). Then we use ODE-LSTM within a local window for extracting n-gram features instead of simply using a weighted sum as in convolution. Specifically, we introduce a Triple-S (Slide-Split-Stack) operation that splits a sentence into multiple sub-sentences by a sliding window and stacks them together. These sub-sentences are regarded as a mini-batch, which can be processed in parallel by a shared ODE-LSTM. We take the last hidden state of the ODE-LSTM as the n-gram feature for each sub-sentence. Furthermore, in order to capture variable-size phrases in sentences, we use windows of different scales with differently initialized ODE-LSTMs to extract features of multi-scale phrases. We refer to this structure as a multi-scale ODE-LSTM (MODE-LSTM).
MODE-LSTM can extract multi-scale n-gram features like a CNN, while retaining the non-linear ability and long-term dependency of LSTMs, so it has stronger modeling ability but with fewer parameters than other methods. MODE-LSTM is analogous to a 1D CNN using multiple filters with different window sizes, but it uses recurrent transitions instead of the convolution operation. We conduct experiments on eight sentence classification datasets. The experimental results show that our proposed model achieves comparable or better results on these datasets with fewer parameters than other models. In addition, we further improve our model's generalization performance by integrating the BERT representation of the sentence.
Related Work

CNN-based models Kalchbrenner et al. (2014) propose a deep CNN model with a dynamic k-max pooling operation for the semantic modeling of sentences. However, a simple one-layer CNN with fine-tuned word embeddings also achieves remarkable results (Kim, 2014). Some researchers also use multiple word embeddings as inputs to further improve performance (Yin and Schütze, 2015; Zhang et al., 2016b). Xiao et al. (2018) propose a transformable CNN that can adaptively adjust the scope of the convolution filters. Although the above CNN-based methods perform excellently in extracting local semantic features, the linear convolution operation limits their ability to model non-consecutive dependencies and sequential information.
RNN-based models RNNs are suitable for processing text sequences and modeling long-term dependencies, so they are also used for sentence modeling. Recently, some work incorporates residual connections (Wang and Tian, 2016) or dense connections (Ding et al., 2018) into recurrent structures to avoid vanishing gradients. Dangovski et al. (2019) introduce a rotational unit of memory into RNNs for recalling long-distance information. Other work proposes an HS-LSTM that can automatically discover structured representations in a sentence via reinforcement learning. However, these RNN-based models still display the bias problem where later words are more dominant than earlier words (Yin et al., 2017).
Hybrid models A natural strategy is to combine the advantages of CNNs and RNNs by stacking. Lai et al. (2015) equip an RNN with max-pooling to tackle the bias problem of RNNs. Zhou et al. (2015) use 1D convolutions to extract phrase features followed by an LSTM to obtain the sentence representation, and some subsequent work (Wang et al., 2016a,b; Lee and Dernoncourt, 2016) is similar. Alternatively, other work first models long-term dependencies using an LSTM and then applies a CNN to extract task-specific features. However, these methods simply stack multiple layers, resulting in feature redundancy and overfitting on limited datasets (Yin and Schütze, 2015; Guo et al., 2019). Some researchers have introduced attention mechanisms (Er et al., 2016; Lin et al., 2017; Zhao et al., 2018a) to enhance salient features, but this leads to a large number of parameters that overfit on small-scale datasets. A more flexible way is to combine them by replacing the convolution operation with a tensor product (Lei et al., 2015) or an RNN unit (Shi et al., 2016), which can capture non-linear n-gram features directly. Nevertheless, these methods currently only consider fixed-scale n-gram features.

Other models Some work (Tai et al., 2015; Liu et al., 2017) has used tree-LSTMs based on parse trees for sentiment analysis, but the performance depends heavily on the quality of the parser, and the parsing process itself is time-consuming. Others (Gong et al., 2018; Zhao et al., 2018b; Zheng et al., 2019) have tried using capsule networks with dynamic routing for encoding text representations.
The most relevant work to our approach is the DRNN (Wang, 2018), which also uses RNNs locally to learn semantic features. The differences between their approach and ours are: (1) the DRNN uses GRUs as the recurrent unit while we use the ODE-LSTM, which has better generalization performance; (2) we introduce the Triple-S operation to process all sub-sentences in parallel instead of in sequence, which is faster than the DRNN; (3) we consider multi-scale n-gram features in sentences, while the DRNN only considers a fixed scale.

Proposed Method
In the following, we start with our most straightforward model, a parameter-efficient LSTM structure that avoids over-fitting and achieves better generalization performance. This structure is then equipped with local sliding windows to learn key phrase features of the sentence, which is the central problem for understanding sentence semantics (Wang et al., 2015). Finally, we extend our method to capture multi-scale features of the sentence by using windows of different sizes in parallel.

Orthogonal InDependEnt LSTM (ODE-LSTM)
Given a sentence of $T$ input vectors $\{x_1, \cdots, x_T\}$, where $x_t \in \mathbb{R}^{d_0}$ and $d_0$ is the dimension of the input embeddings, the hidden state $h_t \in \mathbb{R}^d$ of an LSTM cell can be expressed as follows:

$$
\begin{aligned}
[f_t; i_t; o_t; g_t] &= W h_{t-1} + U x_t + b, \\
c_t &= \sigma(f_t) \odot c_{t-1} + \sigma(i_t) \odot \tanh(g_t), \\
h_t &= \sigma(o_t) \odot \tanh(c_t),
\end{aligned}
$$

where $f_t$, $i_t$, $o_t$ are the pre-activations of the forget, input, and output gates respectively, and $g_t$ is the candidate cell state. $W \in \mathbb{R}^{4d \times d}$, $U \in \mathbb{R}^{4d \times d_0}$, and $b \in \mathbb{R}^{4d}$ are the learnable parameters, $\sigma$ denotes the sigmoid function, and $\odot$ denotes element-wise multiplication. The recurrent matrix $W$ alone contributes $4d^2$ parameters, which easily leads to over-fitting on sentence classification tasks where data are relatively limited.
To reduce the number of parameters, inspired by (Kuchaiev and Ginsburg, 2017), we disentangle the hidden state $h_t$ of the LSTM into $K$ independently updated small hidden states. Specifically, the hidden state at time step $t$ is composed of $K$ small hidden states:

$$h_t = [h_t^1; \cdots; h_t^K], \quad h_t^k \in \mathbb{R}^{p}, \quad d = Kp,$$

where each small hidden state $h_t^k$ is independently updated by an individual recurrent matrix $W_k \in \mathbb{R}^{4p \times p}$, and the $K$ matrices stack into $\mathcal{W} \in \mathbb{R}^{K \times 4p \times p}$; the updated small states are then merged via concatenation to constitute the hidden state $h_t$. The update equation of the hidden state is defined as:

$$[f_t; i_t; o_t; g_t] = \mathcal{W} \otimes h_{t-1} + U x_t + b,$$

where $\otimes$ is the tensor-dot operation, denoting the product of two tensors along the $K$-axis, i.e., $(\mathcal{W} \otimes h_{t-1})^k = W_k h_{t-1}^k$; the gate and cell updates then proceed as in the standard LSTM within each small state. Note that the standard LSTM is a special case of ODE-LSTM when $K = 1$.
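To make the update concrete, here is a minimal PyTorch sketch of a single ODE-LSTM step. The function name, tensor shapes, and the shared input projection are our own illustrative choices, not the paper's official implementation; the tensor-dot along the K-axis is realized with `einsum`.

```python
import torch

def ode_lstm_step(x_t, h_prev, c_prev, W, U, b, K, p):
    """One ODE-LSTM step with K independently updated small hidden states.

    x_t: (batch, d0) inputs; h_prev, c_prev: (batch, K, p) small states.
    W: (K, 4p, p) stacked recurrent matrices; U: (4*K*p, d0); b: (4*K*p,).
    """
    batch = x_t.shape[0]
    # Tensor-dot along the K-axis: each small state sees only its own W_k.
    rec = torch.einsum('kqp,bkp->bkq', W, h_prev)        # (batch, K, 4p)
    inp = (x_t @ U.T + b).view(batch, K, 4 * p)          # shared input projection
    f, i, o, g = (rec + inp).chunk(4, dim=-1)            # each (batch, K, p)
    # Standard LSTM gating, applied blockwise to each small state.
    c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
    h_t = torch.sigmoid(o) * torch.tanh(c_t)
    return h_t, c_t
```

Concatenating the K small states of `h_t` recovers the full d-dimensional hidden state, and setting K = 1 reduces the step to a plain LSTM cell.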
The updated hidden state h t may be redundant if all hidden states provide similar features. To avoid this, we introduce a penalization loss that orthogonally constrains W to explicitly encourage diversity among hidden states, inspired by (Lin et al., 2017).
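The excerpt does not spell out the penalization term; one common choice following Lin et al. (2017), which we sketch here as an assumption, is to push each recurrent matrix toward having orthonormal columns:

```python
import torch

def orthogonal_penalty(W):
    """Sum of ||W_k^T W_k - I||_F^2 over the K recurrent matrices.

    W: (K, 4p, p) stacked recurrent matrices. Driving each Gram matrix
    toward the identity encourages the K small hidden states to extract
    diverse, non-redundant features.
    """
    K, four_p, p = W.shape
    gram = torch.einsum('kqp,kqr->kpr', W, W)   # W_k^T W_k, shape (K, p, p)
    eye = torch.eye(p).expand(K, p, p)
    return ((gram - eye) ** 2).sum()
```

The penalty is zero exactly when every W_k has orthonormal columns, and it is added to the classification loss with a balancing weight (lambda in the objective below).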
With the same hidden state size d as the LSTM, ODE-LSTM reduces the number of parameters by 4d(d − p): the smaller p is, the greater the reduction. Because of the disentanglement of hidden states, each small hidden state can focus on a different aspect of semantics, yielding better generalization performance. Figure 2(d) shows the comparison between ODE-LSTM and LSTM.
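The parameter arithmetic can be checked directly. This small helper (our own, for illustration) counts only the recurrent matrices, which is where the savings come from:

```python
def recurrent_param_counts(d, K):
    """Recurrent parameter counts for an LSTM vs. an ODE-LSTM with d = K*p."""
    p = d // K
    lstm = 4 * d * d          # one 4d x d recurrent matrix
    ode = K * 4 * p * p       # K small 4p x p matrices = 4*d*p in total
    return lstm, ode, lstm - ode  # the reduction equals 4d(d - p)

# e.g., d = 100, K = 2 -> p = 50; reduction = 4*100*(100-50) = 20000
```

With d = 100 and K = 2, the LSTM's 40,000 recurrent parameters shrink to 20,000, exactly the 4d(d − p) reduction stated above.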

Equipping ODE-LSTM with Sliding Window
The core of the SC task is to understand the semantics of a sentence, which are determined by key words and variable-size phrases. Although a CNN can capture n-grams, the linear convolution operation is insufficient to model the sequential information and non-consecutive dependencies of sentences. Our ODE-LSTM can maintain word order and control what information is preserved or forgotten through its gates, allowing it to model non-consecutive dependencies. Taking the phrase "not almost as bad" as an example, the gates can selectively retain the representations of "not" and "bad" while decaying the representations of "almost" and "as", allowing the model to perceive the relation "not bad".
Hence, we equip ODE-LSTM with a sliding window for extracting n-gram features, which means that the recurrent transition of ODE-LSTM is only performed in a local window of size S sliding along the sentence, as illustrated on the left of Figure 2(b). S is a hyperparameter. For each target position t, ODE-LSTM sequentially processes the S consecutive words in the range (t − S + 1, t) of the sentence and generates the corresponding hidden states. The last hidden state output by the ODE-LSTM is used as the n-gram feature of the target position:

$$h_t = \text{ODE-LSTM}(x_{t-S+1}, \cdots, x_t). \quad (8)$$

For convenience, we reshape $h_t \in \mathbb{R}^{K \times p}$ to a vector of d dimensions. Meanwhile, we pad (S − 1) zeros before the start position of the sentence to maintain a consistent window size at all positions. This local scheme is analogous to the DRNN (Wang, 2018), but the DRNN processes all windows sequentially, equivalent to processing a sentence of length S × T in order, which is highly time-consuming. However, we observe that all windows are independent of each other, so they can be processed in parallel on a GPU, which greatly improves computational efficiency.
Correspondingly, we introduce a Triple-S (Slide-Split-Stack) operation to compose all the windows, as shown in Figure 2(b). First we split a sentence into multiple sub-sentences using a sliding window of size S, and then stack them together to form a mini-batch $B \in \mathbb{R}^{T \times S \times d_0}$. The mini-batch $B$ is fed into an ODE-LSTM, producing the n-gram feature matrix $H \in \mathbb{R}^{T \times d}$, as shown in Figure 2(c):

$$H = [h_1; h_2; \cdots; h_T],$$

where $h_t$ is calculated by equation (8), corresponding to the n-gram feature at the t-th position. In this way, the number of recurrent steps of the ODE-LSTM is determined by S rather than the sentence length T, so the time complexity is much lower than that of the DRNN.
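The Slide-Split-Stack step can be sketched in a few lines of PyTorch. This is our own minimal reconstruction (the paper's implementation details are not given in the excerpt); it relies on `Tensor.unfold` to produce all T windows at once:

```python
import torch
import torch.nn.functional as F

def triple_s(x, S):
    """Slide-Split-Stack: turn a sentence into T sub-sentences of length S.

    x: (T, d0) word embeddings. Returns (T, S, d0): row t holds the words
    at positions (t - S + 1, ..., t), left-padded with zero vectors so the
    window size is consistent at every position.
    """
    T, d0 = x.shape
    x = F.pad(x, (0, 0, S - 1, 0))               # prepend S-1 zero vectors
    windows = x.unfold(0, S, 1)                  # (T, d0, S) sliding windows
    return windows.permute(0, 2, 1).contiguous() # (T, S, d0) mini-batch B
```

The resulting tensor is exactly the mini-batch B fed to a shared ODE-LSTM, so all T windows run in parallel instead of sequentially as in the DRNN.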

Multi-Scale ODE-LSTM (MODE-LSTM)
Sentence phrases have multiple granularities, i.e., n-gram features at different scales, yet the scheme above uses a fixed window size S. A natural idea is to use windows of multiple scales in parallel with ODE-LSTMs to extract n-gram features of different scales. The Multi-scale ODE-LSTM (MODE-LSTM) model is illustrated in Figure 2(a). Using the Triple-S operation described in Section 3.2, the sentence is converted into multiple mini-batches $[B_1, \cdots, B_M]$ based on the window sizes $[S_1, \cdots, S_M]$, where M is the number of scales. The mini-batches are then fed into different ODE-LSTMs to obtain the n-gram feature matrices:

$$H_m = \text{ODE-LSTM}_m(B_m) = [h_{m,1}; \cdots; h_{m,T}],$$

where $H_m \in \mathbb{R}^{T \times d}$ denotes the n-gram feature matrix of scale $S_m$, $m = 1, \cdots, M$, and $h_{m,t} \in \mathbb{R}^d$ denotes the t-th n-gram feature of scale $S_m$. Subsequently, we apply max pooling (MP) along the T-axis over each n-gram feature matrix to extract salient features for each scale, and then concatenate them to constitute the multi-scale feature representation $F \in \mathbb{R}^{M \times d}$:

$$F = [\text{MP}(H_1); \cdots; \text{MP}(H_M)].$$

Afterward, the feature representation $F$ is reshaped to a vector and fed into an MLP layer with a rectified linear unit (ReLU) activation function and a softmax layer for the final classification.
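The pooling-and-concatenation step above can be sketched as follows (a minimal illustration with our own function name, assuming each scale's feature matrix has already been computed):

```python
import torch

def multi_scale_features(Hs):
    """Max-pool each scale's n-gram feature matrix over time, then concatenate.

    Hs: list of M tensors, each of shape (T, d). Returns a flat (M*d,)
    vector, i.e., the reshaped multi-scale representation F.
    """
    pooled = [H.max(dim=0).values for H in Hs]  # MP along the T-axis, (d,) each
    return torch.cat(pooled)                    # concatenate the M scales
```

Because each scale is pooled independently, a salient phrase detected at one window size cannot be drowned out by features from another scale.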

Objective Function
The overall objective function includes a cross-entropy classification loss and the penalization losses of all ODE-LSTMs. It is defined as:

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} y_n \log \hat{y}_n + \lambda \sum_{m=1}^{M} \mathcal{L}_{P_m},$$

where N is the number of samples, $y_n$ and $\hat{y}_n$ are the ground-truth label and the softmax output respectively, $\mathcal{L}_{P_m}$ is the penalization term of the m-th ODE-LSTM, and $\lambda$ is a hyperparameter balancing the strength of the orthogonality constraint. We minimize this objective by back-propagation through time (BPTT).
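A short sketch of how the two terms combine in practice (our own helper, assuming per-scale penalty values have already been computed; `F.cross_entropy` averages over the N samples, matching the 1/N factor):

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, penalties, lam=0.01):
    """Cross-entropy averaged over the batch plus weighted orthogonality penalties.

    logits: (N, num_classes); targets: (N,) class indices;
    penalties: iterable of scalar penalty tensors, one per ODE-LSTM scale.
    """
    ce = F.cross_entropy(logits, targets)  # mean over the N samples
    return ce + lam * sum(penalties)
```

With lambda = 0, the objective reduces to plain cross-entropy; increasing lambda trades classification fit for diversity among the small hidden states.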

Experimental Setup
Datasets To evaluate the effectiveness of our model, we conduct experiments on eight widely studied datasets (Kim, 2014; Liu et al., 2017) for sentence classification. Statistics of these datasets are listed in Table 1. The datasets cover different topics: sentiment analysis of movie reviews (MR, SST2, SST5), customer reviews (CR), and idioms (IE); question type classification (TREC); and opinion (MPQA) or subjectivity (SUBJ) classification.
Table 1: Dataset statistics (c: number of classes; l: average sentence length; ml: maximum sentence length; Train/Dev/Test: split sizes).

Implementation Details
We initialize the word embeddings with 300D pre-trained GloVe vectors (Pennington et al., 2014) and incorporate 50D character embeddings, constructed by a convolution layer with max pooling, to avoid the out-of-vocabulary (OOV) problem. These two embeddings are concatenated as the input embeddings and fine-tuned along with the model parameters during training. We use three window scales, [5, 10, 15], to initialize the ODE-LSTMs. K is set to 2 and the size p of each small hidden state is set to 50 for each scale, which results in a 300D multi-scale feature representation for classification. For regularization, we employ dropout with rates of 0.2 and 0.5 on the input embeddings and the single MLP hidden layer, respectively, and apply L2 regularization with a factor of 0.001 to the weights of the softmax layer. The hyperparameter λ is set to 0.01, and the batch size is set to 50. Our model is optimized by Adam with a learning rate of 1e-3. Similar to (Kim, 2014), these hyperparameters are determined by a grid search on the MR dataset and applied to the other datasets.

Table 2: Experimental accuracy comparison of our model and baselines on eight sentence classification benchmarks. "#Params" represents the approximate number of parameters, excluding input embeddings. The results of models marked with * are obtained by our implementation; the input embeddings used in these baselines are the same as in our models, and their other settings are consistent with their references. The remaining results are collected from the corresponding papers. A model marked with † (‡) means MODE-LSTM (with BERT base) is significantly superior to the compared model by a paired t-test (Wilcoxon, 1945) at the p < 0.05 level.
We set K to 6 and the size p of the small hidden states to 50 to make the number of parameters consistent with MODE-LSTM. Table 2 reports the performance of our approaches against other methods. With fewer parameters, MODE-LSTM significantly outperforms the compared models and is superior to DLSTM with an average accuracy gain of over 1.0%, because ours disentangles the RNN hidden states and considers multi-scale features in sentences. Meanwhile, our model achieves better or similar performance compared with the recent state-of-the-art model HAC. HAC is a complex model that uses deep dilated convolutional layers with a capsule module at each layer, whereas our model is simple yet effective, like the one-layer TextCNN. Although TextCNN has fewer parameters than ours, its parameter count grows with the size of the filter window, whereas the parameters of our model are independent of the window size. ODE-LSTM also outperforms LSTM with an average accuracy gain of 0.7%, which verifies the benefit of the orthogonally constrained disentanglement.

To investigate how our model differs from the others, we visualize the convergence trends in Figure 3. We observe that the directly stacked C-LSTM (dark blue line) converges quickly on the training set but performs poorly on the development and test sets. Although the Self-Attentive model (dark green line) can alleviate feature redundancy through its attention mechanism, overfitting still occurs due to its large number of parameters. MODE-LSTM (red line) achieves better generalization performance on the development and test sets than the other models.

Combining MODE-LSTM with BERT
Recently, the pre-trained language model BERT (Devlin et al., 2018) has proven more effective than conventional word embeddings when fine-tuned on downstream tasks. Compared with word embeddings, BERT can learn context-dependent sentence representations. Nevertheless, recent work (Xu et al., 2019) has indicated that the self-attention used in BERT disperses the attention distribution and thus overlooks essential neighboring elements and phrasal patterns. MODE-LSTM can explicitly extract multi-scale local features, which is complementary to the BERT representation. Hence, we try to combine MODE-LSTM with BERT to further improve the generalization performance of our model. Concretely, the sentence is fed into the BERT base model, and the hidden representation of the last layer of BERT base is used as the input embeddings of MODE-LSTM instead of the GloVe and character embeddings. BERT provides contextualized sentence-level representations, which help MODE-LSTM understand sentence semantics more accurately. The detailed diagram and the hyperparameter settings of this configuration can be found in the appendix.
We compare MODE-LSTM equipped with BERT (MODE-LSTM + BERT) with some recent strong baselines that also use pre-trained sentence representations, including InferSent (Conneau et al., 2017), ELMo combined with bag-of-words (BOW + ELMo) (Perone et al., 2018) or with HAC (HAC + ELMo) (Zheng et al., 2019), the universal sentence encoder (USE) (Cer et al., 2018), and BERT. The results are shown in the bottom rows of Table 2. Using the BERT representation, MODE-LSTM can further boost its generalization performance. Although BERT already provides strong performance on almost all datasets, it may tend to ignore local phrasal information due to the self-attention mechanism. Therefore, the combination of MODE-LSTM and BERT can further improve predictive power, which indicates that our model can better understand the semantic meaning. Notably, our model without BERT already surpasses some pre-trained models, such as InferSent and BOW + ELMo, and is comparable to USE, verifying its effectiveness and generalization.

Ablation Study
In this section, we study the independent effect of each component of our proposed model: the window scales, the penalization loss, and the character embeddings. The results are reported in Table 3. Compared to using multiple windows with different scales (Row 1), using a single scale (Rows 2-4) significantly reduces the accuracy. This demonstrates the necessity of integrating multi-scale windows to learn variable-size phrases in sentences. We can also see that eliminating the penalization loss (Row 5) or the character embeddings (Row 6) hurts performance, which verifies that these components are beneficial to our model.

Table 4: Case study of our model compared to TextCNN and DLSTM. "G.T." is the ground truth; "N" and "P" represent Negative and Positive. Words with dotted lines, underlines, and wavy lines correspond to the important positions extracted by TextCNN, DLSTM, and MODE-LSTM respectively.

Example | G.T. | TextCNN | DLSTM | Ours
1. While it's genuinely cool to hear characters talk about early rap records sugar hill gang etc the constant referencing of hip-hop arcana can alienate even the savviest audiences. | N | P | N | N
2. I admire it and yet cannot recommend it because it overstays its natural running time. | N | P | P | N

Case study
To explore why our model outperforms TextCNN and DLSTM, we display the most contributing positions in max pooling using the visualization technique introduced in (Li et al., 2015). Table 4 shows two examples from the MR dataset. In the first example, the CNN wrongly captures the key phrase "genuinely cool", so the sentence is misclassified as Positive, while DLSTM and our model capture the non-consecutive dependency signaled by the key word "while" and hence attend to the second half of the sentence for the correct classification. In the second example, all three models extract the key phrase "I admire it", which suggests classifying the sentence as Positive; accordingly, both TextCNN and DLSTM fail in this case. However, our model also extracts the key phrases "cannot" and "overstays its natural" by learning multi-scale features, so it obtains the correct answer.

Model Analysis
Impact of the value K To study the influence of the value K (the number of small hidden states), we conduct experiments on the MR and SUBJ datasets. We fix the multi-scale feature representation output by MODE-LSTM to 300D and tune the value of K; the larger K is, the smaller the size of each small hidden state. The results are reported in Figure 4(a). We find that K = 2 is a good trade-off between model accuracy and parameter count. When K is too large, the hidden size becomes too small to provide enough features, which degrades overall performance.
Impact of the window size We then explore the effect of window size when using only one scale. The optimal window size differs across datasets, as shown in Figure 4(b): for MR it is 5, while for SUBJ it is 20. We speculate that this is because sentences in SUBJ are longer than in MR, so long-term dependencies are more prominent.

Impact of training set size To further verify our model's generalization, we investigate the influence of the training set size. The results on MR are shown in Figure 5(a). MODE-LSTM outperforms the others with an accuracy gain of over 8% when only 100 training samples are available. As the size increases, the gain gradually decreases, but our model remains superior to the others.

Training time comparison We assess the training time of our model and DLSTM on MR using an NVIDIA GTX 1080ti GPU in Figure 5(b). When using a single scale window, each model's per-epoch training time increases with the window size due to the recurrent structure. However, our model's training time increases only marginally thanks to the parallelism enabled by the Triple-S operation, making it 5-10x faster than DLSTM, which processes windows in sequence. Since the window scales are independent and run in parallel, the training time of the multi-scale version mainly depends on the maximum window size. For example, with the same number of parameters and a maximum window size of 15, the multi-scale version trains in a time similar to the single-scale version on MR (15 vs. 13 s/epoch).

Conclusion
This study presents a novel parameter-efficient model called MODE-LSTM that can capture multi-scale n-gram features in sentences. Instead of following the tradition of stacking CNNs and RNNs or attaching over-parameterized attention mechanisms, our work provides a lightweight method for improving the ability of neural models for sentence classification. By disentangling the hidden states of the LSTM and equipping the structure with multiple sliding windows of different scales, MODE-LSTM outperforms popular CNN/RNN-based methods and hybrid methods on various benchmark datasets. In future work, we plan to validate its effectiveness for aspect-level sentiment classification.