Sentence-State LSTM for Text Representation

Bi-directional LSTMs are a powerful tool for text representation. On the other hand, they have been shown to suffer various limitations due to their sequential nature. We investigate an alternative LSTM structure for encoding text, which consists of a parallel state for each word. Recurrent steps are used to perform local and global information exchange between words simultaneously, rather than incremental reading of a sequence of words. Results on various classification and sequence labelling benchmarks show that the proposed model has strong representation power, giving highly competitive performances compared to stacked BiLSTM models with similar parameter numbers.


Introduction
Neural models have become the dominant approach in the NLP literature. Compared to handcrafted indicator features, neural sentence representations are less sparse, and more flexible in encoding intricate syntactic and semantic information. Among various neural networks for encoding sentences, bi-directional LSTMs (BiLSTM) (Hochreiter and Schmidhuber, 1997) have been a dominant method, giving state-of-the-art results in language modelling (Sundermeyer et al., 2012), machine translation (Bahdanau et al., 2015), syntactic parsing (Dozat and Manning, 2017) and question answering (Tan et al., 2015).
Despite their success, BiLSTMs have been shown to suffer from several limitations. For example, their inherently sequential nature makes computation non-parallel within the same sentence (Vaswani et al., 2017), which can lead to a computational bottleneck, hindering their use in industry. In addition, local ngrams, which have been shown to be a highly useful source of contextual information for NLP, are not explicitly modelled (Wang et al., 2016). Finally, sequential information flow leads to relatively weaker power in capturing long-range dependencies, which results in lower performance in encoding longer sentences (Koehn and Knowles, 2017).
We investigate an alternative recurrent neural network structure for addressing these issues. As shown in Figure 1, the main idea is to model the hidden states of all words simultaneously at each recurrent step, rather than one word at a time. In particular, we view the whole sentence as a single state, which consists of sub-states for individual words and an overall sentence-level state. To capture local and non-local contexts, states are updated recurrently by exchanging information between each other. Consequently, we refer to our model as sentence-state LSTM, or S-LSTM for short. Empirically, S-LSTM can give effective sentence encoding after 3-6 recurrent steps. In contrast, the number of recurrent steps necessary for BiLSTM scales with the size of the sentence.
At each recurrent step, information exchange is conducted between consecutive words in the sentence, and between the sentence-level state and each word. In particular, each word receives information from its predecessor and successor simultaneously. From an initial state without information exchange, each word-level state can obtain 3-gram, 5-gram and 7-gram information after 1, 2 and 3 recurrent steps, respectively. Being connected with every word, the sentence-level state vector serves to exchange non-local information with each word. In addition, it can also be used as a global sentence-level representation for classification tasks.
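The growth of the local context described above can be checked with a small calculation. The sketch below is illustrative only: the function name `context_width` is hypothetical, and it ignores both the sentence-level state and sentence boundaries, counting only what a word-level state can reach through neighbour-to-neighbour exchange.

```python
def context_width(steps: int, window: int = 1) -> int:
    """Number of consecutive words a word-level state has seen after `steps`
    recurrent steps, when each step exchanges information with `window`
    neighbours on each side (window=1 is the default S-LSTM setting)."""
    if steps < 0 or window < 1:
        raise ValueError("steps must be >= 0 and window must be >= 1")
    # Each step extends the visible span by `window` words on both sides.
    return 2 * window * steps + 1


for t in (1, 2, 3):
    print(f"after {t} step(s): {context_width(t)}-gram context")
```

With the default window this reproduces the 3-gram, 5-gram and 7-gram contexts after 1, 2 and 3 steps.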
Results on both classification and sequence labelling show that S-LSTM gives better accuracies compared to BiLSTM using the same number of parameters, while being faster. We release our code and models at https://github.com/leuchine/S-LSTM, which include all baselines and the final model.

Related Work
LSTM (Graves and Schmidhuber, 2005) showed its early potential in NLP when a neural machine translation system that leverages LSTM source encoding gave highly competitive results compared to the best SMT models (Bahdanau et al., 2015). LSTM encoders have since been explored for other tasks, including syntactic parsing (Dyer et al., 2015), text classification (Yang et al., 2016) and machine reading (Hermann et al., 2015). Bidirectional extensions have become a standard configuration for achieving state-of-the-art accuracies among various tasks (Wen et al., 2015; Ma and Hovy, 2016; Dozat and Manning, 2017). S-LSTMs are similar to BiLSTMs in their recurrent bi-directional message flow between words, but different in the design of state transition.
CNNs (Krizhevsky et al., 2012) also allow better parallelisation compared to LSTMs for sentence encoding (Kim, 2014), thanks to parallelism among convolution filters. On the other hand, convolution features embody only fixed-size local ngram information, whereas sentence-level feature aggregation via pooling can lead to loss of information (Sabour et al., 2017). In contrast, S-LSTM uses a global sentence-level node to assemble and back-distribute local information in the recurrent state transition process, suffering less information loss compared to pooling.
Attention (Bahdanau et al., 2015) has recently been explored as a standalone method for sentence encoding, giving competitive results compared to BiLSTM encoders for neural machine translation (Vaswani et al., 2017). The attention mechanism allows parallelisation, and can play a similar role to the sentence-level state in S-LSTMs. In contrast to hierarchical attention, the sentence-level state uses neural gates to integrate word-level information. S-LSTM further allows local communication between neighbouring words.
Hierarchical stacking of CNN layers (LeCun et al., 1995; Kalchbrenner et al., 2014; Papandreou et al., 2015; Dauphin et al., 2017) allows better interaction between non-local components in a sentence via incremental levels of abstraction. S-LSTM is similar to hierarchical attention and stacked CNN in this respect, incrementally refining sentence representations. However, S-LSTM models hierarchical encoding of sentence structure as a recurrent state transition process. In nature, our work belongs to the family of LSTM sentence representations.
S-LSTM is inspired by message passing over graphs (Murphy et al., 1999; Scarselli et al., 2009). Graph-structure neural models have been used for computer program verification (Li et al., 2016) and image object detection (Liang et al., 2016). The closest previous work in NLP includes the use of convolutional neural networks (Bastings et al., 2017) and DAG LSTMs (Peng et al., 2017) for modelling syntactic structures. Compared to our work, their motivations and network structures are highly different. In particular, the DAG LSTM of Peng et al. (2017) is a natural extension of tree LSTM (Tai et al., 2015), and is sequential rather than parallel in nature. To our knowledge, we are the first to investigate a graph RNN for encoding sentences, proposing parallel graph states for integrating word-level and sentence-level information. From this perspective, our contribution is similar to that of Kim (2014) and Bahdanau et al. (2015) in introducing a neural representation to the NLP literature.

Model
Given a sentence $s = w_1, w_2, \ldots, w_n$, where $w_i$ represents the $i$th word and $n$ is the sentence length, our goal is to find a neural representation of $s$, which consists of a hidden vector $h_i$ for each input word $w_i$, and a global sentence-level hidden vector $g$. Here $h_i$ represents syntactic and semantic features for $w_i$ under the sentential context, while $g$ represents features for the whole sentence. Following previous work, we additionally add <s> and </s> to the two ends of the sentence as $w_0$ and $w_{n+1}$, respectively.

Baseline BiLSTM
The baseline BiLSTM model consists of two LSTM components, which process the input in the forward left-to-right and the backward right-to-left directions, respectively. In each direction, the reading of input words is modelled as a recurrent process with a single hidden state. Given an initial value, the state changes its value recurrently, each time consuming an incoming word.
Take the forward LSTM component for example. Denoting the initial state as $\overrightarrow{h}^0$, which is a model parameter, the recurrent state transition step for calculating $\overrightarrow{h}^1, \ldots, \overrightarrow{h}^{n+1}$ is defined by Eq 1 (Graves and Schmidhuber, 2005), where $x_t$ denotes the word representation of $w_t$; $i^t$, $o^t$, $f^t$ and $u^t$ represent the values of an input gate, an output gate, a forget gate and an actual input at time step $t$, respectively, which control the information flow for a recurrent cell $\overrightarrow{c}^t$ and the state vector $\overrightarrow{h}^t$.
The backward LSTM component follows the same recurrent state transition process as described in Eq 1. Starting from an initial state $\overleftarrow{h}^{n+1}$, which is a model parameter, it reads the inputs $x_n, x_{n-1}, \ldots, x_0$ in turn.
The BiLSTM model uses the concatenated value of $\overrightarrow{h}^t$ and $\overleftarrow{h}^t$ as the hidden vector for $w_t$:
$$h^t = [\overrightarrow{h}^t; \overleftarrow{h}^t]$$
A single hidden vector representation $g$ of the whole input sentence can be obtained using the final state values of the two LSTM components:
$$g = [\overrightarrow{h}^{n+1}; \overleftarrow{h}^0]$$
Stacked BiLSTM. Multiple layers of BiLSTMs can be stacked for increased representation power, where the hidden vectors of a lower layer are used as inputs for an upper layer. Different model parameters are used in each stacked BiLSTM layer.
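The baseline can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: it uses the common sigmoid-gated LSTM formulation with the gate names from the text (i, o, f, u), which may differ in detail from the original Eq 1, and it initialises states to zeros rather than learning them as parameters. The function names and the `params` layout are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """Consume one word vector x_t and return the updated (h, c).

    params maps each of "i", "o", "f", "u" to a tuple (W, U, b).
    """
    pre = {k: W @ x_t + U @ h_prev + b for k, (W, U, b) in params.items()}
    i, o, f = sigmoid(pre["i"]), sigmoid(pre["o"]), sigmoid(pre["f"])
    u = np.tanh(pre["u"])            # the "actual input"
    c_t = f * c_prev + i * u         # gated cell update
    h_t = o * np.tanh(c_t)           # gated output
    return h_t, c_t


def run_direction(xs, params, hidden):
    """Read xs one word at a time; the number of steps equals len(xs)."""
    h, c = np.zeros(hidden), np.zeros(hidden)  # assumption: zero initial state
    states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        states.append(h)
    return states


def bilstm(xs, fwd_params, bwd_params, hidden):
    """Return per-word vectors [fwd; bwd] and a sentence vector g."""
    fwd = run_direction(xs, fwd_params, hidden)
    bwd = run_direction(xs[::-1], bwd_params, hidden)[::-1]
    hs = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
    g = np.concatenate([fwd[-1], bwd[0]])  # final state of each direction
    return hs, g
```

The loop in `run_direction` is the sequential bottleneck discussed in the introduction: each step needs the previous state, so the words of one sentence cannot be processed in parallel.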

Sentence-State LSTM
Formally, an S-LSTM state at time step $t$ can be denoted by:
$$H^t = \langle h_0^t, h_1^t, \ldots, h_{n+1}^t, g^t \rangle$$
which consists of a sub state $h_i^t$ for each word $w_i$ and a sentence-level sub state $g^t$.
S-LSTM uses a recurrent state transition process to model information exchange between sub states, which enriches state representations incrementally. For the initial state $H^0$, all sub states are set to an initial value $h^0$. The state transition from $H^{t-1}$ to $H^t$ consists of sub state transitions from $h_i^{t-1}$ to $h_i^t$ and from $g^{t-1}$ to $g^t$. We take an LSTM structure similar to the baseline BiLSTM for modelling state transition, using a recurrent cell $c_i^t$ for each $w_i$ and a cell $c_g^t$ for $g$.
As shown in Figure 1, the value of each $h_i^t$ is computed based on the values of $x_i$, $h_{i-1}^{t-1}$, $h_i^{t-1}$, $h_{i+1}^{t-1}$ and $g^{t-1}$, together with their corresponding cell values (Eq 2), where $\xi_i^t$ is the concatenation of hidden vectors of a context window, and $l_i^t$, $r_i^t$, $f_i^t$, $s_i^t$ and $i_i^t$ are gates that control information flow from $\xi_i^t$ and $x_i$ to $c_i^t$. In particular, $i_i^t$ controls information from the input $x_i$; $l_i^t$, $r_i^t$, $f_i^t$ and $s_i^t$ control information from the left context cell $c_{i-1}^{t-1}$, the right context cell $c_{i+1}^{t-1}$, $c_i^{t-1}$ and the sentence context cell $c_g^{t-1}$, respectively. The values of $i_i^t$, $l_i^t$, $r_i^t$, $f_i^t$ and $s_i^t$ are normalised such that they sum to 1. $o_i^t$ is an output gate from the cell state $c_i^t$ to the hidden state $h_i^t$.
The value of $g^t$ is computed based on the values of $h_i^{t-1}$ for all $i \in [0..n+1]$ (Eq 3), where $f_0^t, \ldots, f_{n+1}^t$ and $f_g^t$ are gates controlling information from $c_0^{t-1}, \ldots, c_{n+1}^{t-1}$ and $c_g^{t-1}$, respectively, which are normalised. $o^t$ is an output gate from the recurrent cell $c_g^t$ to $g^t$. $W_x$, $U_x$ and $b_x$ ($x \in \{g, f, o\}$) are model parameters.
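One state transition can be sketched in NumPy as below. The gate roles, the normalisation of the gates and the cell updates follow the description above; the remaining details are assumptions made for illustration and are marked as such in the comments: how each gate's pre-activation combines $\xi_i^t$, $x_i$ and $g^{t-1}$, the use of the mean of the word states as the summary seen by the sentence-level gates, the softmax-over-sigmoid normalisation, zero padding at the sentence boundaries, and zero initial states. All function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def shift(a, offset):
    """Row i of the result is row i+offset of a; out-of-sentence rows are zeros."""
    out = np.zeros_like(a)
    if offset > 0:
        out[:-offset] = a[offset:]
    elif offset < 0:
        out[-offset:] = a[:offset]
    else:
        out[:] = a
    return out


def slstm_step(x, h, c, g, c_g, p):
    """One transition H^{t-1} -> H^t. x: (N, d_in); h, c: (N, d); g, c_g: (d,)."""
    # Word-level update, computed for all N words at once.
    xi = np.concatenate([shift(h, -1), h, shift(h, +1)], axis=1)  # context window

    def pre(name):  # assumption: linear in xi, x_i and g^{t-1}
        W, U, V, b = p[name]
        return xi @ W + x @ U + g @ V + b

    # Gates i, l, r, f, s are normalised so that they sum to 1 elementwise.
    i_g, l_g, r_g, f_g, s_g = softmax(np.stack([sigmoid(pre(k)) for k in "ilrfs"]))
    o_g, u = sigmoid(pre("o")), np.tanh(pre("u"))
    c_new = l_g * shift(c, -1) + f_g * c + r_g * shift(c, +1) + s_g * c_g + i_g * u
    h_new = o_g * np.tanh(c_new)

    # Sentence-level update from the previous word states.
    h_bar = h.mean(axis=0)  # assumption: averaged word states as the summary
    (Wg, Ug, bg), (Wf, Uf, bf), (Wo, Uo, bo) = p["sent_g"], p["sent_f"], p["sent_o"]
    f_hat_g = sigmoid(g @ Wg + h_bar @ Ug + bg)   # gate on c_g, shape (d,)
    f_hat_i = sigmoid(g @ Wf + h @ Uf + bf)       # gate on each c_i, shape (N, d)
    o = sigmoid(g @ Wo + h_bar @ Uo + bo)
    f_all = softmax(np.vstack([f_hat_i, f_hat_g[None, :]]))  # normalised jointly
    c_g_new = f_all[-1] * c_g + (f_all[:-1] * c).sum(axis=0)
    g_new = o * np.tanh(c_g_new)
    return h_new, c_new, g_new, c_g_new


def init_params(rng, d_in, d):
    def mat(*shape):
        return rng.normal(scale=0.1, size=shape)
    p = {k: (mat(3 * d, d), mat(d_in, d), mat(d, d), np.zeros(d)) for k in "ilrfsou"}
    for k in ("sent_g", "sent_f", "sent_o"):
        p[k] = (mat(d, d), mat(d, d), np.zeros(d))
    return p


def encode(x, p, d, steps):
    """Run a fixed number of transitions, independent of sentence length."""
    n = x.shape[0]
    h, c, g, c_g = np.zeros((n, d)), np.zeros((n, d)), np.zeros(d), np.zeros(d)
    for _ in range(steps):
        h, c, g, c_g = slstm_step(x, h, c, g, c_g, p)
    return h, g
```

Unlike the BiLSTM loop, the loop in `encode` runs over recurrent steps rather than words, and every operation inside `slstm_step` is a batched matrix operation over all word positions.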

Contrast with BiLSTM
The difference between S-LSTM and BiLSTM can be understood with respect to their recurrent states. While BiLSTM uses only one state in each direction to represent the subsequence from the beginning to a certain word, S-LSTM uses a structural state to represent the full sentence, which consists of a sentence-level sub state and $n + 2$ word-level sub states, simultaneously. Different from BiLSTMs, for which $h^t$ at different time steps are used to represent $w_0, \ldots, w_{n+1}$, respectively, the word-level states $h_i^t$ and sentence-level state $g^t$ of S-LSTMs directly correspond to the goal outputs $h_i$ and $g$, as introduced in the beginning of this section. As $t$ increases from 0, $h_i^t$ and $g^t$ are enriched with increasingly deeper context information.
From the perspective of information flow, BiLSTM passes information from one end of the sentence to the other. As a result, the number of time steps scales with the size of the input. In contrast, S-LSTM allows bi-directional information flow at each word simultaneously, and additionally between the sentence-level state and every word-level state. At each step, each $h_i$ captures an increasingly larger ngram context, while additionally communicating globally to all other $h_j$ via $g$. The optimal number of recurrent steps is decided by the end-task performance, and does not necessarily scale with the sentence size. As a result, S-LSTM can potentially be both more efficient and more accurate compared with BiLSTMs.
Increasing window size. By default S-LSTM exchanges information only between neighbouring words, which can be seen as adopting a 1-word window on each side. The window size can be extended to 2, 3 or more words in order to allow more communication in a state transition, expediting information exchange. To this end, we modify Eq 2, integrating additional context words to $\xi_i^t$, with extended gates and cells. For example, with a window size of 2, $\xi_i^t = [h_{i-2}^{t-1}, h_{i-1}^{t-1}, h_i^{t-1}, h_{i+1}^{t-1}, h_{i+2}^{t-1}]$, with a gate and a cell term added for each additional context word. We study the effectiveness of window size in our experiments.
Additional sentence-level nodes. By default S-LSTM uses one sentence-level node. One way of enriching the parameter space is to add more sentence-level nodes, each communicating with word-level nodes in the same way as described by Eq 3. In addition, different sentence-level nodes can communicate with each other during state transition. When one sentence-level node is used for classification outputs, the other sentence-level nodes can serve as hidden memory units, or latent features. We study the effectiveness of multiple sentence-level nodes empirically.
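The extended context window can be built for all words at once by padding and slicing. The sketch below shows only the construction of the concatenated context vector for a given window size; the helper name `context_window` is hypothetical, and zero padding at the sentence boundaries is an assumption.

```python
import numpy as np


def context_window(h, window=1):
    """Concatenate, for every word i, the hidden vectors of words
    i-window .. i+window. h has shape (N, d); the result has shape
    (N, (2 * window + 1) * d). Positions outside the sentence are zeros."""
    n, _ = h.shape
    padded = np.pad(h, ((window, window), (0, 0)))  # zero rows at both ends
    # Slice k holds the neighbour at offset k - window for every position.
    return np.concatenate([padded[k:k + n] for k in range(2 * window + 1)], axis=1)


h = np.arange(8.0).reshape(4, 2)      # 4 words, hidden size 2
print(context_window(h, window=2).shape)
```

A larger window widens the input of every gate, so the corresponding weight matrices grow with the window size.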

Task settings
We consider two task settings, namely classification and sequence labelling. For classification, $g$ is fed to a softmax classification layer:
$$y = \mathrm{softmax}(W_c g + b_c)$$
where $y$ is the probability distribution of output class labels and $W_c$ and $b_c$ are model parameters. For sequence labelling, each $h_i$ can be used as the feature representation for a corresponding word $w_i$.
External attention. It has been shown that summation of hidden states using attention (Bahdanau et al., 2015; Yang et al., 2016) gives better accuracies compared to using the end states of BiLSTMs. We study the influence of attention on both S-LSTM and BiLSTM for classification. In particular, additive attention (Bahdanau et al., 2015) is applied to the hidden states of the input words to obtain a weighted sum as the sentence representation, with $W_\alpha$, $u$ and $b_\alpha$ as model parameters.
External CRF. For sequence labelling, we use a CRF layer on top of the hidden vectors $h_1, h_2, \ldots, h_n$ for calculating the conditional probabilities of label sequences (Ma and Hovy, 2016), where $W_s^{y_{i-1}, y_i}$ and $b_s^{y_{i-1}, y_i}$ are parameters specific to two consecutive labels $y_{i-1}$ and $y_i$.
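The classification layer is a single affine map followed by a softmax. A minimal NumPy sketch, with hypothetical names and toy dimensions:

```python
import numpy as np


def classify(g, W_c, b_c):
    """Return a probability distribution over classes for sentence vector g."""
    logits = W_c @ g + b_c
    logits = logits - logits.max()   # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()


rng = np.random.default_rng(0)
g = rng.normal(size=4)               # sentence-level state, hidden size 4
W_c, b_c = rng.normal(size=(2, 4)), np.zeros(2)   # 2 output classes
y = classify(g, W_c, b_c)
print(y, y.sum())
```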
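Additive attention pooling over word-level hidden vectors can be sketched as follows. The parameter names mirror $W_\alpha$, $b_\alpha$ and $u$ from the text, but the exact scoring function is an assumption here (the common form: a tanh projection scored against a vector $u$, followed by a softmax over positions); the function name is hypothetical.

```python
import numpy as np


def additive_attention(H, W_a, b_a, u):
    """Pool word vectors H (n, d) into one sentence vector.

    W_a: (k, d), b_a: (k,), u: (k,). Returns (pooled vector, weights)."""
    scores = np.tanh(H @ W_a.T + b_a) @ u    # one scalar score per word
    scores = scores - scores.max()           # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()              # attention weights sum to 1
    return alpha @ H, alpha


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                  # 5 words, hidden size 4
pooled, alpha = additive_attention(
    H, rng.normal(size=(3, 4)), np.zeros(3), rng.normal(size=3)
)
print(pooled.shape, alpha.sum())
```

The same pooling applies to either encoder, since it only consumes the per-word hidden vectors.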
For training, standard log-likelihood loss is used with L 2 regularization given a set of gold-standard instances.
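The training objective can be written as a negative log-likelihood plus an L2 penalty. The sketch below is illustrative: the function name is hypothetical, and whether the penalty is halved or the loss averaged over instances is not stated in the text, so a plain sum is assumed.

```python
import math


def regularised_nll(gold_probs, params, l2=0.001):
    """gold_probs: model probability of the gold label for each instance.
    params: flat iterable of parameter values. Assumes an unscaled sum."""
    nll = -sum(math.log(p) for p in gold_probs)
    penalty = l2 * sum(w * w for w in params)
    return nll + penalty


print(regularised_nll([0.9, 0.6], params=[0.5, -1.0], l2=0.001))
```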

Experiments
We empirically compare S-LSTMs and BiLSTMs on different classification and sequence labelling tasks. All experiments are conducted using a GeForce GTX 1080 GPU with 8GB memory.

(Ratinov and Roth, 2009). Statistics of the four datasets are shown in Table 1.
Hyperparameters. We initialise word embeddings using GloVe (Pennington et al., 2014) 300-dimensional embeddings. Embeddings are fine-tuned during model training for all tasks. Dropout (Srivastava et al., 2014) is applied to embedding hidden states, with a rate of 0.5. All models are optimised using the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.001 and a decay rate of 0.97. Gradients are clipped at 3 and a batch size of 10 is adopted. Sentences with similar lengths are batched together. The L2 regularization parameter is set to 0.001.
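The settings above can be collected in a small configuration, together with two helpers for the learning-rate schedule and length-based batching. The values are those stated in the text; the key names and helpers are hypothetical, and applying the 0.97 decay once per epoch is an assumption, since the decay interval is not specified.

```python
HPARAMS = {
    "embedding_dim": 300,    # GloVe embeddings, fine-tuned during training
    "dropout": 0.5,
    "learning_rate": 0.001,  # initial rate for Adam
    "lr_decay": 0.97,
    "grad_clip": 3.0,
    "batch_size": 10,
    "l2": 0.001,
}


def learning_rate(epoch, hp=HPARAMS):
    """Exponentially decayed rate; per-epoch decay is an assumption."""
    return hp["learning_rate"] * hp["lr_decay"] ** epoch


def length_batches(sentences, batch_size=HPARAMS["batch_size"]):
    """Sort sentences by length and cut into consecutive batches, so that
    each batch holds sentences of similar length and needs little padding."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```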

Development Experiments
We use the movie review development data to investigate different configurations of S-LSTMs and BiLSTMs. For S-LSTMs, the default configuration uses <s> and </s> words for augmenting words.
Hyperparameters. Table 2 shows the development results of various S-LSTM settings, where Time refers to training time per epoch. Without the sentence-level node, the accuracy of S-LSTM drops to 81.76%, demonstrating the necessity of global information exchange. Adding one additional sentence-level node as described in Section 3.2 does not lead to accuracy improvements, although the number of parameters and decoding time increase accordingly. As a result, we use only 1 sentence-level node for the remaining experiments. The accuracy of S-LSTM increases as the hidden layer size for each node increases from 100 to 300, but does not further increase when the size increases beyond 300. We fix the hidden size to 300 accordingly. Without using <s> and </s>, the performance of S-LSTM drops from 82.64% to 82.36%, showing the effectiveness of having these additional nodes. Hyperparameters for BiLSTM models are also set according to the development data, which we omit here.
State transition. In Table 2, the number of recurrent state transition steps of S-LSTM is decided according to the best development performance. Figure 2 draws the development accuracies of S-LSTMs with various window sizes against the number of recurrent steps. As can be seen from the figure, when the number of time steps increases from 1 to 11, the accuracies generally increase, before reaching a maximum value. This shows the effectiveness of recurrent information exchange in S-LSTM state transition.
On the other hand, no significant differences are observed in the peak accuracies given by different window sizes, although a larger window size (e.g. 4) generally results in faster plateauing. This can be explained by the intuition that information exchange between distant nodes, which takes more recurrent steps under a smaller window size, can be achieved using fewer steps under a larger window size. Considering efficiency, we choose a window size of 1 for the remaining experiments, setting the number of recurrent steps to 9 according to Figure 2.
S-LSTM vs BiLSTM. As shown in Table 3, BiLSTM gives significantly better accuracies compared to uni-directional LSTM, with the training time per epoch growing from 67 seconds to 106 seconds. Stacking 2 layers of BiLSTM gives further improvements to development results, with a larger time of 207 seconds. 3 layers of stacked BiLSTM do not further improve the results. In contrast, S-LSTM gives a development result of 82.64%, which is significantly better compared to 2-layer stacked BiLSTM, with a smaller number of model parameters and a shorter time of 65 seconds.
We additionally make comparisons with stacked CNNs and hierarchical attention (Vaswani et al., 2017), shown in Table 3 (the CNN and Transformer rows), where N indicates the number of attention layers. CNN is the most efficient among all models compared, with the smallest model size. On the other hand, a 3-layer stacked CNN gives an accuracy of 81.46%, which is also the lowest compared with BiLSTM, hierarchical attention and S-LSTM. The best performance of hierarchical attention is between single-layer and two-layer BiLSTMs in terms of both accuracy and efficiency. S-LSTM gives significantly better accuracies compared with both CNN and hierarchical attention.
Influence of external attention mechanism. Table 3 additionally shows the results of BiLSTM and S-LSTM when external attention is used as described in Section 3.3. Attention leads to improved accuracies for both BiLSTM and S-LSTM in classification, with S-LSTM still outperforming BiLSTM significantly. The result suggests that external techniques such as attention can play orthogonal roles compared with internal recurrent structures, therefore benefiting both BiLSTMs and S-LSTMs. Similar observations are found using external CRF layers for sequence labelling.

Final Results for Classification
The final results on the movie review and rich text classification datasets are shown in Tables 4 and 5, respectively. In addition to training time per epoch, test times are also reported. We use the best settings on the movie review development dataset for both S-LSTMs and BiLSTMs. The step number for S-LSTMs is set to 9.
As shown in Table 4, the final results on the movie review dataset are consistent with the development results, where S-LSTM outperforms BiLSTM significantly, with a faster speed. Observations on CNN and hierarchical attention are consistent with the development results. S-LSTM also gives highly competitive results when compared with existing methods in the literature. As shown in Table 5, among the 16 datasets of Liu et al. (2017), S-LSTM gives the best results on 12, compared with BiLSTM and 2-layer BiLSTM models. The average accuracy of S-LSTM is 85.6%, significantly higher compared with 84.9% by 2-layer stacked BiLSTM. 3-layer stacked BiLSTM gives an average accuracy of 84.57%, which is lower compared to a 2-layer stacked BiLSTM, with a training time per epoch of 423.6 seconds. The relative speed advantage of S-LSTM over BiLSTM is larger on the 16 datasets as compared to the movie review test set. This is because the average length of inputs is larger on the 16 datasets (see Section 4.5).

Final Results for Sequence Labelling
Bi-directional RNN-CRF structures, and in particular BiLSTM-CRFs, have achieved the state of the art in the literature for sequence labelling tasks, including POS-tagging and NER. We compare S-LSTM-CRF with BiLSTM-CRF for sequence labelling, using the same settings as decided on the movie review development experiments for both BiLSTMs and S-LSTMs. For the latter, we decide the number of recurrent steps on the respective development sets for sequence labelling. The POS accuracies and NER F1-scores against the number of recurrent steps are shown in Figure 3 (a) and (b), respectively. For POS tagging, the best step number is set to 7, with a development accuracy of 97.58%. For NER, the step number is set to 9, with a development F1-score of 94.98%.
As can be seen in Table 6, S-LSTM gives significantly better results compared with BiLSTM on the WSJ dataset. It also gives competitive accuracies as compared with existing methods in the literature. Stacking two layers of BiLSTMs leads to improved results compared to one-layer BiLSTM, but the accuracy does not further improve with three layers of stacked BiLSTMs. For NER (Table 7), S-LSTM gives an F1-score of 91.57% on the CoNLL test set, which is significantly better compared with BiLSTMs. Stacking more layers of BiLSTMs leads to slightly better F1-scores compared with a single-layer BiLSTM. Our BiLSTM results are comparable to the results reported by Ma and Hovy (2016) and Lample et al. (2016), who also use bidirectional RNN-CRF structures. In contrast, S-LSTM gives the best reported results under the same settings.
In the second section of Table 7, [...] learning using additional language model objectives, obtaining an F-score of 86.26%; Peters et al. (2017) leverage character-level language models, obtaining an F-score of 91.93%, which is the current best result on the dataset. All the three models are based on BiLSTM-CRF. On the other hand, these semi-supervised learning techniques are orthogonal to our work, and can potentially be used for S-LSTM also.

Analysis
Figure 4 (a) and (b) show the accuracies against the sentence length on the movie review and CoNLL datasets, respectively, where test samples are binned in batches of 80. We find that the performances of both S-LSTM and BiLSTM decrease as the sentence length increases. On the other hand, S-LSTM demonstrates relatively better robustness compared to BiLSTMs. This confirms our intuition that a sentence-level node can facilitate better non-local communication.
For these comparisons, we mix all training instances, order them by size, and put them into 10 equal groups, the median sentence lengths of which are shown. As can be seen from the figure, the speed advantage of S-LSTM is larger when the size of the input text increases, thanks to a fixed number of recurrent steps.
Similar to hierarchical attention (Vaswani et al., 2017), there is a relative disadvantage of S-LSTM in comparison with BiLSTM, which is that the memory consumption is relatively larger. For example, over the movie review development set, the actual GPU memory consumption by S-LSTM, BiLSTM, 2-layer stacked BiLSTM and 4-layer stacked BiLSTM is 252M, 89M, 146M and 253M, respectively. This is due to the fact that computation is performed in parallel by S-LSTM and hierarchical attention.
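The grouping used for the length-based comparison (order instances by size, split into 10 equal groups, report the median length of each) can be sketched with the standard library. The function names are hypothetical, and giving the first groups one extra item when the count is not divisible by 10 is an assumption.

```python
from statistics import median


def length_groups(lengths, n_groups=10):
    """Order lengths and split them into n_groups nearly equal groups."""
    ordered = sorted(lengths)
    size, rem = divmod(len(ordered), n_groups)
    groups, start = [], 0
    for k in range(n_groups):
        end = start + size + (1 if k < rem else 0)  # spread the remainder
        groups.append(ordered[start:end])
        start = end
    return groups


def median_lengths(lengths, n_groups=10):
    """Median sentence length of each non-empty group, shortest group first."""
    return [median(g) for g in length_groups(lengths, n_groups) if g]


print(median_lengths(range(1, 101)))
```

Timing each group separately then shows how training time varies with input length for each model.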

Conclusion
We have investigated S-LSTM, a recurrent neural network for encoding sentences, which offers richer contextual information exchange with more parallelism compared to BiLSTMs. Results on a range of classification and sequence labelling tasks show that S-LSTM outperforms BiLSTMs using the same number of parameters, demonstrating that S-LSTM can be a useful addition to the neural toolbox for encoding sentences.
The structural nature of S-LSTM states allows straightforward extension to tree structures, resulting in highly parallelisable tree LSTMs. We leave such investigation to future work. Future directions also include the application of S-LSTM to more NLP tasks, such as machine translation.