A Deep Network with Visual Text Composition Behavior

While natural languages are compositional, how state-of-the-art neural models achieve compositionality is still unclear. We propose a deep network, which not only achieves competitive accuracy for text classification, but also exhibits compositional behavior. That is, while creating hierarchical representations of a piece of text, such as a sentence, the lower layers of the network distribute their layer-specific attention weights to individual words. In contrast, the higher layers compose meaningful phrases and clauses, whose lengths increase as the networks get deeper until fully composing the sentence.


Introduction
Deep neural networks leverage task-specific architectures to develop hierarchical representations of the input, where higher level representations are derived from lower level features (Conneau et al., 2016). Such hierarchical representations have visually demonstrated compositionality in image processing, i.e., pixels combine to form shapes and then contours (Farabet et al., 2013;Zeiler and Fergus, 2014). Natural languages are also compositional, i.e., words combine to form phrases and then sentences. Yet unlike in vision, how deep neural models in NLP, which mainly operate on distributed word embeddings, achieve compositionality, is still unclear (Li et al., 2015(Li et al., , 2016. We propose an Attention Gated Transformation (AGT) network, where each layer's feature generation is gated by a layer-specific attention mechanism (Bahdanau et al., 2014). Specifically, through distributing its attention to the original given text, each layer of the networks tends to in-crementally retrieve new words and phrases from the original text. The new knowledge is then combined with the previous layer's features to create the current layer's representation, thus resulting in composing longer or new phrases and clauses while creating higher layers' representations of the text.
Experiments on the Stanford Sentiment Treebank (Socher et al., 2013) dataset show that the AGT method not only achieves very competitive accuracy, but also exhibits compositional behavior via its layer-specific attention. We empirically show that, given a piece of text, e.g., a sentence, the lower layers of the networks select individual words, e.g, negative and conjunction words not and though, while the higher layers aim at composing meaningful phrases and clauses such as negation phrase not so much, where the phrase length increases as the networks get deeper until fully composing the whole sentence. Interestingly, after composing the sentence, the compositions of different sentence phrases compete to become the dominating features of the end task.

Attention Gated Transformation Network
Our AGT network was inspired by the Highway Networks (Srivastava et al., 2015a,b), where each layer is equipped with a transform gate.

Transform Gate for Information Flow
Consider a feedforward neural network with multiple layers. Each layer l typically applies a nonlinear transformation f (e.g., tanh, parameterized by W f l ), on its input, which is the output of the most recent previous layer (i.e., y l−1 ), to produce its output y l . Here, l = 0 indicates the first layer and y 0 is equal to the given input text x, namely y 0 = x: While in a highway network (the left column of Figure 1), an additional non-linear transform gate function T l is added to the l th (l > 0) layer: where the function T l expresses how much of the representation y l is produced by transforming the y l−1 (first term in Equation 2), and how much is just carrying from y l−1 (second term in Equation 2). Here T l is typically defined as: where W t l is the weight matrix and b t l the bias vector; σ is the non-linear activation function.
With transform gate T , the networks learn to decide if a feature transformation is needed at each layer. Suppose σ represents a sigmoid function. In such case, the output of T lies between zero and one. Consequently, when the transform gate is one, the networks pass through the transformation f over y l−1 and block the pass of input y l−1 ; when the gate is zero, the networks pass through the unmodified y l−1 , while the transformation f over y l−1 is suppressed.
The left column of Figure 1 reflects the highway networks as proposed by (Srivastava et al., 2015b). Our AGT method adds the right two columns of Figure 1. That is, 1) the transform gate T l now is not a function of y l−1 , but a function of the selection vector s + l , which is determined by the attention distributed to the given input x by the l th layer (will be discussed next), and 2) the function f takes as input the concatenation of y l−1 and s + l to create feature representation y l . These changes result in an attention gated transformation when forming hierarchical representations of the text.

Attention Gated Transformation
In AGT, the activation of the transform gate at each layer depends on a layer-specific attention mechanism. Formally, given a piece of text x, such as a sentence with N words, it can be represented as a matrix B ∈ IR N ×d . Each row of the matrix corresponds to one word, which is represented by a ddimensional vector as provided by a learned word embedding table. Consequently, the selection vector s + l , for the l th layer, is the softmax weighted sum over the N word vectors in B: with the weight (i.e., attention) d l,n computed as: here, w m l and W m l are the weight vector and weight matrix, respectively. By varying the attention weight d l,n , the s + l can focus on different rows of the matrix B, namely different words of the given text x, as illustrated by different color curves connecting to s + in Figure 1. Intuitively, one can consider s + as a learned word selection component: choosing different sets of words of the given text x by distributing its distinct attention.
Having built one s + for each layer from the given text x, the activation of the transform gate for layer l (l > 0) (i.e., Equation 3) is calculated: To generate feature representation y l , the function f takes as input the concatenation of y l−1 and s + l . That is, Equation 2 becomes: where [...;...] denotes concatenation. Thus, at each layer l, the gate T l can regulate either passing through y l−1 to form y l , or retrieving novel knowledge from the input text x to augment y l−1 to create a better representation for y l . Finally, as depicted in Figure 1, the feature representation of the last layer of the AGT is fed into two fully connected layers followed by a softmax function to produce a distribution over the possible target classes. For training, we use multi-class cross entropy loss.
Note that, Equation 8 indicates that the representation y l depends on both s + l and y l−1 . In other words, although Equation 7 states that the gate activation at layer l is computed by s + l , the gate activation is also affected by y l−1 , which embeds the information from the layers below l.
Intuitively, the AGT networks are encouraged to consider new words/phrases from the input text at higher layers. Consider the fact that the s + 0 at the bottom layer of the AGT only deploys a linear transformation of the bag-of-words features. If no new words are used at higher layers of the networks, it will be challenge for the AGT to sufficiently explore different combinations of word sets of the given text, which may be important for building an accurate classifier. In contrast, through tailoring its attention for new words at different layers, the AGT enables the words selected by a layer to be effectively combined with words/phrases selected by its previous layers to benefit the accuracy of the classification task (more discussions are presented in Section 3.2).

Main Results
The Stanford Sentiment Treebank data contains 11,855 movie reviews (Socher et al., 2013). We use the same splits for training, dev, and test data as in (Kim, 2014) to predict the finegrained 5-class sentiment categories of the sentences. For comparison purposes, following (Kim, 2014;Kalchbrenner et al., 2014;Lei et al., 2015), we trained the models using both phrases and sentences, but only evaluate sentences at test time. Also, we initialized all of the word embeddings (Cherry and Guo, 2015;Chen and Guo, 2015) using the 300 dimensional pre-trained vectors from GloVe (Pennington et al., 2014). We learned 15 layers with 200 dimensions each, which requires us to project the 300 dimensional word vectors; we implemented this using a linear transformation, whose weight matrix and bias term are shared across all words, followed by a tanh activation. For optimization, we used Adadelta (Zeiler, 2012), with learning rate of 0.0005, mini-batch of 50, transform gate bias of 1, and dropout (Srivastava et al., 2014) Table 1 presents the test-set accuracies obtained by different strategies. Results in Table 1 indicate that the AGT method achieved very competitive accuracy (with 50.5%), when compared to the state-of-the-art results obtained by the tree-LSTM (51.0%) (Tai et al., 2015;Zhu et al., 2015) and high-order CNN approaches (51.2%) (Lei et al., 2015).
Top subfigure in Figure 2 depicts the distribu-  , are normalized to the range between 0 and 1 within the sentence. The figure indicates that AGT generated very spiky attention distribution. That is, most of the attention weights are either 1 or 0. Based on these narrow, peaked bell curves formed by the normal distributions for the attention weights of 1 and 0, we here consider a word has been selected by the networks if its attention weight is larger than 0.95, i.e., receiving more than 95% of the full attention, and a phrase has been composed and selected if a set of consecutive words all have been selected.
In the bottom subfigure of Figure 2 we present the distribution of the phrase lengths on the test set. This figure indicates that the middle layers of the networks e.g., 8 th and 9 th , have longer phrases (green and blue curves) than others, while the layers at the two ends contain shorter phrases (red and pink curves).
In Figure 3, we also presented the transform gate activities on all test sentences (top) and that of the first example sentence in Figure 4 (bottom). These curves suggest that the transform gates at the middle layers (green and blue curves) tended to be close to zero, indicating the pass-through of lower layers' representations. On the contrary, the gates at the two ends (red and pink curves) tended to be away from zero with large tails, implying the retrieval of new knowledge from the input text. These are consistent with the results below. Figure 4 presents three sentences with various lengths from the test set, with the attention weights numbered and then highlighted in heat map. Figure 4 suggests that the lower layers of the networks selected individual words, while the higher layers aimed at phrases. For example, the first and second layers seem to select individual words carrying strong sentiment (e.g., predictable, bad, never and delicate), and conjunction and negation words (e.g., though and not). Also, meaningful phrases were composed and selected by later layers, such as not so much, not only... but also, bad taste, bad luck, emotional development, and big screen. In addition, in the middle layer, i.e., the 8 th layer, the whole sentences were composed by filtering out uninformative words, resulting in concise versions, as follows (selected words and phrases are highlighted in color blocks). 1) though plot predictable movie never feels formulaic attention nuances emotional development delicate characters 2) bad company leaves bad taste not only bad luck but also staleness script 3) not so much movie picture big screen Interestingly, if relaxing the word selection criteria, e.g., including words receiving more than the median, rather than 95%, of the full attention, the sentences recruited more conjunction and modification words, e.g., because, for, a, its and on, thus becoming more readable and fluent: 1) though plot is predictable movie never feels formulaic because attention is on nuances emotional development delicate characters 2) bad company leaves a bad taste not only because its bad luck timing but also staleness its script 3) not so much a movie a picture book for big screen Now, consider the AGT's compositional behavior for a specific sentence, e.g., the last sentence in Figure 4. The first layer solely selected the word not (with attention weight of 1 and all other words with weights close to 0), but the 2 nd to 4 th layers gradually pulled out new words book, screen and movie from the given text. Incrementally, the 5 th and 6 th layers further selected words to form phrases not so much, picture book, and big screen. Finally, the 7 th and 8 th layers added some    conjunction and quantification words a and for to make the sentence more fluent. This recursive composing process resulted in the sentence "not so much a movie a picture book for big screen". Interestingly, Figures 4 and 2 also imply that, after composing the sentences by the middle layer, the AGT networks shifted to re-focus on shorter phrases and informative words. Our analysis on the transform gate activities suggests that, during this re-focusing stage the compositions of sentence phrases competed to each others, as well as to the whole sentence composition, for the dominating task-specific features to represent the text.

Further Observations
As discussed at the end of Section 2.2, intuitively, including new words at different layers allows the networks to more effectively explore different combinations of word sets of the given text than that of using all words only at the bottom layer of the networks. Empirically, we observed that, if with only s + 0 in the AGT network, namely removing s + i for i > 0, the test-set accuracy dropped from 50.5% to 48.5%. In other words, transforming a linear combination of the bag-of-words features was insufficient for obtaining sufficient accuracy for the classification task. For instance, if being augmented with two more selection vectors s + i , namely removing s + i for i > 2, the AGT was able to improve its accuracy to 49.0%. Also, we observed that the AGT networks tended to select informative words at the lower layers. This may be caused by the recursive form of Equation 8, which suggests that the words retrieved by s + 0 have more chance to combine with and influence the selection of other feature words. In our study, we found that, for example, the top 3 most frequent words selected by the first layer of the AGT networks were all negation words: n't, never, and not. These are important words for sentiment classification (Zhu et al., 2014).
In addition, like the transform gate in the Highway networks (Srivastava et al., 2015a) and the forget gate in the LSTM (Gers et al., 2000), the attention-based transform gate in the AGT networks is sensitive to its bias initialization. We found that initializing the bias to one encouraged the compositional behavior of the AGT networks.

Conclusion and Future Work
We have presented a novel deep network. It not only achieves very competitive accuracy for text classification, but also exhibits interesting text compositional behavior, which may shed light on understanding how neural models work in NLP tasks. In the future, we aim to apply the AGT networks to incrementally generating natural text (Guo, 2015;Hu et al., 2017).