Hierarchical Attention Networks for Document Classification

We propose a hierarchical attention network for document classiﬁcation. Our model has two distinctive characteristics: (i) it has a hierarchical structure that mirrors the hierarchical structure of documents; (ii) it has two levels of attention mechanisms applied at the word-and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation. Experiments conducted on six large scale text classiﬁcation tasks demonstrate that the proposed architecture outperform previous methods by a substantial margin. Visualization of the attention layers illustrates that the model selects qualitatively informative words and sentences.


Introduction
Text classification is one of the fundamental task in Natural Language Processing. The goal is to assign labels to text. It has broad applications including topic labeling (Wang and Manning, 2012), sentiment classification (Maas et al., 2011;Pang and Lee, 2008), and spam detection (Sahami et al., 1998). Traditional approaches of text classification represent documents with sparse lexical features, such as n-grams, and then use a linear model or kernel methods on this representation (Wang and Manning, 2012;Joachims, 1998). More recent approaches used deep learning, such as convolutional neural networks (Blunsom et al., 2014) and recurrent neural networks based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) to learn text representations. pork belly = delicious . || scallops? || I don't even like scallops, and these were a-m-a-z-i-n-g . || fun and tasty cocktails. || next time I in Phoenix, I will go back here. || Highly recommend. Although neural-network-based approaches to text classification have been quite effective (Kim, 2014;Zhang et al., 2015;Johnson and Zhang, 2014;Tang et al., 2015), in this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture. The intuition underlying our model is that not all parts of a document are equally relevant for answering a query and that determining the relevant sections involves modeling the interactions of the words, not just their presence in isolation.
Our primary contribution is a new neural architecture ( §2), the Hierarchical Attention Network (HAN) that is designed to capture two basic insights about document structure. First, since documents have a hierarchical structure (words form sentences, sentences form a document), we likewise construct a document representation by first building representations of sentences and then aggregating those into a document representation. Second, it is observed that different words and sentences in a documents are differentially informative. Moreover, the impor-tance of words and sentences are highly context dependent, i.e. the same word or sentence may be differentially important in different context ( §3.5). To include sensitivity to this fact, our model includes two levels of attention mechanisms (Bahdanau et al., 2014;) -one at the word level and one at the sentence level -that let the model to pay more or less attention to individual words and sentences when constructing the representation of the document. To illustrate, consider the example in Fig. 1, which is a short Yelp review where the task is to predict the rating on a scale from 1-5. Intuitively, the first and third sentence have stronger information in assisting the prediction of the rating; within these sentences, the word delicious, a-m-a-z-i-n-g contributes more in implying the positive attitude contained in this review. Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision which can be of value in applications and analysis .
The key difference to previous work is that our system uses context to discover when a sequence of tokens is relevant rather than simply filtering for (sequences of) tokens, taken out of context. To evaluate the performance of our model in comparison to other common classification architectures, we look at six data sets ( §3). Our model outperforms previous approaches by a significant margin.

Hierarchical Attention Networks
The overall architecture of the Hierarchical Attention Network (HAN) is shown in Fig. 2. It consists of several parts: a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer. We describe the details of different components in the following sections.

GRU-based sequence encoder
The GRU (Bahdanau et al., 2014) uses a gating mechanism to track the state of sequences without using separate memory cells. There are two types of gates: the reset gate r t and the update gate z t . They together control how information is updated to the word encoder word attention state. At time t, the GRU computes the new state as This is a linear interpolation between the previous state h t−1 and the current new stateh t computed with new sequence information. The gate z t decides how much past information is kept and how much new information is added. z t is updated as: where x t is the sequence vector at time t. The candidate stateh t is computed in a way similar to a traditional recurrent neural network (RNN): Here r t is the reset gate which controls how much the past state contributes to the candidate state. If r t is zero, then it forgets the previous state. The reset gate is updated as follows:

Hierarchical Attention
We focus on document-level classification in this work. Assume that a document has L sentences s i and each sentence contains T i words. w it with t ∈ [1, T ] represents the words in the ith sentence. The proposed model projects the raw document into a vector representation, on which we build a classifier to perform document classification. In the following, we will present how we build the document level vector progressively from word vectors by using the hierarchical structure.
Word Encoder Given a sentence with words w it , t ∈ [0, T ], we first embed the words to vectors through an embedding matrix W e , x ij = W e w ij . We use a bidirectional GRU (Bahdanau et al., 2014) to get annotations of words by summarizing information from both directions for words, and therefore incorporate the contextual information in the annotation. The bidirectional GRU contains the forward GRU − → f which reads the sentence s i from w i1 to w iT and a backward GRU ← − f which reads from w iT to w i1 : We obtain an annotation for a given word w it by concatenating the forward hidden state − → h it and backward hidden state , which summarizes the information of the whole sentence centered around w it .
Note that we directly use word embeddings. For a more complete model we could use a GRU to get word vectors directly from characters, similarly to (Ling et al., 2015). We omitted this for simplicity.
Word Attention Not all words contribute equally to the representation of the sentence meaning. Hence, we introduce attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector. Specifically, That is, we first feed the word annotation h it through a one-layer MLP to get u it as a hidden representation of h it , then we measure the importance of the word as the similarity of u it with a word level context vector u w and get a normalized importance weight α it through a softmax function. After that, we compute the sentence vector s i (we abuse the notation here) as a weighted sum of the word annotations based on the weights. The context vector u w can be seen as a high level representation of a fixed query "what is the informative word" over the words like that used in memory networks (Sukhbaatar et al., 2015;Kumar et al., 2015). The word context vector u w is randomly initialized and jointly learned during the training process.
Sentence Encoder Given the sentence vectors s i , we can get a document vector in a similar way. We use a bidirectional GRU to encode the sentences: h i summarizes the neighbor sentences around sentence i but still focus on sentence i.

Sentence Attention
To reward sentences that are clues to correctly classify a document, we again use attention mechanism and introduce a sentence level context vector u s and use the vector to measure the importance of the sentences. This yields where v is the document vector that summarizes all the information of sentences in a document. Similarly, the sentence level context vector can be randomly initialized and jointly learned during the training process.

Document Classification
The document vector v is a high level representation of the document and can be used as features for doc-ument classification: We use the negative log likelihood of the correct labels as training loss: where j is the label of document d.

Data sets
We evaluate the effectiveness of our model on six large scale document classification data sets. These data sets can be categorized into two types of document classification tasks: sentiment estimation and topic classification. The statistics of the data sets are summarized in Table 1. We use 80% of the data for training, 10% for validation, and the remaining 10% for test, unless stated otherwise.
Yelp reviews are obtained from the Yelp Dataset Challenge in 2013, 2014 and 2015 (Tang et al., 2015). There are five levels of ratings from 1 to 5 (higher is better). IMDB reviews are obtained from (Diao et al., 2014). The ratings range from 1 to 10. Yahoo answers are obtained from (Zhang et al., 2015). This is a topic classification task with 10 classes:  (Zhang et al., 2015). The ratings are from 1 to 5. 3,000,000 reviews are used for training and 650,000 reviews for testing. Similarly, we use 10% of the training samples as validation.

Baselines
We compare HAN with several baseline methods, including traditional approaches such as linear methods, SVMs and paragraph embeddings using neural networks, LSTMs, word-based CNN, character-based CNN, and Conv-GRNN, LSTM-GRNN. These baseline methods and results are reported in (Zhang et al., 2015;Tang et al., 2015).

Linear methods
Linear methods (Zhang et al., 2015) use the constructed statistics as features. A linear classifier based on multinomial logistic regression is used to classify the documents using the features.
BOW and BOW+TFIDF The 50,000 most frequent words from the training set are selected and the count of each word is used features. Bow+TFIDF, as implied by the name, uses the TFIDF of counts as features. n-grams and n-grams+TFIDF used the most frequent 500,000 n-grams (up to 5-grams). Bag-of-means The average word2vec embedding (Mikolov et al., 2013) is used as feature set.
Text Features are constructed according to (Kiritchenko et al., 2014), including word and character n-grams, sentiment lexicon features etc. AverageSG constructs 200-dimensional word vectors using word2vec and the average word embeddings of each document are used. SSWE uses sentiment specific word embeddings according to (Tang et al., 2014).

Neural Network methods
The neural network based methods are reported in (Tang et al., 2015) and (Zhang et al., 2015).
CNN-word Word based CNN models like that of (Kim, 2014) are used. CNN-char Character level CNN models are reported in (Zhang et al., 2015). LSTM takes the whole document as a single sequence and the average of the hidden states of all words is used as feature for classification. Conv-GRNN and LSTM-GRNN were proposed by (Tang et al., 2015). They also explore the hierarchical structure: a CNN or LSTM provides a sentence vector, and then a gated recurrent neural network (GRNN) combines the sentence vectors from a document level vector representation for classification.

Model configuration and training
We split documents into sentences and tokenize each sentence using Stanford's CoreNLP (Manning et al., 2014). We only retain words appearing more than 5 times in building the vocabulary and replace the words that appear 5 times with a special UNK token. We obtain the word embedding by training an unsupervised word2vec (Mikolov et al., 2013) model on the training and validation splits and then use the word embedding to initialize W e . The hyper parameters of the models are tuned on the validation set. In our experiments, we set the word embedding dimension to be 200 and the GRU dimension to be 50. In this case a combination of forward and backward GRU gives us 100 dimensions for word/sentence annotation. The word/sentence context vectors also have a dimension of 100, initialized at random.
For training, we use a mini-batch size of 64 and documents of similar length (in terms of the number of sentences in the documents) are organized to be a batch. We find that length-adjustment can accelerate training by three times. We use stochastic gradient descent to train all models with momentum of 0.9. We pick the best learning rate using grid search on the validation set.

Results and analysis
The experimental results on all data sets are shown in Table 2. We refer to our models as HN-{AVE, MAX, ATT}. Here HN stands for Hierarchical Network, AVE indicates averaging, MAX indicates max-pooling, and ATT indicates our proposed hierarchical attention model. Results show that HN-ATT gives the best performance across all data sets.
The improvement is regardless of data sizes. For smaller data sets such as Yelp 2013 and IMDB, our model outperforms the previous best baseline methods by 3.1% and 4.1% respectively. This finding is consistent across other larger data sets. Our model outperforms previous best models by 3.2%, 3.4%, 4.6% and 6.0% on Yelp 2014, Yelp 2015, Yahoo Answers and Amazon Reviews. The improvement also occurs regardless of the type of task: sentiment classification, which includes Yelp 2013-2014, IMDB, Amazon Reviews and topic classification for Yahoo Answers.
From Table 2 we can see that neural network based methods that do not explore hierarchical document structure, such as LSTM, CNN-word, CNNchar have little advantage over traditional methods for large scale (in terms of document size) text classification. E.g. SVM+TextFeatures gives performance 59. 8, 61.8, 62.4, 40.5 for Yelp 20138, 61.8, 62.4, 40.5 for Yelp , 20148, 61.8, 62.4, 40.5 for Yelp , 2015   combined with hierarchical structure improves over previous models (LSTM-GRNN) by 3.1%, 3.4%, 3.5% and 4.1% respectively. More interestingly, in the experiments, HN-AVE is equivalent to using non-informative global word/sentence context vectors (e.g., if they are all-zero vectors, then the attention weights in Eq. 6 and 9 become uniform weights). Compared to HN-AVE, the HN-ATT model gives superior performance across the board. This clearly demonstrates the effectiveness of the proposed global word and sentence importance vectors for the HAN.

Context dependent attention weights
If words were inherently important or not important, models without attention mechanism might work well since the model could automatically assign low weights to irrelevant words and vice versa. However, the importance of words is highly context dependent. For example, the word good may appear in a review that has the lowest rating either because users are only happy with part of the product/service or because they use it in a negation, such as not good. To verify that our model can capture context dependent word importance, we plot the distribution of the attention weights of the words good and bad from the test split of Yelp 2013 data set as shown in Figure 3(a) and Figure 4(a). We can see that the distribution has a attention weight assigned to a word from 0 to 1. This indicates that our model captures diverse context and assign context-dependent weight to the words.
For further illustration, we plot the distribution when conditioned on the ratings of the review. Subfigures 3(b)-(f) in Figure 3 and Figure 4 correspond to the rating 1-5 respectively. In particular, Figure 3(b) shows that the weight of good concentrates on the low end in the reviews with rating 1. As the rating increases, so does the weight distribution. This means that the word good plays a more important role for reviews with higher ratings. We can observe the converse trend in Figure 4 for the word bad. This confirms that our model can capture the context-dependent word importance. tively. Contrary to before, the word bad is considered important for poor ratings and less so for good ones.

Visualization of attention
In order to validate that our model is able to select informative sentences and words in a document, we visualize the hierarchical attention layers in Figures 5  and 6 for several documents from the Yelp 2013 and Yahoo Answers data sets.
Every line is a sentence (sometimes sentences spill over several lines due to their length). Red denotes the sentence weight and blue denotes the word weight. Due to the hierarchical structure, we normalize the word weight by the sentence weight to make sure that only important words in important sentences are emphasized. For visualization purposes we display √ p s p w . The √ p s term displays the important words in unimportant sentences to ensure that they are not totally invisible. Figure 5 shows that our model can select the words carrying strong sentiment like delicious, amazing, terrible and their corresponding sentences.
Sentences containing many words like cocktails, pasta, entree are disregarded. Note that our model can not only select words carrying strong sentiment, it can also deal with complex across-sentence context. For example, there are sentences like i don't even like scallops in the first document of Fig. 5, if looking purely at the single sentence, we may think this is negative comment. However, our model looks at the context of this sentence and figures out this is a positive review and chooses to ignore this sentence.
Our hierarchical attention mechanism also works well for topic classification in the Yahoo Answer data set. For example, for the left document in Figure 6 with label 1, which denotes Science and Mathematics, our model accurately localizes the words zebra, strips, camouflage, predator and their corresponding sentences. For the right document with label 4, which denotes Computers and Internet, our model focuses on web, searches, browsers and their corresponding sentences. Note that this happens in a multiclass setting, that is, detection happens before the selection of the topic! 4 Related Work Kim (2014) use neural networks for text classification. The architecture is a direct application of CNNs, as used in computer vision (LeCun et al., 1998), albeit with NLP interpretations. Johnson and Zhang (2014) explores the case of directly using a high-dimensional one hot vector as input. They find that it performs well. Unlike word level modelings, Zhang et al. (2015) apply a character-level CNN for text classification and achieve competitive results. Socher et al. (2013) use recursive neural networks for text classification. Tai et al. (2015) GT: 4 Prediction: 4 pork belly = delicious . scallops ? i do n't . even . like . scallops , and these were a-m-a-z-i-n-g . fun and tasty cocktails . next time i 'm in phoenix , i will go back here . highly recommend .  GT: clean up temporary internet files . " explore the structure of a sentence and use a treestructured LSTMs for classification. There are also some works that combine LSTM and CNN structure to for sentence classification (Lai et al., 2015;. Tang et al. (2015) use hierarchical structure in sentiment classification. They first use a CNN or LSTM to get a sentence vector and then a bi-directional gated recurrent neural network to compose the sentence vectors to get a document vectors. There are some other works that use hierarchical structure in sequence generation  and language modeling (Lin et al., 2015). The attention mechanism was proposed by (Bahdanau et al., 2014) in machine translation. The encoder decoder framework is used and an attention mechanism is used to select the reference words in original language for words in foreign language before translation.  uses the attention mechanism in image caption generation to select the relevant image regions when generating words in the captions. Further uses of the attention mechanism include parsing (Vinyals et al., 2014), natural language question answering (Sukhbaatar et al., 2015;Kumar et al., 2015;Hermann et al., 2015), and image question answering . Unlike these works, we explore a hierarchical attention mechanism (to the best of our knowledge this is the first such instance).

Conclusion
In this paper, we proposed hierarchical attention networks (HAN) for classifying documents. As a convenient side-effect we obtained better visualization using the highly informative components of a document. Our model progressively builds a document vector by aggregating important words into sentence vectors and then aggregating important sentences vectors to document vectors. Experimental results demonstrate that our model performs significantly better than previous methods. Visualization of these attention layers illustrates that our model is effective in picking out important words and sentences.