PRADO: Projection Attention Networks for Document Classification On-Device

Recently, there has been great interest in the development of small and accurate neural networks that run entirely on devices such as mobile phones, smart watches and IoT devices. This enables user privacy, consistent user experience and low latency. Although a wide range of applications have been targeted, from wake word detection to short text classification, there are no on-device networks for long text classification. We propose a novel projection attention neural network, PRADO, that combines trainable projections with attention and convolutions. We evaluate our approach on multiple large document text classification tasks. Our results show the effectiveness of the trainable projection model in finding semantically similar phrases and reaching high performance while maintaining compact size. Using this approach, we train tiny neural networks just 200 Kilobytes in size that improve over prior CNN and LSTM models and achieve near state-of-the-art performance on multiple long document classification tasks. We also apply our model to transfer learning and show its robustness and ability to further improve performance in limited data scenarios.


Introduction
One of the fundamental tasks in Natural Language Processing is long text classification. Given a document, the goal is to assign one or more categories of interest to the text. This task is of high importance as it has wide applications in spam detection (Jindal and Liu, 2007), product categorization (Kozareva, 2015) and sentiment classification (Pang and Lee, 2008), and it also plays an important role in improving document retrieval and ranking (Deerwester et al., 1990).
For a long time, the most successful text classification approaches relied on sparse lexical features such as n-grams, which are then used by linear or kernel models (Joachims, 1998; McCallum and Nigam, 1998; Joulin et al., 2016). However, with the recent advancements in deep learning, various neural network architectures like CNN (Kim, 2014), LSTM and hierarchical attention mechanisms (Yang et al., 2016) showed improvements in performance.
Recently, (Ravi and Kozareva, 2018) and (Ravi and Kozareva, 2019) showed the importance of building on-device neural models for short text classification, which preserve user privacy, enable consistent user experience and, most importantly, perform inference on the device. One of the biggest challenges is how to fit these large and complex neural networks on devices with limited memory and computation capacity while still maintaining high performance. (Ravi and Kozareva, 2018, 2019) developed on-device self-governing neural networks (SGNN and SGNN++) based on locality-sensitive projections (Ravi, 2017, 2019). Those methods were evaluated on short text classification tasks such as dialog act and user intent understanding and outperformed prior RNN work (Khanpour et al., 2016; Ortega and Vu, 2017).
In this work, we take one step further by proposing a novel projection attention neural network called PRADO. Unlike SGNN, which has static projections, PRADO combines trainable projections with attention and convolutions, allowing it to capture long range dependencies and making it a powerful and flexible approach for long text classification. We study the impact of different hyperparameters on accuracy vs model size. We also address the problem of producing compact architectures by developing a quantized version of PRADO. In a series of experimental evaluations on multiple long text classification tasks, we show that PRADO improves over prior baselines and neural networks such as character and word level CNNs and LSTMs. The main contributions of our work are as follows:
• A novel on-device projection attention neural network, PRADO, which combines trainable projections with attention and convolution for long text classification.
• Exhaustive experimental evaluation on multiple long text classification tasks, outperforming traditional feature engineered linear classifiers and deep learning approaches like CNN and LSTM.
• A quantized PRADO network, which results in a tiny model of just 200 Kilobytes that improves over prior CNN and LSTM models.
• Applicability of PRADO to transfer learning, showing its robustness and ability to further improve performance in limited data scenarios.

Related Work
Early work on text classification relied on sparse lexical features such as n-grams and linear classifiers (Joachims, 1998; McCallum and Nigam, 1998; Joulin et al., 2016). But with the rise of deep learning, various CNN and LSTM approaches led to significant improvements in performance and reached state-of-the-art results. (Kim, 2014) used a CNN architecture from computer vision for text classification. (Johnson and Zhang, 2015) used high-dimensional one-hot vectors and later introduced a character-level CNN that achieved even more competitive results. (Tai et al., 2015) used a tree structured LSTM for classification, while (Tang et al., 2015) used a CNN or LSTM to compute sentence vectors, followed by a bidirectional gated recurrent network that composes them into a document vector. Recently, (Yang et al., 2016) introduced hierarchical attention neural networks, which capture the document representation by incorporating knowledge of the document structure into the model. This approach reaches the best performance on a large set of text classification tasks. The aforementioned prior work mostly focuses on building the best neural network model independent of any model size or memory constraints. However, recent work by (Ravi and Kozareva, 2018, 2019) shows the importance of building on-device text classification models that preserve user privacy, provide consistent user experience and, most importantly, are compact in size while still achieving state-of-the-art results. Previously, to build lightweight text classification approaches, (Ravi, 2013) proposed fast sampling techniques, while (Bui et al., 2018) incorporated deep neural networks with graph learning. While successful, such approaches resulted in large models for response completion (Pang and Ravi, 2012) and Smart Reply (Kannan et al., 2016).
To address the challenge of fitting huge deep neural networks on-device, (Ravi and Kozareva, 2018) developed novel self-governing neural networks (SGNNs) that learn projections on the fly, leading to small models. SGNN was applied to short text classification tasks such as dialog act and user intent understanding and showed significant improvement over state-of-the-art RNN (Khanpour et al., 2016) and RNN with attention (Ortega and Vu, 2017) approaches. In this work, we take one step further by developing a trainable projection network with an attention mechanism that captures long range dependencies, making it a powerful and flexible approach for long text classification. In addition, we address the problem of producing compact architectures for text classification when we have a limited amount of memory. FastText (Joulin et al., 2016) proposed product quantization to store word embeddings and reported models that require two orders of magnitude less memory. We use quantization techniques and show that we achieve 10x to 100x compression rates while still maintaining high performance and improving upon prior CNN and LSTM work. Unlike prior on-device text classification work, we also apply our model in a transfer learning scenario, which demonstrates the robustness of our approach and its ability to further improve performance in limited data scenarios. Next, we describe the technical details of our approach, followed by experimental evaluation and results.
Figure 1 shows the overall architecture of our proposed network PRADO. It consists of a projected embedding layer, a convolutional and attention encoder mechanism and a final classification layer. We describe each component in detail below and contrast them with existing methods.

Projected Embedding Layer
Let us assume that the input text has $T$ tokens or words. $w_i$ represents the $i$-th word, where $i \in \{0, \ldots, T-1\}$. If $V$ is the number of words in the vocabulary, including an out of vocabulary token that represents all missing words, then each word $w_i$ is mapped to $\delta_i \in \mathbb{R}^V$. The first component in most neural networks designed for language tasks uses an embedding layer with trainable parameters $W \in \mathbb{R}^{d \times V}$ to map words to fixed length $d$-dimensional vectors $e_i = W \delta_i$, where $e_i \in \mathbb{R}^d$ are the word vectors that are processed by the rest of the network. A large fraction of the parameters in the network is concentrated in $W$, since $V$ often has to be large (up to hundreds of thousands or millions of words) to obtain good performance. Furthermore, by choosing $V$ upfront we are assuming that words or phrases relevant for the classification task are known a priori, which may not be true. It should be noted that though we express the operation to obtain the word vector $e_i$ as a matrix multiplication, in reality it is a look-up of the corresponding row in the embedding matrix, as $\delta_i$ is modeled using the Dirac delta function.
Embeddings via Trainable Projections: Our approach PRADO replaces this embedding with a projection approach to build the word encoder. Instead of mapping $w_i$ to $\delta_i$, we map it to $f_i$ using a projection operator $P$. Recent work (Ravi, 2017; Ravi and Kozareva, 2018; Ravi, 2019) has shown that projection-based neural approaches can help train compact neural networks that achieve good performance on certain language tasks. These networks learn robust representations (Sankar et al., 2019a) that can also be transferred to other tasks (Sankar et al., 2019b).
We follow a similar strategy, but unlike the static projections used in these works, we propose a new type of projection that decomposes the operation and makes the projection trainable, leading to more powerful encoders capable of capturing contextual information for long-text classification while maintaining a very low memory footprint. Our method does not rely on a fixed vocabulary. The projection operator we use in this work first fingerprints the words and extracts $B$ bit features from the fingerprint. The word vectors $e_i$ are obtained using a neural network layer $e_i = \phi(f_i)$. This allows us to generate compact embeddings, train the projection encoder layer better and apply further optimizations like batch normalization (Ioffe and Szegedy, 2015) in this step. First, individual tokens $w_i$ in the input text are fingerprinted using a hashing function to generate $2B$ bits. The projection operator $P$ then maps every consecutive two-bit sequence to the set $\{-1, 0, 1\}$, resulting in a vector $f_i \in \{-1, 0, 1\}^B$. We chose this specific form since it would enable further optimizations such as those described in (Li and Liu, 2016) to reduce the computation in the first layer. We note that there could be alternative modeling choices for the specific form of the projection operator $P$. Any projection operator that maps bits from the fingerprint to a bounded range is expected to perform equally well. $\phi(\cdot)$ is a trainable function with $B \cdot d$ parameters that maps $B$-dimensional projection features into $d$-dimensional word embedding vectors that are computed dynamically during training and inference. In practice, $B \in [128, 512]$ and $d \in [32, 96]$ are tiny compared to $V$.
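As a rough illustration of the projection step, the following minimal sketch fingerprints a token into $2B$ bits and maps each two-bit pair to $\{-1, 0, 1\}$. The text does not specify the hash function or the exact two-bit mapping, so the use of SHAKE-128, the mapping table, the bias-free linear form of $\phi$, and the names `project_token` and `embed` are all assumptions made for this example.

```python
import hashlib


def project_token(token, num_features=128):
    """Map a token to a ternary vector in {-1, 0, 1}^B, with B = num_features."""
    num_bits = 2 * num_features
    num_bytes = (num_bits + 7) // 8
    # Assumption: SHAKE-128 stands in for the unspecified fingerprint hash.
    digest = hashlib.shake_128(token.encode("utf-8")).digest(num_bytes)
    bits = int.from_bytes(digest, "big")
    # Assumption: one possible mapping of a two-bit pair to {-1, 0, 1}.
    pair_to_value = {0b00: 0, 0b01: 1, 0b10: -1, 0b11: 0}
    features = []
    for k in range(num_features):
        pair = (bits >> (2 * k)) & 0b11  # k-th consecutive two-bit sequence
        features.append(pair_to_value[pair])
    return features


def embed(features, weights):
    """Assumed form of phi: a bias-free linear layer with B * d weights.

    `weights` has B rows and d columns; the result is a d-dimensional vector.
    """
    d = len(weights[0])
    return [sum(f * row[j] for f, row in zip(features, weights)) for j in range(d)]
```

Because the features are computed from the token string itself, no vocabulary table is stored; only the $B \cdot d$ weights of the layer are model parameters.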

Convolution & Attention over Projections
Next, we introduce an encoder mechanism to map a sequence of projected word embeddings $e_0, \ldots, e_{T-1}$ to a fixed length vector that represents the entire input text. Many studies use 1d convolutions on the word vectors and perform pooling to reduce the sequence to a fixed length vector. However, we observe that most words or tokens in a sentence are not relevant for a given classification problem. As a result, methods like average pooling after convolutions do not effectively retain the information most relevant to the task, especially when the text contains several sentences. Other pooling methods like max or min do not let gradients flow effectively during backpropagation, making it difficult to train the network. To overcome this, we propose a method that uses convolutions and an attention mechanism over the word projections to generate a fixed length encoding for the input text.
Projected Attention: In our approach, we use two independent convolutional networks for this step. The first one, which we refer to as the projected feature network F, captures the features that are useful for the classification task. This is comparable to the convolution networks used in existing studies, except that we perform convolutions over the sequence of projected word embeddings $e_i$:
\[ F^n_i = \mathrm{conv}(e_i, n, N) \]
where $n$ is the convolution kernel width, $N$ is the number of output channels in the convolution and $F^n_i \in \mathbb{R}^N$. The second one, which we refer to as attention network A, has its own parameters and captures the importance of these features for the task:
\[ a^n_i = \mathrm{conv}(e_i, n, N) \]
We compute softmax over the sequence dimension of the results of A. This provides a distribution over the word sequence that captures the relevance of features at different positions:
\[ A^n_i = \frac{\exp(a^n_i)}{\sum_{j} \exp(a^n_j)} \]
We compute an expectation using distribution A on F that turns the sequence into a fixed length encoding $E^n$ for convolution kernel $n$:
\[ E^n = \sum_{i} A^n_i \odot F^n_i \]
Our pooling scheme reduces to average pooling if $A^n_i$ is uniform over the sequence dimension, and it becomes max or min pooling if $A^n_i$ is a Dirac delta at the position of the maximum or minimum value.
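The attention pooling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the outputs of the feature network F and the pre-softmax outputs of the attention network A are already available as lists of $T$ rows with $N$ channels each, and the names `softmax_over_sequence` and `attention_pool` are hypothetical.

```python
import math


def softmax_over_sequence(scores):
    """Softmax over positions, computed separately for each channel.

    `scores` holds T rows (positions) with N channel values each.
    """
    num_steps, num_channels = len(scores), len(scores[0])
    weights = [[0.0] * num_channels for _ in range(num_steps)]
    for c in range(num_channels):
        column = [scores[i][c] for i in range(num_steps)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for i in range(num_steps):
            weights[i][c] = exps[i] / total
    return weights


def attention_pool(features, scores):
    """Expectation of the features under the attention distribution."""
    weights = softmax_over_sequence(scores)
    num_steps, num_channels = len(features), len(features[0])
    return [
        sum(weights[i][c] * features[i][c] for i in range(num_steps))
        for c in range(num_channels)
    ]
```

With constant scores the weights are uniform and the result equals average pooling; when the scores for a channel are sharply peaked at the position of the largest feature value, the result approaches max pooling for that channel.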

Sequence Convolution Kernels
For the convolution and attention encoder $E^n$ above, we separately apply n-gram kernels of varying sizes $n = 1, 2, 3, \ldots$. In addition to n-grams, we used masked convolution kernels that simulate the effect of skip-grams. The masking effectively zeros out certain entries in the convolution kernel as shown in Figure 2. Each kernel $n$ generates a corresponding fixed length encoding $E^n$ of the input sequence.
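To make the masking idea concrete, here is a small sketch of a masked 1-D convolution. It is a simplification under stated assumptions: each token is represented by a single scalar rather than a $d$-dimensional vector, the mask layout `[1, 0, 1]` for a skip-1-bigram is assumed rather than taken from Figure 2, and `masked_conv1d` is a hypothetical helper name.

```python
def masked_conv1d(sequence, kernel, mask):
    """Valid 1-D convolution (cross-correlation) with a masked kernel.

    Entries of `kernel` where `mask` is 0 are zeroed out, so the
    corresponding positions in each window do not affect the output.
    """
    width = len(kernel)
    effective = [k * m for k, m in zip(kernel, mask)]
    return [
        sum(effective[j] * sequence[i + j] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]


# A width-3 kernel with the middle entry masked behaves like a skip-1-bigram:
# it sees the first and third token of each window and ignores the second.
skip_bigram = masked_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 10.0, 100.0], [1, 0, 1])
trigram = masked_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 10.0, 100.0], [1, 1, 1])
```

Since the masked entry never contributes, the skip-gram kernel produces the same response regardless of which token fills the gap.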

Classification Layer
The convolution kernel width $n$ makes the network react to various word n-grams, with a configurable number of output channels $N$ for each n-gram. We compute various n-gram and skip-gram convolution features and concatenate them to form a fixed length representation $x_k$ for the input document. Finally, we use a feed-forward network with a fully connected layer over the fixed length text encoding for classification.
We train the network with cross entropy loss and apply softmax over the output layer to obtain predicted probabilities $y^C_k$ for each class $C$ during inference.

Data Sets
We evaluate the performance of our approach on large scale document classification tasks, which are widely used in the research community.
• Yelp reviews from the Yelp Challenge (Tang et al., 2015) with ratings from 1 to 5.
• Amazon reviews with ratings from 1 to 5.
• Yahoo Answers, where each document contains the question title, question context and best answer, labeled with one of ten classes.
Table 1 shows the characteristics of each data set. We use the same test sets as (Tang et al., 2015).

Experimental Setting
We set up our experimental evaluation as follows: given a long text classification task and a data set, we construct a model with the hyper-parameters listed in Table 2 and use a hyper-parameter search technique to find the optimal model for each data set. In addition to the parameters listed in Table 2, the search method also looks for the optimal learning rate schedule and regularization scale. For the purpose of hyper-parameter search, we set aside 8% of the training data and use it as a development set. The search method optimizes accuracy on the development set. Once found, we use the optimal model to evaluate accuracy on the test set.

Implementation Details
Unlike prior document classification neural networks (Tang et al., 2015; Yang et al., 2016), which rely on pre-trained word embeddings, our approach PRADO learns the projection weights on the fly during training (i.e., word embeddings or vocabularies do not need to be stored). Prior to learning the projections, we did simple pre-processing that normalized the text to lowercase, introduced blank space before and after punctuation to make sure punctuation marks are treated as separate tokens, and tokenized the text by space.
We used different regularization scales for the initial fully connected layer and the rest of the network, as the majority of the parameters were in the first layer that computes the word embedding vectors on the fly. We used the Adam optimizer (Kingma and Ba, 2014) with an exponential learning rate schedule. For regularization, we used dropout after the first layer and also distorted the input text by randomly inserting, deleting and transposing characters in a token with small probability.
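The pre-processing steps listed above can be sketched as follows. The exact definition of punctuation is not given in the text, so treating every character that is neither a word character nor whitespace as punctuation is an assumption, as is the helper name `preprocess`.

```python
import re


def preprocess(text):
    """Lowercase, isolate punctuation, and split on whitespace."""
    text = text.lower()
    # Assumption: "punctuation" means any character that is neither a
    # word character nor whitespace; pad each one with spaces.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split()


tokens = preprocess("Great food, worst service ever!")
```

Under this assumed rule an apostrophe is also split off, so a contraction such as "don't" yields three tokens.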
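The character-level input distortion can be illustrated with a short sketch. The distortion probability, the choice of one edit per distorted token, and the alphabet used for insertions are not stated in the text and are assumptions here; `distort_token` is a hypothetical helper.

```python
import random
import string


def distort_token(token, prob, rng):
    """With probability `prob`, apply one random character-level edit."""
    chars = list(token)
    if chars and rng.random() < prob:
        op = rng.choice(["insert", "delete", "transpose"])
        if op == "insert":
            # Assumption: inserted characters are lowercase ASCII letters.
            pos = rng.randrange(len(chars) + 1)
            chars.insert(pos, rng.choice(string.ascii_lowercase))
        elif op == "delete" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        elif op == "transpose" and len(chars) > 1:
            pos = rng.randrange(len(chars) - 1)
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)


rng = random.Random(0)
noisy = [distort_token(t, 0.1, rng) for t in ["worst", "service", "ever"]]
```

Because the projection features are computed from the token string, a distorted token generally receives a different fingerprint, which is presumably what makes this kind of noise act as a regularizer.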

Results and Model Comparisons
It is important to recall that the main goal of our work is to develop a fast and efficient on-device neural text classification approach that achieves near state-of-the-art performance while satisfying on-device size and memory constraints. Therefore, it is not fair to directly compare PRADO on-device performance against existing approaches that do inference in the cloud without constraints. Nevertheless, we compare our approach against well established baselines and prior non-on-device work, taking these differences into consideration. Table 3 shows the obtained results for each data set and method.
Baseline Comparison: We use the same baselines as described in (Tang et al., 2015). They are traditional approaches, which rely on hand-crafted features such as bag-of-words with TFIDF and n-grams with TFIDF, and use linear or kernel classifiers. As can be seen in Table 3, our PRADO approach consistently outperforms all baseline methods, with +4.8 to +12.2 for Yelp, +5.9 to +17 for Amazon and +1.3 to +11.8 for Yahoo data sets. This demonstrates the power of trainable projection on-device neural networks and the attention mechanism.
On-device Comparison: We also show a comparison against the prior on-device neural network approach (Ravi and Kozareva, 2018). Their SGNN approach was targeted towards short text classification tasks and, as shown in Table 3, our PRADO model achieves up to +40% improvement over SGNN, demonstrating that PRADO is more powerful and better suited for long text classification.

Model Size: PRADO vs Smaller RNNs
We further compare our PRADO model against smaller-sized variants of widely-used recurrent (LSTM) models. This study helps analyze the effectiveness of PRADO compared to other small neural models and answers the question: can popular RNN models be shrunk down to the same size as PRADO and still achieve high performance? To construct baseline neural models at smaller sizes, we use an LSTM architecture with 64 hidden units as the state size and vary the input vocabulary size (i.e., picking the top $K$ words ordered by frequency) and embedding dimension $d$. Table 4 compares the performance of PRADO with different small and medium-sized LSTM models (obtained by varying $K$ and $d$). Our results show that PRADO achieves the best performance with the lowest footprint (just 175K parameters) and high compression ratios (up to 100x smaller) compared to standard LSTM models.
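The effect of $K$ and $d$ on model size can be illustrated with simple parameter counting. This is a sketch with hypothetical values ($K = 10000$, $d = 32$, $B = 128$), not the configurations evaluated in the paper; it uses the standard LSTM parameter count of four gates with input weights, recurrent weights and a bias, and it ignores the classification layer and PRADO's convolution layers.

```python
def embedding_params(vocab_size, embed_dim):
    """Parameters in a look-up embedding table with K rows of size d."""
    return vocab_size * embed_dim


def lstm_params(input_dim, hidden_units):
    """Standard LSTM cell: four gates, each with input weights,
    recurrent weights and a bias vector."""
    return 4 * (hidden_units * (input_dim + hidden_units) + hidden_units)


def projection_params(num_projection_features, embed_dim):
    """Parameters in a B-to-d projection embedding layer."""
    return num_projection_features * embed_dim


# Hypothetical sizes chosen for illustration only.
table = embedding_params(10000, 32)   # grows linearly with vocabulary size K
cell = lstm_params(32, 64)            # independent of K
proj = projection_params(128, 32)     # independent of K
```

In this illustrative setting the embedding table dominates the LSTM model, which is why shrinking $K$ is the main lever for reducing its size, whereas the projection layer's size does not depend on the vocabulary at all.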

PRADO Analysis and Discussion
N-gram Attention Focus: To better understand our model, we analyze PRADO's attention distribution. We used a trained PRADO model from a particular data set and computed the attention distribution of the projected n-gram features for all samples in the test set. Then, we pull the word or word sequence index with the maximum attention and order the results by their frequency for each class. Table 5 shows examples of the bigrams for 5 and 1 star Yelp reviews. Note that we did not select the vocabulary for the data; the model automatically learned to associate the words with the classes based on the trained projections.
Skipgrams Attention Focus: Similarly, we conduct an analysis of the attention distribution of the skip-1-bigrams channel. Table 5 shows a sample of the most frequent entries for the 5 and 1 star Yelp reviews. The model captures and learns basic regular expressions. For instance, the skip-gram "waste * time" captures "waste your time" or "waste of time". Similarly, "worst * ever" captures "worst food ever" and "worst service ever". Our analysis shows that overall our trainable projection with attention and convolution learns embedding representations that are powerful and capture the semantic similarity of words and phrases. This information helps PRADO during classification.

PRADO Runtime Performance
Our PRADO approach produces compact neural networks with a tiny memory footprint. Next, we show how to further optimize PRADO and speed up on-device inference at runtime.

Training with quantization
We train a PRADO model variant with 8-bit quantization as described in (Jacob et al., 2018). This procedure simulates the quantization process during training by nudging the weights and activations towards a grid of discrete levels ($2^N$ levels, where $N = 8$ is the number of bits). We estimate the activation ranges for each training batch and use an exponential moving average to smooth the quantization ranges across training steps. (Jacob et al., 2018) noted that by training with quantization, they reached similar accuracy with 8-bit models as with floating point ones on several image classification and object detection data sets. For text classification, we observed that training with quantization significantly improves accuracy, as shown in Table 3. We believe that this is due to improved regularization, as quantization has the highest impact on Yelp. This data set has relatively few training samples per class (see Table 1), which causes the model to overfit the training data, and the regularization provided by the operation that simulates quantization during training helps it generalize better. Furthermore, the size in bytes of an 8-bit quantized PRADO model is equal to its number of parameters. Figure 3 shows that PRADO can reach the performance reported in Table 3 with a model size of less than 200 Kilobytes. PRADO starts getting competitive results on the same data sets with a model size as low as 25 Kilobytes.
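The simulated quantization step can be sketched as follows. This is a simplified, scalar illustration of the general idea of snapping values to a grid of $2^N$ levels and smoothing the range with an exponential moving average; it is not the exact procedure of (Jacob et al., 2018), it omits how gradients are passed through the rounding, and the decay value and function names are assumptions.

```python
def fake_quantize(value, range_min, range_max, num_bits=8):
    """Snap `value` to the nearest of 2**num_bits evenly spaced levels
    in [range_min, range_max], returning it as a float."""
    levels = 2 ** num_bits
    step = (range_max - range_min) / (levels - 1)
    clipped = min(max(value, range_min), range_max)
    index = round((clipped - range_min) / step)
    return range_min + index * step


def update_range(ema_min, ema_max, batch_min, batch_max, decay=0.99):
    """Exponential moving average of the observed activation range.

    Assumption: the decay constant is illustrative, not from the paper.
    """
    new_min = decay * ema_min + (1.0 - decay) * batch_min
    new_max = decay * ema_max + (1.0 - decay) * batch_max
    return new_min, new_max
```

Since each quantized weight takes one of 256 values, it fits in a single byte, which is why the model size in bytes matches the parameter count.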

Computational Cost for Inference
We evaluate the computational cost of PRADO models for inference with respect to floating point (or integer) operations and latency (in milliseconds). The number of floating point or integer operations in our model is dominated by the operations from the first fully connected layer and the convolutional layers. Our method scales linearly with the number of time steps $T$ and the dimension of the projected word embedding $d$. We measured the latency of processing a document with 1000 words using our quantized PRADO model on a Nexus 5X mobile phone to be between 20 and 40 ms.

Transfer Learning with PRADO
Recent popularity of several pre-trained word embedding approaches can be primarily attributed to their success and effectiveness at transfer learning for language tasks. Model representations trained on a data-rich domain can be leveraged for tasks and domains in limited-data scenarios. However, as we discussed earlier, these methods require storing and looking up huge pre-trained embedding tables unlike our PRADO approach which results in compact models. Next, we evaluate the effectiveness of PRADO at learning robust representations and extend it for transfer learning scenarios. To establish a baseline, we took 10% of Yelp training data and trained and evaluated a baseline PRADO model with random initialization of weights. We compared this to initializing the parameters from a PRADO model trained on the larger Amazon data set. We ran two different experiments with this initialization.
• Experiment A: Full Transfer. We freeze all weights in PRADO except the last fully connected classification layer.
With this setup, we trained the baseline and transfer-learned variants with and without quantization. Figure 4 shows the results of the transfer learning runs. We observe that with random initialization, the baseline PRADO model reaches a peak performance of around 57% and starts overfitting on the small Yelp training data set both with and without quantization. When transferring just the projection layer and fine-tuning the convolution and classification layers, training converges quickly and reaches better peak accuracy. This demonstrates that the PRADO projection embeddings trained in one domain, even though tiny in size, capture useful information that can be leveraged to improve classification in another domain with limited training data. This is further improved when transferring the full PRADO model and fine-tuning just the last layer. In this case (Experiment A), the model converges to 61% and 60% accuracy with and without quantization respectively. The training also converges more quickly than the baseline and no longer overfits. We note that using our approach, a PRADO model transfer-learned on just 10% of the training data achieves very competitive performance, resulting in less than 10% drop in relative accuracy on the Yelp data set (see Table 3).

Conclusion
In this paper, we proposed a novel trainable projection on-device neural network with attention, which is capable of capturing long range dependencies, making it flexible for solving long text classification tasks. We introduced a trainable projection technique and showed, via visualization of the attention mechanism, that it effectively captures the semantic representation of the document while still saving on storage and producing compact models. We demonstrated the effectiveness of our approach PRADO by conducting experiments on multiple large scale document classification tasks. The obtained results show that our approach improves upon traditional linear classifiers by 2% to 12%, and upon character and word level CNN and LSTM neural approaches by 1% to 6%, depending on the data set, task and approach. This is notable given the small, compact model produced by PRADO. Similarly, the quantized version of our approach had consistent performance. Finally, we applied our model in a transfer learning scenario, which demonstrated the robustness of our approach and its ability to further improve performance in limited data scenarios. In the future, we want to extend this approach to more natural language tasks and languages.