On-device Structured and Context Partitioned Projection Networks

A challenging problem in on-device text classification is to build highly accurate neural models that can fit in a small memory footprint and have low latency. To address this challenge, we propose SGNN++, an on-device neural network which dynamically learns compact projection vectors from raw text using structured and context-dependent partitioned projections. We show that this results in accelerated inference and performance improvements. We conduct extensive evaluations on multiple conversational tasks and on languages such as English, Japanese, Spanish and French. Our SGNN++ model significantly outperforms all baselines, improves upon existing on-device neural models and even surpasses RNN, CNN and BiLSTM models on dialog act and intent prediction. Through a series of ablation studies we show the impact of the partitioned projections and structured information, leading to a 10% improvement. We study the impact of the model size on accuracy and introduce quantization-aware training for SGNN++ to further reduce the model size while preserving the same quality. Finally, we demonstrate fast inference on mobile phones.


Introduction
Over the last few years, the usage of conversational assistants has become extremely popular. On a daily basis, people request weather information, check calendar appointments and make calls. A large part of the conversational and natural language understanding happens on the server side and is then fulfilled remotely, resulting in response delays, inconsistent experiences and privacy concerns. Therefore, there is a huge demand for natural language models that work entirely on devices such as mobile phones, tablets, watches and internet of things (IoT) devices. On-device computation can circumvent latency delays, can increase user privacy and further enables new capabilities for real-time interaction.
One way to develop on-device natural language understanding is to leverage the power of deep neural networks, which over the years have shown tremendous progress and have improved upon state-of-the-art machine learning methods in Natural Language Processing (NLP) (Sutskever et al., 2014), Speech and Vision (Krizhevsky et al., 2012). These advancements were byproducts of the availability of large amounts of data and high performance computing, enabling the development of more complex and robust neural network architectures. However, despite their success, it remains challenging to deploy deep networks on devices such as mobile phones, smart watches and IoT devices. The limited memory and computation power, combined with the need for low latency, require the development of novel on-device neural networks.
Inspired by (Ravi and Kozareva, 2018), we propose a novel on-device neural network (SGNN++) that uses joint structured (word+character) information and context partitioned projections to learn robust models for short text classification. We employ a modified version of locality sensitive hashing (LSH) to reduce the input dimension from millions of unique words/features to a short, fixed-length sequence of bits (Ravi, 2017, 2019). This allows us to compute a projection for an incoming text very fast, on the fly, with a small memory footprint on the device, without storing the incoming text or any word embeddings.
Unlike prior work that focused on developing the best neural network for a specific NLP task and language, we develop one SGNN++ architecture with the same parameters and apply it to a wide range of tasks and languages such as English, French, Spanish and Japanese. Our experimental results show that SGNN++ improves upon baselines, prior on-device state-of-the-art and even non-on-device RNN, CNN and BiLSTM methods. The main contributions of our paper are:
• Novel embedding-free SGNN++ on-device neural model with quantization, and joint structured and context partitioned projections.
• Novel context partitioned projections that result in a small memory footprint with better performance and speedup.
• First on-device model evaluated on a wide range of applications such as dialog act, intent prediction, customer feedback.
• First on-device model evaluation on English, Spanish, French and Japanese languages demonstrating the language agnostic power of SGNN++ .
• Comparison against prior on-device state-of-the-art neural models, which SGNN++ significantly improves upon across multiple tasks.
• Ablation studies that show the impact of word vs. joint word and character representations on accuracy; the power of the partitioned projection vectors on speed and inference; the ability of SGNN++ to compress large models while still maintaining high accuracy; and the fast latency of the on-device model.

On-device Partitioned Projection Networks (SGNN++ )
We propose new on-device neural network architectures for NLP inspired by projection model architectures (Ravi, 2017, 2019). The projection model is a neural network with dynamically-computed layers that encodes a set of efficient-to-compute operations which can be performed directly on device for inference. Unlike prior work that employs projections (Ravi and Kozareva, 2018), our new model defines a set of efficient structured and context-dependent "projection" functions P_C(x_i) that progressively transform each input instance x_i to a different space Ω_{P_C} and then performs learning in this space to map it to corresponding outputs y_i. The model applies dynamically-computed projection functions that are conditioned on context in multiple ways to achieve higher discriminative power (for classification tasks) and better efficiency w.r.t. memory footprint and speedup. Firstly, we introduce a joint structured projection model that uses language structure to project word and character information in each input instance separately into sub-spaces Ω_Pw and Ω_Pc of the projection space Ω_P and combines them during learning. Secondly, we introduce context-partitioned projection functions P_{C_k}(x_i) that leverage the feature-context hierarchy to partition the projection space Ω_P based on context type. Both these methods enable learning powerful compact neural networks that achieve high performance and fast inference with a low memory footprint.

SGNN++ Architecture
Our on-device projection partitioned neural network architecture is a deep multi-layered context-dependent locality-sensitive projection model. Figure 1 shows the model architecture. The neural model uses projections (Ravi, 2017, 2019), making it an embedding-free approach, i.e., the model can be learned without the need to initialize, load or store any feature or vocabulary weight matrices. This is different from the majority of the widely-used state-of-the-art deep learning techniques in NLP, whose performance depends on embeddings pre-trained on large corpora. In this work, we also introduce novel joint structured projections and context partitioned projection spaces that result in highly efficient and compact neural network models for on-device applications. We will also show how SGNN++ yields significant improvements over prior work (Ravi and Kozareva, 2018) and reaches state-of-the-art on multiple NLP tasks and languages.

Model Overview
In this work, we focus on short text classification. Each input x_i contains a sequence of tokens, where x_it represents the t-th token in the input. The proposed SGNN++ model progressively projects each raw input text x_i to an efficient vector representation i_p and then learns a classifier to map x_i to the output class y_i.
The raw input text x_i is first converted to an intermediate feature vector F(x_i) using raw text features such as skip-grams. The projection i_p for x_i is then computed by applying a series of T context-partitioned projection functions P^1, ..., P^T on the intermediate sparse feature vector. Details of the projections and computation for SGNN++ are described as follows:

i_p = [P^1(x_i), ..., P^T(x_i)]

where P^j(x_i) refers to the output from the j-th projection function. This is followed by a stack of additional layers and non-linear activations to create deep, non-linear combinations of projections that permit the network to learn complex mappings from inputs x_i to outputs y_i:
h_p = σ(W_p · i_p + b_p)
h_t = σ(W_t · h_{t-1} + b_t)
y_i = softmax(W_o · h_{p+L-1} + b_o)

where h_p is computed directly from the projection output, h_t is applied at intermediate layers of the network with depth k, followed by a final softmax activation layer at the top. In an L-layer SGNN++, h_t, where t = p, p+1, ..., p+L−1, refers to the L subsequent layers after the projection layer. W_p, W_t, W_o and b_p, b_t, b_o represent trainable weights and biases respectively. The projection transformations use pre-computed parameterized functions, i.e., they are not trained during learning, and their outputs are concatenated to form the hidden units for subsequent operations.
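The layer equations above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the layer sizes are made up, ReLU is an assumed choice for the activation σ, and the projection input is mocked as a random bit vector.

```python
import math
import random

def sgnn_forward(i_p, weights, biases):
    # h_p = relu(W_p . i_p + b_p), then intermediate h_t layers,
    # then a final softmax output layer.
    h = i_p
    for layer, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(w * x for w, x in zip(row, h)) + bi
             for row, bi in zip(W, b)]
        if layer < len(weights) - 1:
            h = [max(0.0, v) for v in z]          # hidden layers (ReLU assumed)
        else:
            m = max(z)                            # numerically stable softmax
            e = [math.exp(v - m) for v in z]
            s = sum(e)
            h = [v / s for v in e]
    return h

random.seed(0)
dims = [8, 16, 3]   # projection bits -> hidden units -> output classes
Ws = [[[random.uniform(-0.1, 0.1) for _ in range(i)] for _ in range(o)]
      for i, o in zip(dims[:-1], dims[1:])]
bs = [[0.0] * o for o in dims[1:]]
# A mock 8-bit projection output stands in for i_p.
probs = sgnn_forward([random.choice([0.0, 1.0]) for _ in range(8)], Ws, bs)
```

Note that only the W and b tensors are trainable; the projection functions producing i_p are fixed, which is what keeps the model embedding-free.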

Joint Structured Projection Network
Unlike prior work that employs projections (Ravi and Kozareva, 2018), we make an important observation that input instances x_i are drawn from natural language rather than random continuous vectors and thereby encode some inherent structure: for example, sentences contain sequences of words, and words contain characters. This motivates us to leverage the underlying linguistic structure in the input and build a hierarchical projection model from the raw text in a progressive fashion, rather than taking a one-shot projection approach. We define a joint structured projection model (SGNN++). The model jointly combines word and character-level context information from the input text to construct the language projection layer.

Word Projections
Given an input x_i with t words, we first project the sequence x_i to word projection vectors. We use word-level context features (e.g., phrases and word-level skip-grams) extracted from the raw text to compute the intermediate feature vector x_w = F_w(x_i) and compute the projections.
We reserve ℓ of the T projection functions (and hence ℓ · d bits) to capture the word projection space, computed using a series of functions P^1_w, ..., P^ℓ_w. The functions project the sentence structure into a low-dimensional representation that captures similarity in the word-projection space (Sankar et al., 2019).
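As a concrete sketch of the word-level feature step F_w, the following extracts skip-grams in the spirit of the configuration described later (3-grams with skip-size 2). The exact feature definition is an assumption; the paper only names the feature family.

```python
from itertools import combinations

def word_skipgrams(tokens, n=3, k=2):
    # All n-token subsequences whose indices span at most n + k - 1
    # positions, i.e., n-grams with up to k skipped tokens in total.
    feats = []
    for idxs in combinations(range(len(tokens)), n):
        if idxs[-1] - idxs[0] < n + k:    # limit total skip to k
            feats.append(" ".join(tokens[i] for i in idxs))
    return feats

feats = word_skipgrams("please book a cheap flight to boston".split())
```

Contiguous trigrams like "please book a" are included, as are gapped ones like "please a flight"; subsequences spanning more than k skips (e.g., "please flight boston") are excluded.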

Character Projections
Given the input text x_i, we can capture morphological (character-level) information in a similar way. We use character-level context features (e.g., character-level skip-grams), again extracted directly from the raw text, to compute x_c = F_c(x_i) and compute the character projections i_pc.
The character feature space, and hence the projections i_pc, are reserved and computed separately. Note that even though we compute separate projections for character-level context, the SGNN++ model re-uses the remaining T − ℓ projection functions for this step and hence keeps the overall space and time complexity of the projections directly ∝ T.
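A minimal sketch of a character-level feature extractor F_c is shown below. Plain character n-grams with word-boundary markers are used here as a stand-in; the paper's actual character features (skip-grams over characters) may differ, and the boundary-marker convention is our assumption.

```python
def char_ngrams(text, n=3):
    # Character n-grams per word, with ^/$ marking word boundaries
    # (an assumed convention, not specified in the paper).
    feats = []
    for word in text.split():
        padded = "^" + word + "$"
        for i in range(len(padded) - n + 1):
            feats.append(padded[i:i + n])
    return feats

cf = char_ngrams("book a flight")
```

These features are hashed and projected exactly like word features, just into the separately reserved character partition of the projection space.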

Joint Structured Model and Extension
We then combine these into i_p = [i_pw ; i_pc] for the joint structured projection model, as shown in Figure 1. The projection functions dynamically transform each input text x_i to a low-dimensional representation i_p via context-dependent projection spaces that jointly capture word and character information in a succinct representation. The joint structured projections are followed by a stack of additional layers that jointly learn non-linear combinations of these projection spaces to build the classifier.
The choice of intermediate features used for projections can be flexible and different for F_w and F_c. For example, we could apply stemming or extract other morphological features for computing i_pc. Similarly, we could use syntax information from Part-of-Speech tags or constituency parses at the sentence level for computing i_pw. However, these features might not be available on device to perform inference; e.g., syntax features require an additional tagging or parsing model to be loaded on device, which incurs additional complexity and latency. Hence, for efficiency and simplicity, we use the same type of raw features (e.g., skip-grams) for both word and character-level projections.

Context Partitioned Projection Network
In the SGNN++ model, we further leverage the feature-context type information to introduce an additional level of hierarchy in the network. The motivation is as follows: we use locality-sensitive projections for the projection(.) step to transform input text to a low-dimensional representation. Incorporating global information, via context-dependent projections, enables the model to vary the language projections and encode them separately based on feature type. We use this to avoid collisions in the projected space between different feature types (e.g., unigrams vs. bigrams) and also to help the neural network learn the importance of specific types of projections based on the classification task, rather than pooling them together and fixing this a priori.
We achieve this by introducing context-partitioned projections in SGNN++, i.e., we partition the overall projection space into sub-partitions based on context type. Let C_K denote the type of intermediate features extracted via F, where C_1 = unigrams, C_2 = bigrams, and so on. Both word and character-level outputs i_pw, i_pc (described earlier) are generated using context-partitioned projections, i.e., each projection space Ω_P is partitioned into sub-spaces Ω_{P_{C_k}} based on context type. The type of context used to represent the input text determines the function choice and size of the sub-partitions, and thereby the number of corresponding bits reserved in the projection outputs i_pw and i_pc:

i_p = [P^1_{C_1}(x_i), ..., P^{ℓ_1}_{C_1}(x_i), ..., P^1_{C_maxK}(x_i), ..., P^{ℓ_maxK}_{C_maxK}(x_i)]

where C_K denotes a specific type of context-feature extracted from the input and P^1_{C_K}, ..., P^{ℓ_K}_{C_K} denote the projection functions applied to the input for context type C_K. max_K is the total number of context types and ℓ_K is the number of projection functions in the partition reserved for C_K, and hence determines the number of output bits reserved in the projection output.
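The partitioning scheme above can be sketched as follows. Each context type (unigrams, bigrams, ...) gets its own reserved block of output bits, computed by partition-specific hash-based projection functions. The bit function here (a hash parity over the features in the partition) is a simplified stand-in for the sign-of-dot-product LSH projection described later; the seed scheme is our assumption.

```python
import hashlib

def hash_bit(feature, seed):
    # One pseudo-random bit for a (seed, feature) pair; a stand-in
    # for an entry of a seeded random projection vector.
    digest = hashlib.md5(("%d:%s" % (seed, feature)).encode()).digest()
    return digest[0] & 1

def partitioned_projection(features_by_context, bits_per_partition):
    # Concatenate one reserved block of bits per context type C_k;
    # features of one type never collide with features of another.
    out = []
    for k, feats in enumerate(features_by_context):
        for j in range(bits_per_partition):
            seed = k * 1000 + j          # partition-specific function P^j_{C_k}
            out.append(sum(hash_bit(f, seed) for f in feats) % 2)
    return out

toks = "book a flight to boston".split()
contexts = [toks,                                      # C_1: unigrams
            [" ".join(p) for p in zip(toks, toks[1:])]]  # C_2: bigrams
bits = partitioned_projection(contexts, bits_per_partition=4)
```

Because each block only touches features of its own context type, the per-partition intermediate vectors are smaller and the blocks can be computed in parallel, which is the source of the speedup discussed next.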
Effect of Partitioned Projections: Partitioning the projection space has a significant effect on both memory and time complexity. It results in a significant speedup for the projection network, both during training and inference, since the overall size of the intermediate feature context vectors F (per type) is smaller; hence fewer operations are required to compute each projection output, and these can be computed in parallel. Also, in SGNN++ the overall projection complexity does not increase, since we keep T fixed (Σ_j ℓ_j = T, summed over the word and character partitions). Moreover, the context partitioned SGNN++ neural network uses the global context information to efficiently decompose and learn projections from different contexts and combine them effectively for the classification task.

q-SGNN++: Compressing the Model Further
We also learn hardware-optimized variants of SGNN++ using quantized training, similar to (Jacob et al., 2017). This permits fast 8-bit arithmetic operations in the model, achieving a further 4x reduction in overall model size and improved latency. Both SGNN++ and q-SGNN++ can run efficiently on edge devices and support inference through the TensorFlow Lite open-source library.

Computing Projections on-the-fly
We employ an efficient randomized projection method for each projection(.) step. We use locality sensitive hashing (LSH) (Charikar, 2002) to model the underlying projection operations in SGNN++. Equation 1 applies F to dynamically extract features from the raw input text. Text features (e.g., skip-grams) at the word and character level are converted into 64-bit feature-ids f_j (via hashing) to generate a sparse feature representation x_i of feature-id, weight pairs (f_m, w_m). For the projection(.) step (Equation 4), a projection vector P^j is first constructed on the fly using a hash function with the feature-ids f_m ∈ x_i and a fixed seed j as input; then the dot product of the two vectors, <x_i, P^j>, is computed and transformed into a binary representation P^j(x_i) using the sign of the dot product.
As shown in Figure 1, both the F_{w,c} and P_{w,c} steps are computed on the fly, i.e., no word/character embeddings or vocabulary/feature matrices need to be stored and looked up during training or inference. Instead, feature-ids and projection vectors are dynamically computed via hash functions. For the intermediate feature weights w_m, we use observed counts in each input text and do not use precomputed statistics like idf. Hence the method is embedding-free.
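The on-the-fly projection described above can be sketched end to end. This is a simplified illustration under stated assumptions: md5 stands in for whatever hash family the implementation uses, the projection entries are ±1 signs derived from the hash, and weights are raw observed counts as in the text.

```python
import hashlib
from collections import Counter

def feature_id(feat):
    # 64-bit feature id via hashing; no vocabulary table is stored.
    return int.from_bytes(hashlib.md5(feat.encode()).digest()[:8], "big")

def sign_entry(fid, seed):
    # Entry of the projection vector P^j for feature fid, computed
    # on the fly from the fixed seed j; values in {-1, +1}.
    digest = hashlib.md5(("%d:%d" % (seed, fid)).encode()).digest()
    return 1 if digest[0] & 1 else -1

def project(features, num_bits):
    # Sparse (feature_id, weight) pairs; weights are observed counts.
    x = Counter(feature_id(f) for f in features)
    out = []
    for j in range(num_bits):                     # one bit per seed j
        dot = sum(w * sign_entry(fid, j) for fid, w in x.items())
        out.append(1 if dot >= 0 else 0)          # sgn(<x_i, P^j>)
    return out

proj_bits = project("book a flight to boston".split(), num_bits=16)
```

Nothing here is looked up in a stored matrix: both the feature ids and the projection entries are recomputed from hashes at inference time, which is what keeps the memory footprint small.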

Model Parameters
SGNN++ uses a total of T different projection functions P^j, j = 1, ..., T, each resulting in a d-bit vector; these are concatenated to form the projected vector i_p in Equation 11. T and d can be tuned to trade off between prediction quality and model size of the SGNN++ network. For the intermediate feature step F in Equations 1, 9 and 11, we use skip-gram features (3-grams with skip-size 2) extracted from the raw text for both word and character projections. We set ℓ = T/2 in Equation 9, i.e., the joint structured model (described in Section 2.3) reserves half the projection space (T/2 · d bits) for word projections and the remaining half for character projections. The choice of features also determines the size of the context-dependent sub-partitions within each projection space; for example, if we choose features with up to 3-gram context, then max_K = 3 and we compute 3 projection sub-partitions for C_1, C_2, C_3 in Equation 14.
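The bit-budget arithmetic implied by this configuration is easy to verify. T = 80 and d = 14 are taken from the ablation section as example values; the even word/character split follows ℓ = T/2.

```python
# Projection-layer bit budget: T functions of d bits each,
# half reserved for word projections and half for characters.
T, d = 80, 14                 # example values from the ablation study
word_bits = (T // 2) * d      # l = T/2 functions for the word space
char_bits = (T // 2) * d      # remaining T - l functions for characters
total_projection_bits = word_bits + char_bits
```

The total projection width is T · d bits regardless of how the functions are split across partitions, which is why partitioning changes speed and collisions but not the representation size.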

Training, Inference and Optimization
SGNN++ is trained from scratch on the task data using a supervised loss defined w.r.t. the ground truth ŷ_i: L(.) = Σ_{i∈N} cross-entropy(y_i, ŷ_i). During training, the network learns to choose and combine context-dependent projection operations that are more predictive for a given task. SGNN++ uses language projections to transform the input into compact bit vectors. This yields a drastically lower memory footprint, both in terms of the number and size of parameters and in computation cost.
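The supervised loss above is standard cross-entropy summed over the examples in a batch; a minimal sketch (with made-up labels and predictions purely for illustration):

```python
import math

def supervised_loss(batch):
    # L = sum_i cross-entropy(y_i, y_hat_i), where y_i is a one-hot
    # label vector and y_hat_i the predicted class distribution.
    total = 0.0
    for y, y_hat in batch:
        total += -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)
    return total

loss = supervised_loss([([0, 1, 0], [0.1, 0.8, 0.1]),
                        ([1, 0, 0], [0.7, 0.2, 0.1])])
```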
During training, the network learns to move the gradients for points that are nearby to each other in the projected bit space Ω P in the same direction. SGNN++ is trained end-to-end using backpropagation. Training can progress efficiently with stochastic gradient descent with distributed computing on high-performance CPUs or GPUs.

Complexity
The overall complexity for inference with the SGNN++ model depends on the projection layer, O(n · T · d), where n is the observed feature size (not the overall vocabulary size), which is linear in the input size, d is the number of LSH bits specified for each projection vector P^j, and T is the number of projection functions used. However, each partitioned projection operation in the model is much faster in practice than a non-partitioned projection, since it depends on the size of the intermediate vectors, which are partitioned by context and smaller in size. The model size (in terms of number of parameters) and memory storage required for the projection inference step is O(T · d · C), where C is the number of hidden units in h_p in the multi-layer projection network, and is typically smaller than T · d.

Datasets & Tasks
We evaluate our on-device SGNN++ model on four NLP tasks in languages such as English, Japanese, Spanish and French. The datasets were selected so we can compare against prior on-device work (Ravi and Kozareva, 2018) and also test the language-agnostic capabilities of SGNN++:
• MRDA: Meeting Recorder Dialog Act is a dialog corpus of multiparty meetings annotated with 6 dialog acts (Adam et al., 2003; Shriberg et al., 2004).
• SwDA: Switchboard Dialog Act is a popular open domain dialog corpus between two speakers with 42 dialog acts (Godfrey et al., 1992; Jurafsky et al., 1997).
• ATIS: Intent Understanding is a widely used corpus in the speech and dialog community (Tür et al., 2010) for understanding different intents during flight reservation.
• CF: Customer Feedback is a multilingual customer feedback analysis task that aims at categorizing customer feedback as "comment", "request", "bug", "complaint", "meaningless", or "undetermined". The data is in English (EN), Japanese (JP), French (FR) and Spanish (SP). Table 1 shows the characteristics of each task: language, number of classes, training and test data.

Experimental Setup & Parameter Tuning
Our experimental setup is as follows: given a classification task and a dataset, we generate an on-device model. For each task, we report Accuracy on the test set. Unlike prior work that aims at finding the best configuration for a given dataset or task, we use the same on-device architecture and settings across all datasets and tasks. We use a 2-layer SGNN++, a mini-batch size of 100, a dropout rate of 0.25, and a learning rate initialized to 0.025 with cosine annealing decay (Loshchilov and Hutter, 2016). We do not do any additional dataset-specific tuning or processing. Training is performed over shuffled mini-batches with the Adam optimizer (Kingma and Ba, 2014).

Experimental Results
This section covers the multiple experiments we have conducted. Table 2 shows the results on the different NLP tasks and languages. Overall, SGNN++ consistently outperformed all baselines, improved upon prior on-device state-of-the-art work (Ravi and Kozareva, 2018) and even outperformed non-on-device state-of-the-art RNN, CNN and BiLSTM models for the MRDA, SWDA, ATIS and CF tasks.

Comparison with Baselines
For each task, we compared SGNN++ against well established baselines. MRDA and SWDA use a Naive Bayes classifier (Lee and Dernoncourt, 2016), which our SGNN++ model outperformed by 14 to 41%. ATIS uses a majority baseline, which SGNN++ outperformed by 21.51%. The CF baseline uses trigrams to find the most similar annotated sentences to the input and assigns their label as the final prediction. SGNN++ consistently outperformed the CF similarity baselines by 16.2%, 17.66%, 16.18% and 6.69% for EN, JP, FR and SP respectively.

Comparison with Non-On-Device Work
The characteristics of on-device models are a low memory footprint and low latency. Therefore, a direct comparison of an on-device model against cloud-based neural networks might not be fair, due to the resource constraints on on-device models. But we wanted to showcase that despite such constraints, our SGNN++ learns powerful neural networks that are competitive with, and can even outperform, widely used approaches like RNNs and CNNs with huge numbers of parameters and pre-trained word embeddings. Another reason such a comparison might not be fair is that prior work focused mostly on creating the best model for a specific task with a lot of fine-tuning and additional resources like pre-trained embeddings, whereas we use the same SGNN++ architecture and parameters across multiple tasks and languages.
Taking these major differences into consideration, we still compare results against prior non-on-device state-of-the-art neural networks. As shown in Table 2, only (Khanpour et al., 2016; Ortega and Vu, 2017; Lee and Dernoncourt, 2016) have evaluated on more than one task, while the rest of the methods target a specific one. We denote with − models that do not have results for the task. SGNN++ is the only approach spanning multiple NLP tasks and languages.
On the Dialog Act MRDA and SWDA tasks, SGNN++ outperformed deep learning methods like CNN (Lee and Dernoncourt, 2016), RNN (Khanpour et al., 2016) and RNN with gated attention (Tran et al., 2017) and reached the best results of 87.3% and 88.43% accuracy.
For Intent Prediction, SGNN++ also improved by 0.13%, 1.13% and 2.63% over the gated attention model (Goo et al., 2018), the joint slot and intent BiLSTM model (Hakkani-Tur et al., 2016) and the attention-based slot and intent RNN (Liu and Lane, 2016) on the ATIS task. This is very significant, given that (Goo et al., 2018; Hakkani-Tur et al., 2016; Liu and Lane, 2016) used joint models to learn the slot entities and types and used this information to better guide the intent prediction, while SGNN++ does not have any additional information about slots, entities and entity types.
Overall, SGNN++ achieves impressive results given the small memory footprint and the fact that it did not rely on pre-trained word embeddings like (Hakkani-Tur et al., 2016;Liu and Lane, 2016) and used the same architecture and model parameters across all tasks and languages. We believe that the dimensionality-reduction techniques like locality sensitive context projections jointly coupled with deep, non-linear functions are effective at dynamically capturing low dimensional semantic text representations that are useful for text classification applications.

Ablation Studies
In this section, we show multiple ablation studies focusing on: (1) the impact of partitioned projections and joint structured representations on accuracy; (2) the impact of model size on accuracy, including a quantized version of SGNN++ which reduces model size while preserving the same quality; and (3) SGNN++ latency.

Impact of Joint Structured & Context Partitioned Projections on Accuracy

Our SGNN++ model uses joint structured (word+character) and context partitioned projections. We want to show the impact of the joint structure (word+character) vs. word only, as well as the impact of partitioned vs. non-partitioned projections. Table 3 shows the obtained results on the ATIS intent prediction dataset. First, using joint structured (word+character) information leads to significantly better performance compared to word only: for instance, +9% for non-partitioned projections and +3.9% for partitioned projections. Second, a significant improvement is seen when using partitioned vs. non-partitioned projections: +6.14% for word and +1% for word+character. Overall, the novel joint structured and context partitioned projections we introduced in our SGNN++ model improve performance by +10.06% compared to models using only word and non-partitioned projections.

It is important to note that in addition to the accuracy improvements, SGNN++ partitioned projection models are also significantly faster for inference and training (up to 3.3X). For example, using T = 80, d = 14 and bigram word features (max_K = 2) for a 10-word sequence requires 80 × 14 × 6 = 6720 multiply-add operations for partitioned projections compared to 80 × 14 × 19 = 21280 for the non-partitioned model.

Accuracy vs Model Size
It is easy to customize our model for different devices such as watches, phones or IoT devices with different size constraints. To showcase this, we show results with varying projection sizes and network parameters. Furthermore, we also trained quantized versions of our SGNN++ model, denoted qSGNN++, which achieve additional model size reduction while maintaining high accuracy. Figure 2 shows the obtained results on the ATIS dataset. Each data point in the figure represents a SGNN++ or qSGNN++ model trained with a specific partition projection parameter configuration. We show the model size and the accuracy achieved for that size. Overall, SGNN++ models achieve high accuracy even at small sizes. For instance, a 100KB model yields 82.87% accuracy compared to a 2.5MB model that yields 94.74%. For a given SGNN++ model, we can further reduce the size with little performance degradation by applying quantization-aware training. For instance, the SGNN++ 107KB model (T = 5, d = 14) yields 82.87%, but can be further compressed to a qSGNN++ model with 33KB and 80.18% accuracy. Taking our model to the extreme, we are able to train a qSGNN++ model with an extremely tiny size of 7KB (T = 3, d = 14), while still achieving 77.16%.
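A back-of-the-envelope check of the quantization numbers: storing weights in 8 bits instead of 32 gives the stated 4x reduction on the weight tensors. The reported 33KB qSGNN++ model is somewhat larger than a pure 4x reduction of 107KB would suggest, presumably because some tensors and metadata remain unquantized; that explanation is our assumption.

```python
# 8-bit quantization shrinks weight storage by 32/8 = 4x.
float_model_kb = 107                     # SGNN++ model size from the text
quantized_weight_kb = float_model_kb * 8 / 32
```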

Model Latency
In addition to being small and highly accurate, an on-device model has to be fast. We measure the latency of our on-device SGNN++ model on a Pixel phone. Given an input text, we measure the inference time on the Pixel device and report the average latency. On the ATIS dataset, SGNN++ reaches 93.73% accuracy with an average latency of 3.35 milliseconds. This shows that our SGNN++ model is compact, highly accurate and has low latency (i.e., it is very fast).

Conclusion
We proposed an embedding-free on-device neural network that uses joint structured and context partitioned projections for short text classification. We conducted experiments on a wide range of NLP applications such as dialog acts, intent prediction and customer feedback. We evaluated the approach on four languages, showing the language-agnostic capability of our on-device SGNN++ model. We used the same model architecture and parameter settings across all languages and tasks, which demonstrates the generalizability of this approach compared to prior work that built custom models. Overall, our SGNN++ approach outperformed all baselines by 14 to 41%, improved upon state-of-the-art on-device work (Ravi and Kozareva, 2018) by up to 5%, and also outperformed non-on-device neural approaches (Hakkani-Tur et al., 2016; Liu and Lane, 2016; Dzendzik et al., 2017; Elfardy et al., 2017). Through multiple ablation studies, we showed the impact of partitioned projections on accuracy and the impact of model size on accuracy. We trained quantized versions of SGNN++, showing that we can further reduce the model size while preserving quality. Finally, we showed the fast latency of SGNN++ on a Pixel phone.