Self-Governing Neural Networks for On-Device Short Text Classification

Deep neural networks reach state-of-the-art performance for a wide range of natural language processing, computer vision and speech applications. Yet, one of the biggest challenges is running these complex networks on devices such as mobile phones or smart watches with a tiny memory footprint and low computational capacity. We propose on-device Self-Governing Neural Networks (SGNNs), which learn compact projection vectors with locality sensitive hashing. The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge numbers of parameters. We conduct extensive evaluation on dialog act classification and show significant improvement over state-of-the-art results. Our findings show that SGNNs are effective at capturing low-dimensional semantic text representations, while maintaining high accuracy.


Introduction
Deep neural networks are among the most successful machine learning methods, outperforming many other machine learning methods in natural language processing (Sutskever et al., 2014), speech and visual recognition tasks (Krizhevsky et al., 2012). The availability of high performance computing has enabled research in deep learning to focus largely on the development of deeper and more complex network architectures for improved accuracy. However, the increased complexity of deep neural networks has become one of the biggest obstacles to deploying them on devices such as mobile phones, smart watches and IoT devices (Iandola et al., 2016). The main challenges in developing and deploying deep neural network models on-device are (1) the tiny memory footprint, (2) inference latency and (3) significantly lower computational capacity compared to high performance computing systems such as CPUs, GPUs and TPUs on the cloud.
There are multiple strategies to build lightweight text classification models for on-device use. One can create a small dictionary of common input → category mappings on the device and use a naive look-up at inference time. However, such an approach does not scale to complex natural language tasks involving rich vocabularies and wide language variability. Another strategy is to employ fast sampling techniques (Ahmed et al., 2012; Ravi, 2013) or incorporate deep learning models with graph learning (Bui et al., 2017, 2018), which result in large models but have proven to be extremely powerful for complex language understanding tasks like response completion (Pang and Ravi, 2012) and Smart Reply (Kannan et al., 2016).
In this paper, we propose Self-Governing Neural Networks (SGNNs) inspired by projection networks (Ravi, 2017). SGNNs are on-device deep learning models learned via embedding-free projection operations. We employ a modified version of locality sensitive hashing (LSH) to reduce the input dimension from millions of unique words/features to a short, fixed-length sequence of bits. This allows us to compute a projection for an incoming text very fast, on the fly, with a small memory footprint on the device, since we do not need to store the incoming text and word embeddings. We evaluate the performance of our SGNNs on dialog act classification, because (1) it is an important step towards dialog interpretation and conversational analysis, aiming to understand the intent of the speaker at every utterance of the conversation, and (2) deep learning methods have reached state-of-the-art results on it (Lee and Dernoncourt, 2016; Khanpour et al., 2016; Tran et al., 2017; Ortega and Vu, 2017).
The main contributions of the paper are:
• Novel Self-Governing Neural Networks (SGNNs) for on-device deep learning for short text classification.
• Compression technique that effectively captures low-dimensional semantic text representation and produces compact models that save on storage and computational cost.
• On the fly computation of projection vectors that eliminate the need for large pre-trained word embeddings or vocabulary pruning.

Self-Governing Neural Networks
We model the Self-Governing network using a projection model architecture (Ravi, 2017). The projection model is a simple network with dynamically-computed layers that encodes a set of efficient-to-compute operations which can be performed directly on device for inference. The model defines a set of efficient "projection" functions $P(x_i)$ that project each input instance $x_i$ to a different space $\Omega_P$ and then performs learning in this space to map it to corresponding outputs $y_i^p$. A very simple projection model comprises just a few operations, where the inputs $x_i$ are transformed using a series of $T$ projection functions $P^1, \ldots, P^T$ followed by a single layer of activations.

Model Architecture
In this work, we design a Self-Governing Neural Network (SGNN) using a multi-layered locality-sensitive projection model. Figure 1 shows the model architecture of the on-device SGNN network. The self-governing property of this network stems from its ability to learn a model (e.g., a classifier) without having to initialize, load or store any feature or vocabulary weight matrices. In this sense, our method is a truly embedding-free approach, unlike the majority of widely-used state-of-the-art deep learning techniques in NLP, whose performance depends on embeddings pre-trained on large corpora. Instead, we use the projection functions to dynamically transform each input to a low-dimensional representation. Furthermore, we stack this with additional layers and non-linear activations to achieve deep, non-linear combinations of projections that permit the network to learn complex mappings from inputs $x_i$ to outputs $y_i$. An SGNN network is shown below:

$$i_p = [P^1(x_i), \ldots, P^T(x_i)]$$
$$h_p = \sigma(W_p \cdot i_p + b_p)$$
$$h_t = \sigma(W_t \cdot h_{t-1} + b_t)$$
$$y_i = \mathrm{softmax}(W_o \cdot h_{p+k-1} + b_o)$$

where $i_p$ refers to the output of the projection operation applied to input $x_i$, $h_p$ is applied to the projection output, $h_t$ is applied at intermediate layers of the network with depth $k$, followed by a final softmax activation layer at the top, and $\sigma$ denotes a non-linear activation. In a $k$-layer SGNN, $h_t$, where $t = p, p + 1, \ldots, p + k - 1$, refers to the $k$ subsequent layers after the projection layer. $W_p$, $W_t$, $W_o$ and $b_p$, $b_t$, $b_o$ represent trainable weights and biases respectively. The projection transformations use precomputed parameterized functions, i.e., they are not trained during the learning process, and their outputs are concatenated to form the hidden units for subsequent operations. Each input text $x_i$ is converted to an intermediate feature vector (via raw text features such as skip-grams) followed by projections.
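The layer stack described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the ReLU activation, the random untrained weights, the random bits standing in for the projection output, and the helper names (`dense`, `init_layer`, `sgnn_forward`) are all assumptions. The sizes (1120 projection bits, two hidden layers of 256 units, 42 output classes) follow the configuration and the SwDA tag set reported later in the paper.

```python
import math
import random


def dense(x, weights, bias):
    """Affine map W.x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    """Assumed non-linearity; the text only says 'non-linear activations'."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases, for illustration only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sgnn_forward(i_p, hidden_layers, output_layer):
    """Projection bits -> k hidden layers -> softmax over classes."""
    h = i_p
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    out_weights, out_bias = output_layer
    return softmax(dense(h, out_weights, out_bias))


rng = random.Random(0)
i_p = [rng.randint(0, 1) for _ in range(1120)]  # stand-in for the projection layer output
hidden_layers = [init_layer(rng, 1120, 256), init_layer(rng, 256, 256)]
output_layer = init_layer(rng, 256, 42)
probs = sgnn_forward(i_p, hidden_layers, output_layer)
```

Only the dense layers carry trainable parameters here; the input `i_p` is produced by fixed projection functions, which is why no embedding or vocabulary matrix appears anywhere in the sketch.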
On-the-fly Computation. The transformation step $F$ dynamically extracts features from the raw input. Text features (e.g., skip-grams) are converted into feature-ids $f_j$ (via hashing) to generate a sparse feature representation $x_i$ of feature-id, weight pairs $(f_j, w_j)$. This intermediate feature representation is passed through projection functions $P$ to construct the projection layer $i_p$ in SGNN. For this last step, a projection vector $P_k$ is first constructed on the fly using a hash function with the feature ids $f_j$ in $x_i$ and a fixed seed as input; then the dot product of the two vectors $\langle x_i, P_k \rangle$ is computed and transformed into the binary representation $P_k(x_i)$ using $\mathrm{sgn}(\cdot)$ of the dot product. As shown in Figure 1, both the $F$ and $P$ steps are computed on the fly, i.e., no word-embedding or vocabulary/feature matrices need to be stored and looked up during training or inference. Instead, feature-ids and projection vectors are dynamically computed via hash functions. For the intermediate feature weights $w_j$, we use observed counts in each input text and do not use pre-computed statistics like idf. Hence the method is embedding-free.
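The feature step $F$ can be illustrated with a short sketch. It is hedged in several ways: the paper does not spell out its skip-gram definition or hash function, so the sketch assumes skip-grams are 3-grams that skip at most two tokens in total, uses MD5 purely as a convenient deterministic hash, and tokenizes on whitespace. The names `stable_hash`, `skip_grams` and `extract_features` are hypothetical.

```python
import hashlib
from collections import Counter
from itertools import combinations


def stable_hash(text, seed=0):
    """Deterministic 64-bit hash; Python's built-in hash() is salted per process."""
    digest = hashlib.md5(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def skip_grams(tokens, n=3, max_skip=2):
    """Yield n-grams starting at each position that skip at most max_skip tokens in total."""
    for start in range(len(tokens)):
        window = tokens[start:start + n + max_skip]
        head, rest = window[0], window[1:]
        for tail in combinations(rest, n - 1):
            yield (head,) + tail


def extract_features(text, n=3, max_skip=2):
    """Map raw text to a sparse {feature_id: count} dictionary, i.e. (f_j, w_j) pairs."""
    tokens = text.lower().split()
    grams = skip_grams(tokens, n, max_skip)
    return dict(Counter(stable_hash(" ".join(gram)) for gram in grams))


features = extract_features("yeah i think so too")
```

The weights are raw counts observed in the input, in line with the text above, and nothing is looked up in a stored vocabulary: an unseen word simply hashes to a new feature-id.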
Model Optimization. The SGNN network is trained from scratch on the task data using a supervised loss defined with respect to the ground truth $\hat{y}_i$. During training, the network learns to choose and apply specific projection operations $P^j$ (via activations) that are more predictive for a given task. The choice of the type of projection matrix $P$ as well as the representation of the projected space $\Omega_P$ has a direct effect on computation cost and model size. We leverage an efficient randomized projection method and use a binary representation $\{0, 1\}^d$ for $\Omega_P$. This yields a drastically lower memory footprint both in terms of number and size of parameters.
Computing Projections. We employ an efficient randomized projection method for the projection step. We use locality sensitive hashing (LSH) (Charikar, 2002) to model the underlying projection operations in SGNN. LSH is typically used as a dimensionality reduction technique for clustering (Manning et al., 2008). LSH allows us to project similar inputs $x_i$ or intermediate network layers into hidden unit vectors that are nearby in metric space. We use repeated binary hashing for $P$ and apply the projection vectors to transform the input $x_i$ to a binary hash representation denoted by $P_k(x_i) = \mathrm{sgn}[\langle x_i, P_k \rangle]$. This results in a $d$-bit vector representation, one bit corresponding to each projection row $P_{k=1 \ldots d}$.
The same projection matrix $P$ is used for training and inference. We never need to explicitly store the random projection vectors $P_k$, since we can compute them on the fly using hash functions over feature indices with a fixed row seed rather than invoking a random number generator. This also permits us to perform projection operations that are linear in the observed feature size rather than the overall feature or vocabulary size, which can be prohibitively large for high-dimensional data, thereby saving both memory and computation cost. Thus, SGNN can efficiently model high-dimensional sparse inputs and large vocabulary sizes common for text applications instead of relying on feature pruning or other pre-processing heuristics employed to restrict input sizes in standard neural networks for feasible training. The binary representation matters because it yields a far more compact representation for the projection network parameters, which in turn considerably reduces the model size.
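A minimal sketch of this hash-based projection follows. The details are assumptions chosen for illustration: each entry of a projection vector is taken to be +1 or -1, derived from an MD5 hash of the row seed and the feature-id, and the sign of the dot product is stored as a 0/1 bit. The function names and the toy input are hypothetical; the sizes (80 projection functions of 14 bits each) are the configuration the paper reports in its hyperparameter section.

```python
import hashlib


def projection_entry(row_seed, feature_id):
    """Recompute one +1/-1 entry of a projection vector from a hash, so no matrix is stored."""
    digest = hashlib.md5(f"{row_seed}:{feature_id}".encode("utf-8")).digest()
    return 1.0 if digest[0] & 1 else -1.0


def project(features, num_functions, num_bits):
    """Return num_functions * num_bits bits; each bit is sgn(<x, P_k>) stored as 0 or 1."""
    bits = []
    for j in range(num_functions):
        for k in range(num_bits):
            row_seed = j * num_bits + k  # fixed seed identifying this projection row
            # Only the observed features are touched, so the cost is linear in len(features).
            dot = sum(w * projection_entry(row_seed, f) for f, w in features.items())
            bits.append(1 if dot > 0 else 0)
    return bits


x = {101: 2.0, 202: 1.0, 303: 1.0}  # hypothetical sparse input of (feature_id, weight) pairs
i_p = project(x, num_functions=80, num_bits=14)
```

Because every entry is a pure function of `(row_seed, feature_id)`, training and inference see the same implicit projection matrix without it ever being materialized, and rescaling all weights by a positive constant leaves the bits unchanged.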

SGNN Parameters.
In practice, we employ $T$ different projection functions $P^{j=1 \ldots T}$, each resulting in a $d$-bit vector that is concatenated to form the projected vector $i_p$ in Equation 5. $T$ and $d$ vary depending on the projection network parameter configuration specified for $P$ and can be tuned to trade off between prediction quality and model size. Note that the choice of whether to use a single projection matrix of size $T \cdot d$ or $T$ separate matrices of $d$ columns depends on the type of projection employed (dense or sparse). For the intermediate feature step $F$ in Equation 5, we use skip-gram features (3-grams with skip-size=2) extracted from raw text.

Training and Inference
We use the compact bit units to represent the projection in SGNN. During training, the network learns to move the gradients for points that are nearby to each other in the projected bit space $\Omega_P$ in the same direction. The SGNN network is trained end-to-end using backpropagation. Training can progress efficiently with stochastic gradient descent with distributed computing on high-performance CPUs or GPUs.
Complexity. The overall complexity for SGNN inference, governed by the projection layer, is $O(n \cdot T \cdot d)$, where $n$ is the observed feature size (not the overall vocabulary size), which is linear in input size, $d$ is the number of LSH bits specified for each projection vector $P_k$, and $T$ is the number of projection functions used in $P$. The model size (in terms of number of parameters) and memory storage required for the projection inference step is $O(T \cdot d \cdot C)$, where $C$ is the number of hidden units in $h_p$ in the multi-layer projection network and is typically smaller than $T \cdot d$.
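As a worked example of these two expressions, the sketch below plugs in the configuration reported in the paper's hyperparameter section (T = 80, d = 14, 256 hidden units) and compares the result with the 10,000 × 100 embedding table mentioned in the discussion of model size. The observed feature count `n = 10` is a hypothetical value for a short utterance, and the function names are illustrative.

```python
def projection_operations(num_features, num_functions, num_bits):
    """Multiply-accumulate count of the projection layer: n * T * d."""
    return num_features * num_functions * num_bits


def first_layer_weights(num_functions, num_bits, hidden_units):
    """Weights connecting the T*d projection bits to C hidden units: T * d * C."""
    return num_functions * num_bits * hidden_units


T, d, C = 80, 14, 256
n = 10  # hypothetical number of observed features in one short text

ops = projection_operations(n, T, d)       # 10 * 80 * 14
weights = first_layer_weights(T, d, C)     # 1120 * 256
embedding_weights = 10_000 * 100           # vocabulary size x embedding dimension
```

Under these assumptions the projection costs 11,200 operations and the first layer holds 286,720 weights, well under a third of the 1,000,000 weights of the embedding table alone, and neither number grows with the vocabulary.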

Data Description
We conduct our experimental evaluation on two dialog act benchmark datasets.
• SwDA: Switchboard Dialog Act Corpus (Godfrey et al., 1992; Jurafsky et al., 1997) is a popular open-domain dialog corpus between two speakers with 42 dialog acts.
• MRDA: ICSI Meeting Recorder Dialog Act Corpus (Adam et al., 2003; Shriberg et al., 2004) is a dialog corpus of multiparty meetings with 5 dialog act tags.
Table 1 summarizes dataset statistics. We use the train, validation and test splits as defined in (Lee and Dernoncourt, 2016; Ortega and Vu, 2017).

Experimental Setup
We set up our experimental evaluation as follows: given a classification task and a dataset, we generate an on-device model. The size of the model can be configured (by adjusting the projection matrix $P$) to fit in the memory footprint of the device, e.g., a phone has more memory than a smart watch. For each classification task, we report accuracy on the test set.

Hyperparameter and Training
For both datasets we used the following: a 2-layer SGNN ($P_{T=80,d=14}$ × FullyConnected$_{256}$ × FullyConnected$_{256}$), a mini-batch size of 100, a dropout rate of 0.25, and a learning rate initialized to 0.025 with cosine annealing decay (Loshchilov and Hutter, 2016). These values were chosen via a grid search on development sets; we do not perform any other dataset-specific tuning. Unlike prior approaches (Lee and Dernoncourt, 2016; Ortega and Vu, 2017) that rely on pre-trained word embeddings, we learn the projection weights on the fly during training, i.e., word embeddings (or vocabularies) do not need to be stored. Instead, features are computed on the fly and are dynamically compressed via the projection matrices into projection vectors. Training is performed through stochastic gradient descent over shuffled mini-batches with the Nesterov momentum optimizer (Sutskever et al., 2013), run for 1M steps.
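For readers unfamiliar with cosine annealing, the schedule can be sketched as below. The paper states only the initial rate (0.025), the decay type, and the number of steps (1M); the single decay cycle without warm restarts, the minimum rate of zero, and the function name are assumptions made for this illustration.

```python
import math


def cosine_annealed_lr(step, total_steps=1_000_000, initial_lr=0.025, min_lr=0.0):
    """Learning rate that follows half a cosine wave from initial_lr down to min_lr."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


start_lr = cosine_annealed_lr(0)
halfway_lr = cosine_annealed_lr(500_000)
final_lr = cosine_annealed_lr(1_000_000)
```

The rate starts at the initial value, passes through half of it at the midpoint, and flattens out near zero at the end of training, so the late updates are small.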

Results
Tables 2 and 3 show results on the SwDA and MRDA dialog act datasets. Overall, our SGNN model consistently outperforms the baselines and prior state-of-the-art deep learning models.

Baselines
We compare our model against a majority class baseline and a Naive Bayes classifier (Lee and Dernoncourt, 2016). Our model significantly outperforms both baselines by 12 to 35% absolute.
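For reference, a majority class baseline predicts the most frequent training label for every test utterance, and its accuracy is the share of test utterances carrying that label. The sketch below uses made-up toy labels, not data or numbers from SwDA or MRDA, and the function name is hypothetical.

```python
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Predict the most frequent training label everywhere and report test accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


# Toy labels for illustration only.
train = ["statement", "statement", "question", "backchannel", "statement"]
test = ["statement", "question", "statement", "backchannel"]
label, accuracy = majority_class_baseline(train, test)
```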

Comparison against State-of-the-art Methods
We also compare our performance against prior work using HMMs (Stolcke et al., 2000) and recent deep learning methods like CNN (Lee and Dernoncourt, 2016), RNN (Khanpour et al., 2016) and RNN with gated attention (Tran et al., 2017).
To the best of our knowledge, (Lee and Dernoncourt, 2016; Ortega and Vu, 2017; Tran et al., 2017) are the latest approaches in dialog act classification, and they report results on the same data splits. Therefore, we compare our research against these works. According to (Ortega and Vu, 2017), prior work by (Ji and Bilmes, 2006) achieved promising results on the MRDA dataset, but since the evaluation was conducted on a different data split, it is hard to compare them directly.
On the SwDA and MRDA datasets, our SGNNs obtain the best results, with accuracies of 83.1 and 86.7 respectively, outperforming prior state-of-the-art work. This is notable given that we work with a very small memory footprint and do not rely on pre-trained word embeddings. Our study also shows that the proposed method is very effective for such natural language tasks compared to more complex neural network architectures such as deep CNN (Lee and Dernoncourt, 2016) and RNN variants (Khanpour et al., 2016; Ortega and Vu, 2017). We believe that compression techniques like locality sensitive projections, jointly coupled with non-linear functions, are effective at capturing low-dimensional semantic text representations that are useful for text classification applications.

Discussion on Model Size and Inference
LSTMs have millions of parameters, while our on-device architecture has just 300K parameters (an order of magnitude lower). Most deep learning methods also use a large vocabulary of 10K words or more. With each word embedding represented as a 100-dimensional vector, this leads to a storage requirement of 10,000 × 100 parameter weights in the first layer of the deep network alone. In contrast, SGNNs in all our experiments use a fixed 1120-dimensional vector regardless of the vocabulary or feature size, dynamic computation results

Conclusion
We proposed Self-Governing Neural Networks for on-device short text classification. Experiments on multiple dialog act datasets showed that our model outperforms state-of-the-art deep learning methods (Lee and Dernoncourt, 2016; Khanpour et al., 2016; Ortega and Vu, 2017). We introduced a compression technique that effectively captures low-dimensional semantic representation and produces compact models that significantly save on storage and computational cost. Our approach does not rely on pre-trained embeddings and efficiently computes the projection vectors on the fly. In the future, we are interested in extending this approach to more natural language tasks. For instance, we built a multilingual SGNN model for customer feedback classification (Liu et al., 2017) and obtained 73% on Japanese, close to the best performing system on the challenge (Plank, 2017). Unlike their method, we did not use any pre-processing, tagging, parsing, pre-trained embeddings or other resources.