End to End Binarized Neural Networks for Text Classification

Deep neural networks have demonstrated superior performance in almost every Natural Language Processing task; however, their increasing complexity raises concerns. A particular concern is that these networks pose high requirements on computing hardware and training budgets, with the state-of-the-art transformer models being a vivid example. Simplifying the computations performed by a network is one way of addressing the issue of increasing complexity. In this paper, we propose an end-to-end binarized neural network for the tasks of intent and text classification. In order to fully utilize the potential of end-to-end binarization, both the input representations (vector embeddings of token statistics) and the classifier are binarized. We demonstrate the efficiency of such a network on intent classification of short texts over three datasets, as well as on text classification with a larger dataset. On the considered datasets, the proposed network achieves results comparable to the state of the art while using 20-40% less memory and training time than the benchmarks.


Introduction
In recent years, deep neural networks have achieved great success in a variety of domains, but they are becoming more and more computationally expensive due to their ever-growing size. This tendency has been noted in (Strubell et al., 2019; Schwartz et al., 2019), with the recommendation that researchers in academia and industry turn their attention towards more computationally efficient methods. At the same time, many important application areas such as chatbots, IoT devices, mobile devices, and other power- and resource-constrained platforms require solutions that are highly computationally and memory efficient. Such use-cases limit the potential use of the state-of-the-art deep networks. One viable solution is the transformation of these high-performance neural networks into more computationally efficient architectures. Recently, Binarized Neural Networks (BNNs) (Hubara et al., 2016) have been developed, in which both weights and activations are restricted to $\{+1, -1\}$. A BNN is highly computationally efficient and has a much lower memory footprint. Tasks like language modeling (Zheng and Tang, 2016) have been performed using binarized neural networks but, to the best of our knowledge, no end-to-end trainable binarized architecture has yet been demonstrated for text classification.
In this paper, we introduce an architecture for the tasks of intent and text classification that fully utilizes the power of binary representations. The input texts are tokenized and embedded into binary high-dimensional (HD) vectors, forming distributed representations via the paradigm known as hyperdimensional computing (Kanerva, 2009). These binary input representations are used to train an end-to-end BNN classifier for intent classification. In terms of classification performance, the binarized architecture achieves results comparable to the state of the art on several standard intent classification datasets. The efficiency of the proposed architecture is shown in terms of its time and memory complexity relative to non-binarized architectures.

Proposed Method
Figure 1 presents a schematic overview of the architecture. Given an input text document $D$, we first pre-process the document. The pre-processed document is then tokenized into the corresponding tokens $\langle T_1, T_2, \ldots, T_n \rangle$, which are used as input to a count-based vectorizer. The vectorizer's representation, which is sparse and localist, is embedded into an HD vector (a distributed representation) using hyperdimensional computing. The HD vector representing the counter's content can be binarized and is used as input to a classifier. The primary classifier studied in this work is the BNN, but other classifiers are also considered for benchmarking. A minimal sketch of this pipeline is given below.
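As an illustrative sketch (not the paper's exact implementation), the pipeline can be summarized in a few lines of Python. Here `CountVectorizer` stands in for the count-based vectorizer, and `hd_embed` and `bnn` are placeholder callables for the components defined in the following subsections:

```python
from sklearn.feature_extraction.text import CountVectorizer

def classify_document(document, hd_embed, bnn):
    # 1. Pre-process + tokenize: a character n-gram CountVectorizer yields
    #    the sparse, localist count-based representation.
    vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
    counts = vectorizer.fit_transform([document])
    # 2. Embed the counts into a (binarized) HD vector via hyperdimensional computing.
    h = hd_embed(counts)
    # 3. Classify the HD vector with the BNN.
    return bnn(h)
```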

High-Dimensional Embedding of Vectorized Representations
In order to reduce the dimensionality of representations, we use hyperdimensional computing (Kanerva, 2009). First, each unique token $T_i$ is assigned a random $d$-dimensional bipolar HD vector, where $d$ is a hyperparameter of the method. HD vectors are stored in the item memory, a matrix $\mathbf{H} \in \{-1, +1\}^{d \times n}$, where $n$ is the number of tokens. Thus, for a token $T_i$ there is an HD vector $\mathbf{H}_{T_i} \in \{-1, +1\}^{d \times 1}$. To construct composite representations from the atomic HD vectors stored in $\mathbf{H}$, hyperdimensional computing defines three key operations: permutation ($\rho$), binding ($\odot$, implemented via element-wise multiplication), and bundling ($+$, implemented via element-wise addition) (Kanerva, 2009). The bundling operation allows storing information in HD vectors (Frady et al., 2018). These three operations allow embedding vectorized representations based on $n$-gram statistics into an HD vector (Joshi et al., 2016). We first generate $\mathbf{H}$, which has an HD vector for each token. The permutation operation $\rho$ is applied to $\mathbf{H}_{T_j}$ $j$ times ($\rho^j(\mathbf{H}_{T_j})$) to represent the relative position of token $T_j$ in an $n$-gram. A single HD vector corresponding to an $n$-gram (denoted as $\mathbf{m}$) is formed by the consecutive binding of the permuted HD vectors $\rho^j(\mathbf{H}_{T_j})$ representing the tokens in each position $j$ of the $n$-gram. For example, the trigram '#he' is embedded into an HD vector as follows:

$$\mathbf{m}_{\text{\#he}} = \rho^1(\mathbf{H}_{\#}) \odot \rho^2(\mathbf{H}_{h}) \odot \rho^3(\mathbf{H}_{e}).$$

In general, the HD vector of an $n$-gram is formed as $\mathbf{m} = \bigodot_{j=1}^{n} \rho^j(\mathbf{H}_{T_j})$, where $T_j$ is the token in the $j$th position of the $n$-gram and $\bigodot$ denotes the consecutive binding operations applied to $n$ HD vectors. Once it is known how to form an HD vector for an individual $n$-gram, embedding the $n$-gram statistics into an HD vector $\mathbf{h}$ is achieved by bundling together all $n$-grams observed in the document:

$$\mathbf{h} = \Big[ \sum_{i=1}^{k} f_i \mathbf{m}_i \Big],$$

where $k$ is the total number of unique $n$-grams, $f_i$ is the frequency of the $i$th $n$-gram, and $\mathbf{m}_i$ is the HD vector of the $i$th $n$-gram; $\sum$ denotes the bundling operation applied to several HD vectors; $[\ast]$ denotes the binarization operation, implemented via the sign function. The use of $[\ast]$ is optional, so we can obtain either a binarized or a non-binarized $\mathbf{h}$. If $\mathbf{h}$ is non-binarized, its components are integers in the range $[-k, k]$, but these extreme values are highly unlikely since HD vectors for different $n$-grams are quasi-orthogonal; in the simplest (but not practical) case, when all $n$-grams have the same probability, the expected value of a component of $\mathbf{h}$ is $0$. Due to the use of $\sum$ for representing $n$-gram statistics, two HD vectors embedding two different $n$-gram statistics might have very different amplitudes if the frequencies in these statistics are very different. Binarizing the HD vectors $\mathbf{h}$ removes this issue. In the case of non-binarized HD vectors, we address it by using the cosine similarity, imposed by normalizing each $\mathbf{h}$ by its $\ell_2$ norm; thus, all $\mathbf{h}$ have the same norm, and their dot product is equivalent to their cosine similarity.
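To make the embedding procedure concrete, below is a minimal NumPy sketch of it. Realizing the permutation $\rho$ as a cyclic shift (`np.roll`) and breaking ties with $\mathrm{sign}(0) = +1$ are assumptions of this sketch; $d = 512$ matches the dimensionality used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def build_item_memory(tokens, d=512):
    # Item memory H: one random bipolar d-dimensional HD vector per unique token.
    return {t: rng.choice([-1, 1], size=d) for t in tokens}

def embed_ngram(ngram, H):
    # m = rho^1(H_{T_1}) (.) rho^2(H_{T_2}) (.) ... : bind permuted token vectors,
    # with the permutation rho realized as a cyclic shift (np.roll).
    m = np.ones_like(next(iter(H.values())))
    for j, token in enumerate(ngram, start=1):
        m = m * np.roll(H[token], j)
    return m

def embed_statistics(ngram_freqs, H, binarize=True):
    # Bundle (element-wise add) the HD vectors of all observed n-grams,
    # weighted by their frequencies f_i.
    h = sum(f * embed_ngram(g, H) for g, f in ngram_freqs.items())
    if binarize:
        return np.where(h >= 0, 1, -1)   # [*]: binarize via the sign function
    return h / np.linalg.norm(h)         # else l2-normalize for cosine similarity

# Example: embed the character-trigram statistics of a short string.
text = "#hello"
trigrams = [tuple(text[i:i + 3]) for i in range(len(text) - 2)]
freqs = {g: trigrams.count(g) for g in set(trigrams)}
H = build_item_memory(set(text))
h = embed_statistics(freqs, H)           # binary HD vector in {-1, +1}^512
```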

Binarized Neural Networks
Based on the work of Hubara et al. (2016), we construct BNNs capable of working with representations of texts. To take full advantage of binarized HD vectors, we constrain the weights and activations of the network to binary values. In particular, the sign function is applied to every weight and activation in the BNN to restrict it to $\{+1, -1\}$:

$$\mathrm{sign}(x) = \begin{cases} +1, & x \geq 0, \\ -1, & \text{otherwise}, \end{cases}$$

where $x$ can be any weight or activation value. We further define a convolutional 1D layer that creates a convolution kernel, which is convolved with the input HD vector over a single spatial dimension to produce a tensor of outputs. Since gradient descent methods make small changes to the values of the weights, which cannot be done with binary values, we use the straight-through estimator idea, as described in (Yin et al., 2019). We also define a value $c$ beyond which we clip the gradients in the backward pass:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial\, \mathrm{sign}(x)} \cdot \mathbb{1}_{|x| \leq c}.$$

This ensures that the entire architecture is end-to-end trainable using gradient descent optimization.
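As an illustration, here is a minimal PyTorch sketch of binarization with a straight-through estimator. The clip value $c = 1$ is the standard choice from (Hubara et al., 2016) and is an assumption here, as is the convention $\mathrm{sign}(0) = +1$:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    # Binarization with a straight-through estimator for the backward pass.

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Restrict any weight or activation to {+1, -1} (convention: sign(0) = +1).
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient unchanged where |x| <= c
        # (here c = 1) and clip it to zero elsewhere.
        return grad_output * (x.abs() <= 1.0).to(grad_output.dtype)

binarize = BinarizeSTE.apply
# Usage: apply to weights before a Conv1d/Linear op, or as a binary activation,
# e.g. y = torch.nn.functional.conv1d(binarize(x), binarize(weight)).
```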

Results and Discussion
For the CNN-based architecture, 5 hidden layers were used: 3 convolutional 1D layers followed by 2 dense layers. Due to its resemblance to the original LeNet architecture (LeCun et al., 1998), we refer to this architecture as Text-LeNet. We compare binarized HD vectors with the binarized Text-LeNet (BNN) architecture as the classifier against non-binarized HD vectors with non-binarized Text-LeNet. The $F_1$ scores are compared in Table 1, where the BNN performed equally well to the Text-LeNet architecture while being 20% to 40% more memory efficient, as shown in Figure 2 (a). Note that due to the specifics of the implementation, the BNNs use 32-bit float values, as does Text-LeNet. The memory efficiency of BNNs can be further improved by 4x when 8-bit representations are used, and by up to 32x if single-bit representations are used; however, hardware limitations prevented us from going to that extreme. On the performance side, the BNN outperforms Text-LeNet on the AskUbuntu and WebApplication datasets for 4 out of 6 tokenizers. The results reported in Table 1 used 512-dimensional HD vectors for the Chatbot, AskUbuntu, and WebApplication corpora, while 1,024-dimensional HD vectors were used for the 20NewsGroups dataset.
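For reference, below is a hedged PyTorch sketch of a Text-LeNet-style network. The kernel sizes, channel widths, and pooling are illustrative assumptions rather than the paper's exact hyperparameters; the binarized (BNN) variant would additionally route weights and activations through a sign/STE function such as the one shown above:

```python
import torch
import torch.nn as nn

class TextLeNet(nn.Module):
    """3 convolutional 1D layers + 2 dense layers over an HD-vector input."""

    def __init__(self, d=512, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(128), nn.ReLU(), nn.Linear(128, num_classes),
        )

    def forward(self, h):          # h: (batch, d) HD vectors
        x = h.unsqueeze(1)         # -> (batch, 1, d) for Conv1d
        return self.classifier(self.features(x))

model = TextLeNet(d=512, num_classes=5)
logits = model(torch.randn(8, 512))    # dummy batch of 8 HD vectors
```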
One thing to note here is that Text-LeNet also used HD vectors with the mentioned tokenizers, but the HD vectors were non-binarized. HD vectors are in themselves already faster and much more efficient than counter-based representations, as shown in (Alonso et al., 2020). When experimenting with other embedding methods such as GloVe, training was significantly slower; therefore, HD vectors were used for all the experiments. In addition, using the binarized classifier (BNN) further improved the training time by up to 50% per epoch compared to the non-binarized classifier on all four datasets, as shown in Figure 2 (b). Furthermore, compared to GloVe embeddings with Text-LeNet, HD BNN used around 20-40% less memory for all the intent classification datasets.

Table 3: $F_1$ score comparison of various platforms on intent classification datasets of short texts with the methods used in the paper. Some results are taken from (Alonso et al., 2020).
We also benchmarked the binarized HD vectors against binarized 300-dimensional GloVe vectors and a binarized version of the counter-based representation for the SemHash tokenizer (Alonso et al., 2020) on all the datasets. Table 2 summarizes the results of the comparison. All the binarized representations were trained with the same BNN classifier. Binarized HD vectors performed significantly better than the other binarized methods, outperforming binarized GloVe by 4-25% and binarized SemHash by 2-8% on 2 out of 3 smaller intent classification datasets, and achieving comparable results on the AskUbuntu dataset. The trend continued for 20NewsGroups, with binarized HD achieving 5-7% better $F_1$ scores. Note that for the SemHash counter-based vectorizer, we binarize with the sign function $\mathrm{sign}(x) = +1$ for $x > 0$ and $-1$ otherwise.
In Figure 3, MLP and Linear SVC with all the tokenizers and HD vectors as the representation are compared with MLP and Linear SVC classifiers using the SemHash tokenizer and counter-based vectorizer as the representation from (Alonso et al., 2020). The $F_1$ score is comparable to the state of the art for both MLP and SVC. For all small intent classification datasets, binarized HD vectors achieved better results than non-HD vectors. The proposed architecture beats the non-HD baselines by +2% on the AskUbuntu and Chatbot corpora and by +5% on the WebApplication corpus. However, for 20NewsGroups, the results of binarized HD vectors are lower than those of non-HD vectors. This is mainly due to the large size of the dataset; simple classifiers like Linear SVC fail to perform with only binarized values. The results for all the other classifiers are provided in the Appendix. Table 3 compares the $F_1$ scores of various platforms on the intent classification datasets. We report the results of binarized HD vectors with the best of the nine classifiers mentioned (binarized HD vectors with the best classifier), non-binarized HD vectors with Text-LeNet (HD Text-LeNet), and binarized HD vectors with binarized Text-LeNet (HD BNN). Our end-to-end binarized architecture (HD BNN) achieved state-of-the-art results on the Chatbot dataset. The approach where only the HD vectors were binarized (binarized HD vectors with the best classifier) achieved state-of-the-art results on the AskUbuntu dataset. The results on the WebApplication dataset are comparable to the state of the art (0.87 with SemHash): 0.84 for binarized HD vectors with the best classifier and 0.83 for HD BNN. The average performance of both binarized HD vectors with the best classifier (0.92) and HD BNN (0.91) was also comparable to the best non-binarized approach (0.92).

Conclusion
In this work, we show that it is possible to achieve results comparable to the state of the art while using binarized representations in all components of a text classification architecture. This opens up the exploration of binary representations both for reducing the memory footprint of the architecture and for increasing the energy efficiency of the inference phase, owing to the efficiency of binary operations. This work takes a step towards enabling NLP functionality on resource-constrained devices.