Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks

Many state-of-the-art neural models for NLP are heavily parameterized and thus memory inefficient. This paper proposes a series of lightweight and memory-efficient neural architectures for a potpourri of natural language processing (NLP) tasks. To this end, our models exploit computation in Quaternion algebra and hypercomplex spaces, enabling not only expressive inter-component interactions but also significantly (up to 75%) reduced parameter size owing to fewer degrees of freedom in the Hamilton product. We propose Quaternion variants of models, giving rise to new architectures such as the Quaternion attention model and the Quaternion Transformer. Extensive experiments on a battery of NLP tasks demonstrate the utility of the proposed Quaternion-inspired models, enabling up to a 75% reduction in parameter size without significant loss in performance.


Introduction
Neural network architectures such as Transformers (Vaswani et al., 2017; Dehghani et al., 2018) and attention networks (Parikh et al., 2016; Seo et al., 2016; Bahdanau et al., 2014) are dominant solutions in natural language processing (NLP) research today. Many of these architectures are primarily concerned with learning useful feature representations from data, for which providing a strong architectural inductive bias is known to be extremely helpful for obtaining stellar results.
Unfortunately, many of these models are known to be heavily parameterized, with state-of-the-art models easily containing millions or billions of parameters (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019). This renders practical deployment challenging. As such, enabling efficient and lightweight adaptations of these models, without significantly degrading performance, would certainly have a positive impact on many real-world applications.

* Work done while at University of Maryland.
To this end, this paper explores a new way to improve/maintain the performance of these neural architectures while substantially reducing the parameter cost (compression of up to 75%). In order to achieve this, we move beyond real space, exploring computation in Quaternion space (i.e., hypercomplex numbers) as an inductive bias. Hypercomplex numbers comprise a real component and three imaginary components (i.e., i, j, k), in which interdependencies between these components are encoded naturally during training via the Hamilton product ⊗. Hamilton products have fewer degrees of freedom, enabling up to four times compression of model size. Technical details are deferred to subsequent sections.
While Quaternion connectionist architectures have been considered in various deep learning application areas, such as speech recognition (Parcollet et al., 2018b), kinematics/human motion (Pavllo et al., 2018), and computer vision (Gaudet and Maida, 2017), our work presents the first hypercomplex inductive bias designed for a broad range of NLP tasks. Other fields have motivated the usage of Quaternions primarily by their naturally 3- or 4-dimensional input features (e.g., RGB scenes or 3D human poses) (Parcollet et al., 2018b; Pavllo et al., 2018). We can similarly motivate our approach by considering the multi-sense nature of natural language (Li and Jurafsky, 2015; Neelakantan et al., 2015; Huang et al., 2012): having multiple embeddings or components per token is well-aligned with this view.
Latent interactions between components may also bring additional benefits, especially for applications that require learning pairwise affinity scores (Parikh et al., 2016; Seo et al., 2016). Intuitively, instead of regular (real) dot products, Hamilton products ⊗ learn representations by matching across multiple (inter-latent) components in hypercomplex space. Alternatively, the effectiveness of multi-view and multi-headed (Vaswani et al., 2017) approaches may also explain the suitability of Quaternion spaces for NLP models. The advantage over multi-headed approaches is that Quaternion spaces explicitly encode latent interactions between these components or heads via the Hamilton product, which intuitively increases the expressiveness of the model; multi-headed embeddings, by contrast, are generally produced independently.
To this end, we propose two Quaternion-inspired neural architectures, namely, the Quaternion attention model and the Quaternion Transformer. In this paper, we devise and formulate a new attention (and self-attention) mechanism in Quaternion space using Hamilton products. Transformation layers are aptly replaced with Quaternion feed-forward networks, yielding substantial savings in parameter size (of up to 75% compression) while achieving comparable (and occasionally better) performance.
Contributions All in all, we make the following major contributions:

• We propose Quaternion neural models for NLP. More concretely, we propose a novel Quaternion attention model and Quaternion Transformer for a wide range of NLP tasks. To the best of our knowledge, this is the first formulation of hypercomplex attention and Quaternion models for NLP.

• We evaluate our Quaternion NLP models on a wide range of diverse NLP tasks, such as pairwise text classification (natural language inference, question answering, paraphrase identification, dialogue prediction), neural machine translation (NMT), sentiment analysis, mathematical language understanding (MLU), and subject-verb agreement (SVA).

• Our experimental results show that Quaternion models achieve comparable or better performance than their real-valued counterparts with up to a 75% reduction in parameter costs. The key advantage is that these models are both expressive (due to Hamilton products) and parameter efficient. Moreover, our Quaternion components are self-contained and play well with real-valued counterparts.

Background on Quaternion Algebra
This section introduces the necessary background for this paper. We introduce Quaternion algebra along with Hamilton products, which form the crux of our proposed approaches.
Quaternion A Quaternion Q ∈ H is a hypercomplex number with one real and three imaginary components:

Q = r + xi + yj + zk    (1)

where ijk = i^2 = j^2 = k^2 = −1 and the non-commutative multiplication rules apply: ij = k, jk = i, ki = j, ji = −k, kj = −i, ik = −j. In (1), r is the real value and, similarly, x, y, z are real numbers that represent the imaginary components of the Quaternion vector Q. Operations on Quaternions are defined in the following.
Addition The addition of two Quaternions is defined as:

Q + P = (Q_r + P_r) + (Q_x + P_x)i + (Q_y + P_y)j + (Q_z + P_z)k

where Q and P with subscripts denote the real value and imaginary components of Quaternions Q and P. Subtraction follows the same principle analogously, flipping + to −.
Conjugate The conjugate of Q is defined as:

Q* = r − xi − yj − zk

Norm The unit Quaternion Q' is defined as:

Q' = Q / sqrt(r^2 + x^2 + y^2 + z^2)

Hamilton Product The Hamilton product, which represents the multiplication of two Quaternions Q and P, is defined as:

Q ⊗ P = (Q_r P_r − Q_x P_x − Q_y P_y − Q_z P_z)
      + (Q_r P_x + Q_x P_r + Q_y P_z − Q_z P_y) i
      + (Q_r P_y − Q_x P_z + Q_y P_r + Q_z P_x) j
      + (Q_r P_z + Q_x P_y − Q_y P_x + Q_z P_r) k    (2)

which intuitively encourages inter-latent interaction between all four components of Q and P. In this work, we use Hamilton products extensively for the vector and matrix transformations that live at the heart of attention models for NLP.
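As a concrete illustration, the component formulas in (2) can be written out in a few lines of NumPy (this is our sketch, not code from the paper):

```python
import numpy as np

def hamilton_product(q, p):
    """Hamilton product Q ⊗ P of quaternions given as (r, x, y, z) arrays."""
    qr, qx, qy, qz = q
    pr, px, py, pz = p
    return np.array([
        qr * pr - qx * px - qy * py - qz * pz,  # real component
        qr * px + qx * pr + qy * pz - qz * py,  # i component
        qr * py - qx * pz + qy * pr + qz * px,  # j component
        qr * pz + qx * py - qy * px + qz * pr,  # k component
    ])

# Sanity checks against the multiplication rules: ij = k and i^2 = -1.
i = np.array([0., 1., 0., 0.])
j = np.array([0., 0., 1., 0.])
print(hamilton_product(i, j))  # -> [0. 0. 0. 1.], i.e., k
print(hamilton_product(i, i))  # -> [-1.  0.  0.  0.], i.e., -1
```

Note that, unlike a dot product, the result mixes every component of Q with every component of P, which is the inter-latent interaction the text refers to.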

Quaternion Models of Language
In this section, we propose Quaternion neural models for language processing tasks. We begin by introducing the building blocks, such as Quaternion feed-forward, Quaternion attention, and Quaternion Transformers.

Quaternion Feed-Forward
A Quaternion feed-forward layer is similar to a feed-forward layer in real space, except that it operates in hypercomplex space, where the Hamilton product is used. Denote by W ∈ H the weight parameter of a Quaternion feed-forward layer and let Q ∈ H be the layer input. The linear output of the layer is the Hamilton product of the two Quaternions: W ⊗ Q.
Saving Parameters? How and Why Since it might not be obvious at first glance why Quaternion models result in smaller parameterizations, we dedicate the following to addressing this. For the sake of comparison, let us express the Hamilton product W ⊗ Q of a Quaternion feed-forward layer as a matrix multiplication, which is how a real-space feed-forward layer is computed. Recall the definition of the Hamilton product in (2). Putting aside the Quaternion unit basis [1, i, j, k], W ⊗ Q can be expressed as:

[ W_r  −W_x  −W_y  −W_z ] [ Q_r ]
[ W_x   W_r  −W_z   W_y ] [ Q_x ]
[ W_y   W_z   W_r  −W_x ] [ Q_y ]    (3)
[ W_z  −W_y   W_x   W_r ] [ Q_z ]

We highlight that there are only 4 distinct parameter variables (4 degrees of freedom), namely W_r, W_x, W_y, W_z, in the weight matrix of (3), as illustrated by Figure 1; in a real-space feed-forward layer, every element of the weight matrix is a separate parameter variable (4 × 4 = 16 degrees of freedom). In other words, the degrees of freedom in a Quaternion feed-forward layer are only a quarter of those in its real-space counterpart, resulting in a 75% reduction in parameterization. Such a parameterization reduction can also be explained by weight sharing (Parcollet et al., 2018b,a).

Figure 1: 4 weight parameter variables (W_r, W_x, W_y, W_z) are used in 16 pairwise connections between components of the input and output Quaternions.
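The weight sharing in (3) can be made concrete with a small sketch (names and shapes are ours): a dense real layer over a 4n-dimensional input has (4n)^2 free weights, while the Quaternion layer reuses four n × n blocks:

```python
import numpy as np

def quaternion_linear(W, Q):
    """Apply W ⊗ Q via the shared-weight block matrix of (3).

    W is a tuple (Wr, Wx, Wy, Wz) of [n, n] blocks; Q is a tuple of four
    [n] component vectors. Only W's four blocks are free parameters."""
    Wr, Wx, Wy, Wz = W
    block = np.block([
        [Wr, -Wx, -Wy, -Wz],
        [Wx,  Wr, -Wz,  Wy],
        [Wy,  Wz,  Wr, -Wx],
        [Wz, -Wy,  Wx,  Wr],
    ])
    return block @ np.concatenate(Q)

n = 8
W = tuple(np.random.randn(n, n) for _ in range(4))
Q = tuple(np.random.randn(n) for _ in range(4))
out = quaternion_linear(W, Q)

free_params = 4 * n * n      # Quaternion layer: four shared blocks
real_params = (4 * n) ** 2   # dense real layer of the same input/output size
print(out.shape, free_params / real_params)  # -> (32,) 0.25, a 75% reduction
```

The block matrix is only materialized here for clarity; an implementation can equally apply the four blocks directly with the signs of (3).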
Nonlinearity Nonlinearity can be added to a Quaternion feed-forward layer by adopting component-wise activation (Parcollet et al., 2018a):

α(Q) = α(r) + α(x)i + α(y)j + α(z)k

where Q is defined in (1) and α(.) is a nonlinear function such as tanh or ReLU.

Quaternion Attention
Next, we propose a Quaternion attention model to compute attention and alignment between two sequences. Let A ∈ H^{a×d} and B ∈ H^{b×d} be the input word sequences, where a, b are the numbers of tokens in each sequence and d is the dimension of each input vector. We first compute:

E = A ⊗ B^T

where E ∈ H^{a×b}. We apply Softmax(.) to E component-wise:

G = ComponentSoftmax(E)
B' = G_r B_r + (G_x B_x)i + (G_y B_y)j + (G_z B_z)k

where G and B with subscripts represent the real and imaginary components of G and B. Similarly, we perform the same on A using the transpose of E:

F = ComponentSoftmax(E^T)
A' = F_r A_r + (F_x A_x)i + (F_y A_y)j + (F_z A_z)k

where A' is the aligned representation of B and B' is the aligned representation of A. Next, given A' ∈ H^{b×d} and B' ∈ H^{a×d}, we then compute and compare the learned alignments:

C_1 = Σ_i QFFN([B'_i ; A_i]),    C_2 = Σ_i QFFN([A'_i ; B_i])

where QFFN(.) is a Quaternion feed-forward layer with nonlinearity, [;] is the component-wise concatenation operator, i refers to word positional indices, and the sums run over the words in each sequence. Both outputs C_1 and C_2 are then passed into a final Quaternion feed-forward layer:

Y = QFFN([C_1 ; C_2])

where Y ∈ H is a Quaternion-valued output. In order to train our model end-to-end with real-valued losses, we concatenate each component of Y and pass the result into a final linear layer for classification.
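The component-wise softmax alignment step can be sketched as follows. For brevity, this sketch scores tokens with per-component dot products rather than the full Hamilton product, and all names and shapes are our assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def component_attention(A, B):
    """Align quaternion sequence B to A, one component at a time.

    A: [4, a, d] and B: [4, b, d], stacked as (r, x, y, z) components.
    Each component is scored with a plain dot product here, a simplified
    stand-in for the Hamilton-product affinity used in the paper."""
    E = np.einsum('cad,cbd->cab', A, B)     # per-component affinity matrices
    G = softmax(E, axis=-1)                 # ComponentSoftmax over B's tokens
    return np.einsum('cab,cbd->cad', G, B)  # aligned representation

A = np.random.randn(4, 5, 8)   # 5 tokens, d = 8 per component
B = np.random.randn(4, 7, 8)   # 7 tokens
aligned = component_attention(A, B)
print(aligned.shape)  # -> (4, 5, 8): one aligned vector per token of A
```

Running the same routine on the transposed affinities yields the alignment in the other direction.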

Quaternion Transformer
This section describes our Quaternion adaptation of Transformer networks. The Transformer (Vaswani et al., 2017) can be considered state-of-the-art across many NLP tasks. Transformer networks are characterized by stacked layers of linear transforms along with their signature self-attention mechanism. For the sake of brevity, we outline only the specific changes we make to the Transformer model.
Quaternion Self-Attention The standard self-attention mechanism considers the following:

A(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V

where Q, K, V are traditionally learned via linear transforms from the input X. The key idea here is that we replace these linear transforms with Quaternion transforms:

Q = W_q ⊗ X,    K = W_k ⊗ X,    V = W_v ⊗ X

where ⊗ is the Hamilton product and X is the input Quaternion representation of the layer. In this case, since computation is performed in Quaternion space, the parameters of each W are effectively reduced by 75%. Similarly, the computation of self-attention also relies on Hamilton products. The revised Quaternion self-attention is defined as follows:

A(Q, K, V) = ComponentSoftmax(Q ⊗ K^T / sqrt(d_k)) V    (4)

Note that in (4), Q ⊗ K^T returns four ℓ × ℓ matrices of attention weights, one for each component (r, i, j, k), where ℓ is the sequence length. Softmax is applied component-wise, and the multiplication with V is carried out in the same component-wise fashion as in the Quaternion attention model. Note that the Hamilton product in the self-attention itself does not change the parameter size of the network.
Quaternion Transformer Block Aside from the linear transformations for forming queries, keys, and values, Transformers also contain position-wise feed-forward networks with ReLU activations. We similarly replace these feed-forward networks (FFNs) with Quaternion FFNs. We denote the model in which all linear transformations are Quaternion as Quaternion Transformer (full), and the model that only uses Quaternion transforms in the self-attention mechanism as (partial). Finally, the remainder of the Transformer network remains identical to the original design (Vaswani et al., 2017), in the sense that component-wise functions are applied unless specified above.

Embedding Layers
In the case where the word embedding layer is trained from scratch (i.e., using byte-pair encoding in machine translation), we treat each embedding as the concatenation of its four components. In the case where pre-trained embeddings such as GloVe (Pennington et al., 2014) are used, a nonlinear transform is used to project the embeddings into Quaternion space.
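One simple way to realize such a projection is sketched below; the projection matrix and the choice of tanh are illustrative assumptions (the paper does not pin down the exact transform), and in practice the map would be learned:

```python
import numpy as np

def to_quaternion(embeddings, proj):
    """Map real [n, d] pre-trained vectors into quaternion space via a
    nonlinear transform, then split into the four components r, x, y, z."""
    h = np.tanh(embeddings @ proj)        # [n, 4*k]
    r, x, y, z = np.split(h, 4, axis=-1)  # [n, k] per component
    return r, x, y, z

glove = np.random.randn(100, 300)          # stand-in for GloVe vectors
proj = np.random.randn(300, 4 * 25) * 0.1  # learned in practice
r, x, y, z = to_quaternion(glove, proj)
print(r.shape)  # -> (100, 25)
```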

Connection to Real Components
A vast majority of the neural components in the deep learning arsenal operate in real space. As such, it would be beneficial for our Quaternion-inspired components to interface seamlessly with these components. When the input to a Quaternion module (such as the Quaternion FFN or attention modules) is real-valued, we simply treat it as a concatenation of components r, x, y, z. Similarly, the output of a Quaternion module, if passed to a real-valued layer, is treated as [r; x; y; z], where [;] is the concatenation operator.
Output Layer and Loss Functions To train our model, we simply concatenate all r, i, j, k components into a single vector at the final output layer. For example, for classification, the final Softmax output is defined as follows:

Y = Softmax(W [r; x; y; z] + b)

where Y ∈ R^{|C|}, |C| is the number of classes, and x, y, z are the imaginary components. The same applies to the sequence loss for sequence transduction problems.
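The real-valued classification head can be sketched as follows, where W_out and b are hypothetical parameter names of our own:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def classify(q_out, W_out, b):
    """Concatenate the r, i, j, k components of the final quaternion output
    and feed the resulting real vector to an ordinary softmax layer."""
    r, x, y, z = q_out
    flat = np.concatenate([r, x, y, z])  # [4*d] real-valued vector
    return softmax(W_out @ flat + b)     # [|C|] class probabilities

d, num_classes = 8, 3
q_out = tuple(np.random.randn(d) for _ in range(4))
probs = classify(q_out, np.random.randn(num_classes, 4 * d), np.zeros(num_classes))
print(probs.shape)  # -> (3,)
```

Because the head is purely real-valued, any standard cross-entropy or sequence loss can be applied unchanged.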
Parameter Initialization It is intuitive that specialized initialization schemes ought to be devised for Quaternion representations and their modules (Parcollet et al., 2018b,a).
w = φ (cos θ + q_imag sin θ)

where q_imag is the normalized imaginary Quaternion constructed by uniform random sampling from [0, 1], θ is randomly and uniformly sampled from [−π, π], and φ is a randomly sampled magnitude. However, our early experiments show that, at least within the context of NLP applications, this initialization performed comparably to or worse than the standard Glorot initialization. Hence, we opt to initialize all components independently with Glorot initialization.
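A sketch of such a Quaternion-aware initializer is below; the magnitude scaling of φ is our own choice for the sketch, not a value prescribed by the paper:

```python
import numpy as np

def quaternion_init(shape, rng=np.random.default_rng(0)):
    """Quaternion-aware initializer in the spirit of Parcollet et al.:
    each weight is w = phi * (cos(theta) + q_imag * sin(theta)), with a
    unit imaginary axis q_imag and a phase theta in [-pi, pi]."""
    v = rng.uniform(0.0, 1.0, size=(3,) + shape)
    q_imag = v / np.linalg.norm(v, axis=0, keepdims=True)  # unit imaginary axis
    theta = rng.uniform(-np.pi, np.pi, size=shape)
    # Magnitude scaling below is an assumption, akin to a fan-based heuristic.
    phi = rng.uniform(-1.0, 1.0, size=shape) / np.sqrt(2 * np.prod(shape))
    w_r = phi * np.cos(theta)
    w_x, w_y, w_z = phi * q_imag * np.sin(theta)
    return w_r, w_x, w_y, w_z

w_r, w_x, w_y, w_z = quaternion_init((4, 4))
print(w_r.shape)  # -> (4, 4); the quaternion norm of each weight equals |phi|
```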

Experiments
This section describes our experimental setup across multiple diverse NLP tasks. All experiments were run on NVIDIA Titan X hardware.
Our Models On pairwise text classification, we benchmark the Quaternion attention model (Q-Att), testing the ability of Quaternion models on pairwise representation learning. On all other tasks, such as machine translation and subject-verb agreement, we evaluate Quaternion Transformers. We evaluate two variations of Transformers, full and partial. The full setting converts all linear transformations into Quaternion space and is approximately 25% of the actual Transformer size. The second setting (partial) only converts the linear transforms at the self-attention mechanism. Tensor2Tensor is used for Transformer benchmarks, with its default hyperparameters and encoding for all experiments.

Pairwise Text Classification
We evaluate our proposed Quaternion attention (Q-Att) model on pairwise text classification tasks. This task involves predicting a label or ranking score for sentence pairs. We use a total of seven data sets from the following problem domains:

• Natural language inference (NLI) - This task is concerned with determining whether two sentences entail or contradict each other. We use SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), and MNLI (Williams et al., 2017) as benchmark data sets.
• Question answering (QA) - This task involves learning to rank question-answer pairs. We use WikiQA (Yang et al., 2015), which comprises QA pairs from Bing Search.
• Paraphrase detection -This task involves detecting if two sentences are paraphrases of each other. We use Tweets (Lan et al., 2017) data set and the Quora paraphrase data set (Wang et al., 2017).
• Dialogue response selection -This is a response selection (RS) task that tries to select the best response given a message. We use the Ubuntu dialogue corpus, UDC (Lowe et al., 2015).

Baselines and Comparison
We use the Decomposable Attention model as a baseline, adding [a_i ; b_i ; a_i ⊙ b_i ; a_i − b_i] before the compare layers, since we found this simple modification to increase performance. This also enables a fair comparison with our variation of Quaternion attention, which uses the Hamilton product instead of element-wise multiplication. We denote this baseline as DeAtt. We evaluate at a fixed representation size of d = 200 (equivalent to d = 50 in Quaternion space). We also include comparisons at equal parameterization (d = 50 and approximately 200K parameters) to observe the effect of Quaternion representations. Our selection of DeAtt is owing to its simplicity and ease of comparison. We defer the prospect of Quaternion variations of more advanced models (Tay et al., 2017b) to future work.
Results Table 1 reports results on seven different and diverse data sets. We observe that a tiny Q-Att model (d = 50) achieves comparable (or occasionally marginally better or worse) performance compared to DeAtt (d = 200), gaining a 68% parameter savings. The results actually improve on certain data sets (2/7) and are comparable (often less than a percentage point difference) compared with the d = 200 DeAtt model. Moreover, we scaled the parameter size of the DeAtt model to be similar to the Q-Att model and found that the performance degrades quite significantly (about 2% − 3% lower on all data sets). This demonstrates the quality and benefit of learning with Quaternion space.

Sentiment Analysis
We evaluate on the task of document-level sentiment analysis which is a binary classification problem.

Implementation Details
We compare our proposed Quaternion Transformer against the vanilla Transformer. In this experiment, we use the tiny Transformer setting in Tensor2Tensor with a vocab size of 8K. We use two data sets, namely IMDb (Maas et al., 2011) and Stanford Sentiment Treebank (SST) (Socher et al., 2013).
Results Table 2 reports results on the sentiment classification task on IMDb and SST. We observe that both the full and partial variations of Quaternion Transformers outperform the base Transformer. The Quaternion Transformer (partial) obtains a +1.0% lead over the vanilla Transformer on IMDb and +2.5% on SST, while achieving a 24.5% saving in parameter cost. Finally, the full Quaternion version leads by +1.3%/+1.6% on IMDb and SST respectively, while maintaining a 75% reduction in parameter cost. This supports our core hypothesis of improving accuracy while saving parameter costs.

Neural Machine Translation
We evaluate our proposed Quaternion Transformer against the vanilla Transformer on three data sets for neural machine translation (NMT). More concretely, we evaluate on IWSLT 2015 English-Vietnamese (En-Vi), WMT 2016 English-Romanian (En-Ro), and WMT 2018 English-Estonian (En-Et). We also include results on the standard WMT 2014 English-German (En-De) benchmark.
Implementation Details We implement models in Tensor2Tensor and train both models for 50k steps. We use the default base single-GPU hyperparameter setting for both models and apply checkpoint averaging. Note that our goal is not to obtain state-of-the-art models but to fairly and systematically evaluate both vanilla and Quaternion Transformers.

Results Table 3 reports the results on neural machine translation. On the IWSLT'15 En-Vi data set, the partial adaptation of the Quaternion Transformer outperforms (+2.5%) the base Transformer with a 32% reduction in parameter cost. On the other hand, the full adaptation comes close (−0.4%) with a 75% reduction in parameter cost. On the WMT'16 En-Ro data set, Quaternion Transformers do not outperform the base Transformer. We observe a −0.1% degradation in performance with the partial adaptation and a −4.3% degradation with the full adaptation. However, the drop in performance relative to the parameter savings is still quite decent, e.g., saving 32% of parameters for a drop of only 0.1 BLEU points; the full adaptation loses out comparatively. On the WMT'18 En-Et data set, the partial adaptation achieves the best result with 32% fewer parameters. The full adaptation, comparatively, only loses 1.0 BLEU point relative to the original Transformer while saving 75% of parameters.
WMT English-German Notably, Quaternion Transformer achieves a BLEU score of 26.42/25.14 for partial/full settings respectively on the standard WMT 2014 En-De benchmark. This is using a single GPU trained for 1M steps with a batch size of 8192. We note that results do not differ much from other single GPU runs (i.e., 26.07 BLEU) on this dataset (Nguyen and Joty, 2019).

Mathematical Language Understanding
We include evaluations on a newly released mathematical language understanding (MLU) data set (Wangperawong, 2018). This data set is a character-level transduction task that aims to test a model's compositional reasoning capabilities. For example, given an input x = 85, y = −523, x * y, the model strives to decode an output of −44455. Several variations of these problems exist, mainly switching and introducing new mathematical operators.

Implementation Details We train the Quaternion Transformer for 100K steps using the default Tensor2Tensor setting, following the original work (Wangperawong, 2018). We use the tiny hyperparameter setting. As in NMT, we report both full and partial adaptations of Quaternion Transformers. Baselines are reported from the original work as well, which includes comparisons with Universal Transformers (Dehghani et al., 2018) and Adaptive Computation Time (ACT) Universal Transformers. The evaluation measure is accuracy per sequence, which counts a generated sequence as correct if and only if the entire sequence is an exact match.
Results Table 4 reports our experimental results on the MLU data set. We observe a modest +7.8% accuracy gain when using the Quaternion Transformer (partial) while saving 24.5% in parameter costs. The Quaternion Transformer outperforms the Universal Transformer and is marginally outperformed (by 0.5%) by the Adaptive Computation Time Universal Transformer (ACT U-Transformer). On the other hand, a full Quaternion Transformer still outperforms the base Transformer (+2.8%) with a 75% parameter saving.

Subject Verb Agreement
Additionally, we compare our Quaternion Transformer on the subject-verb agreement task (Linzen et al., 2016). The task is a binary classification problem: determining whether a sentence fragment, e.g., 'The keys to the cabinet', is followed by a plural or singular verb.

Implementation
We use the Tensor2Tensor framework, training the Transformer and Quaternion Transformer with the tiny hyperparameter setting for 10k steps.

Results Quaternion Transformers perform equally well as (or better than) vanilla Transformers. On this task, the partial adaptation performs better, improving on the Transformer by +0.7% accuracy while saving 25% of parameters.

Related Work
The goal of learning effective representations lives at the heart of deep learning research. While most neural architectures for NLP have mainly explored the usage of real-valued representations (Vaswani et al., 2017; Bahdanau et al., 2014; Parikh et al., 2016), there has also been emerging interest in complex (Danihelka et al., 2016; Arjovsky et al., 2016; Gaudet and Maida, 2017) and hypercomplex representations (Parcollet et al., 2018b,a; Gaudet and Maida, 2017).
Notably, progress on Quaternion and hypercomplex representations for deep learning is still in its infancy, and consequently most works on this topic are very recent. Gaudet and Maida proposed deep Quaternion networks for image classification, introducing basic tools such as Quaternion batch normalization and Quaternion initialization (Gaudet and Maida, 2017). In a similar vein, Quaternion RNNs and CNNs were proposed for speech recognition (Parcollet et al., 2018a,b). In parallel, Zhu et al. proposed Quaternion CNNs and applied them to image classification and denoising tasks (Zhu et al., 2018). Comminiello et al. proposed Quaternion CNNs for sound detection (Comminiello et al., 2018). Zhang et al. (2019a) proposed Quaternion embeddings of knowledge graphs, and Zhang et al. (2019b) proposed Quaternion representations for collaborative filtering. A common theme is that Quaternion representations are helpful and provide utility over real-valued representations.
The interest in non-real spaces can be attributed to several factors. Firstly, complex weight matrices used to parameterize RNNs help to combat vanishing gradients (Arjovsky et al., 2016). On the other hand, complex spaces are also intuitively linked to associative composition, along with holographic reduced representations (Plate, 1991;Nickel et al., 2016;Tay et al., 2017a). Asymmetry has also demonstrated utility in domains such as relational learning (Trouillon et al., 2016;Nickel et al., 2016) and question answering (Tay et al., 2018). Complex networks (Trabelsi et al., 2017), in general, have also demonstrated promise over real networks.
In a similar vein, the hypercomplex Hamilton product provides a greater extent of expressiveness, similar to the complex Hermitian product, albeit with a 4-fold increase in interactions between real and imaginary components. In the case of Quaternion representations, due to parameter saving in the Hamilton product, models also enjoy a 75% reduction in parameter size.
Our work draws important links to multi-head (Vaswani et al., 2017) and multi-sense (Li and Jurafsky, 2015; Neelakantan et al., 2015) representations that are highly popular in NLP research. Intuitively, the four-component structure of Quaternion representations can also be interpreted as a kind of multi-headed architecture. The key difference is that the basic operators (e.g., the Hamilton product) provide an inductive bias that encourages interactions between these components. Notably, the idea of splitting vectors has also been explored (Daniluk et al., 2017), which is in a similar spirit to breaking a vector into four components.

Conclusion
This paper advocates for lightweight and efficient neural NLP via Quaternion representations. More concretely, we proposed two models: a Quaternion attention model and a Quaternion Transformer. We evaluated these models on eight different NLP tasks and a total of thirteen data sets. Across all data sets, the Quaternion models achieve comparable performance while reducing parameter size. All in all, we demonstrated the utility and benefits of incorporating Quaternion algebra in state-of-the-art neural models. We believe that this direction paves the way for more efficient and effective representation learning in NLP. Our Tensor2Tensor implementation of Quaternion Transformers will be released at https://github.com/vanzytay/QuaternionTransformers.