Tensorized Embedding Layers

The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous, which precludes their deployment in a limited resource setting. We introduce a novel way of parameterizing embedding layers based on the Tensor Train decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance. We evaluate our method on a wide range of benchmarks in natural language processing and analyze the trade-off between performance and compression ratios for a wide range of architectures, from MLPs to LSTMs and Transformers.


Introduction
Deep neural networks (DNNs) typically used in natural language processing (NLP) employ large embedding layers, which map the input words into continuous representations and usually have the form of lookup tables. Despite such simplicity, these layers often occupy a large portion of the model weights, which may cause problems in training and deploying them in a limited resource setting. Thus, the compression of large neural networks and the development of novel lightweight architectures have become essential problems in NLP research.
One way to reduce the number of parameters in the trained model is to impose a specific structure on its weight matrices, e.g., assume that they are low-rank or can be well approximated by low-rank tensor networks (Jaderberg et al., 2014). Such approaches are successful at compressing pretrained models, but they do not facilitate the training itself. Furthermore, they usually require an additional fine-tuning stage to recover the performance of the original model (Lebedev et al., 2015; Chen et al., 2018a).

* The first three authors contributed equally to this work.
In this paper, we introduce a new, parameter-efficient embedding layer, termed TT-embedding, which can be plugged into any model and trained end-to-end. The benefits of our compressed TT-layer are twofold. Firstly, instead of storing a huge embedding matrix, we store a sequence of much smaller two-dimensional and three-dimensional tensors from which the required embeddings are reconstructed, which allows compressing the model significantly at the cost of a negligible performance drop. Secondly, the overall number of parameters remains relatively small (and constant) during training.
The main contributions of our paper are:
• We propose to replace a standard dense embedding matrix with a novel compactly parameterized TT-embedding layer.
• We provide a theoretical justification of the proposed method from the softmax bottleneck (Yang et al., 2017b) perspective.
• We evaluate TT-embedding on a variety of benchmarks in NLP and report better compression-accuracy trade-off than standard embedding and its low-rank decomposition.

Related work
In recent years, a large body of research was devoted to compressing and speeding up various components of neural networks used in NLP tasks. Joulin et al. (2016) adapted the framework of product quantization to reduce the number of parameters in linear models used for text classification. See et al. (2016) proposed to compress LSTM-based neural machine translation models with pruning algorithms. Lobacheva et al. (2017) showed that the recurrent models could be significantly sparsified with the help of variational dropout (Kingma et al., 2015). Cheong and Daniel (2019) successfully compressed the Transformer architecture with the combination of pruning and quantization.
There is a plethora of prior work on compressing the embedding layers used in NLP models. Chen et al. (2018b) proposed a more compact K-way D-dimensional discrete encoding scheme to replace the "one-hot" encoding of categorical features, such as words in NLP tasks. Variani et al. (2018) introduced WEST, a compression method based on structured sparse and structured dense decomposition of the embedding matrix. Chen et al. (2018a) proposed to compress the pretrained embedding matrix by capitalizing on the power-law distribution of words and using smaller dimensionality (lower rank) for the embeddings of less frequent words. Baevski and Auli (2018) used a similar idea in an end-to-end fashion by training such structured low-rank embeddings from scratch. However, both of these methods rely on the assumption of a power-law distribution of tokens and are not efficient when dealing with other popular tokenizations, such as wordpieces (Schuster and Nakajima, 2012; Wu et al., 2016) or BPEs (Sennrich et al., 2015). The effectiveness of simple low-rank factorized embeddings has recently been re-discovered by Lan et al. (2019), and we refer to this method as an important baseline. Also, Lam (2018) proposed a quantization algorithm for compressing word vectors, but its benefits are orthogonal to those of low-rank matrix and tensor factorizations, and they can be used together, complementing each other.
Tensor methods have also been successfully applied to neural network compression. Novikov et al. (2015) coined the idea of reshaping the weights of fully-connected layers into high-dimensional tensors and representing them in the Tensor Train (TT) (Oseledets, 2011) format. This approach was later extended to convolutional (Garipov et al., 2016) and recurrent (Yang et al., 2017a; Tjandra et al., 2017; Yu et al., 2017) neural networks. Furthermore, Lebedev et al. (2015) showed that convolutional layers can also be compressed with the canonical (CP) tensor decomposition (Carroll and Chang, 1970; Harshman, 1970). Finally, Wang et al. (2018) compressed both fully-connected and convolutional layers with the Tensor Ring decomposition (Zhao et al., 2016). Recently, Ma et al. (2019) successfully applied Block-Term Tensor Decomposition to the compression of self-attention modules in the Transformer (Vaswani et al., 2017) architecture. In this work, we show the benefits of applying tensor machinery to the compression of embedding layers, which are an essential component of all models used in NLP.

Motivation
Since the embedding layers account for most of the parameters in NLP models, we can greatly reduce the size of the entire model by compressing these layers. Our goal is to replace the standard embedding matrix with a more compact, yet powerful and trainable, representation which would allow us to efficiently map words into vectors.
In this section, we briefly discuss our motivation for using tensorized embedding layers instead of both standard embedding layers and their low-rank factorized counterparts.

Compression ratio perspective
The simplest approach to compactly representing a large matrix is low-rank matrix factorization, which treats a matrix E ∈ R^{I×J} as a product of two matrices, E = UV^⊤. Here, U ∈ R^{I×R} and V ∈ R^{J×R} are much "thinner" matrices, and R is the rank hyperparameter. Note that rather than training the model with the standard embedding layer and then trying to compress the obtained embedding, we can seek the embedding matrix in the described low-rank format from the start. Then, for evaluation and training, the individual word embedding E[i, :] can be computed as the product U[i, :]V^⊤, which does not require materializing the full matrix E. This approach reduces the number of degrees of freedom in the embedding layer from IJ to (I + J)R.
However, in typical NLP tasks, the embedding dimension J is much smaller than the vocabulary size I, and obtaining a significant compression ratio using low-rank matrix factorization is problematic. In order to preserve the model performance, the rank R cannot be taken very small, and the compression ratio is bounded by IJ / ((I + J)R) ≤ J/R, which is close to 1 for the usually full-rank embedding matrix (see Figure 1 in (Chen et al., 2018b)). To overcome this bound and achieve a significant compression ratio even for matrices of disproportionate dimensions, we reshape them into multidimensional tensors and apply the Tensor Train decomposition, which allows for a more compact representation, with the number of parameters falling to logarithmic with respect to I.
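The bound above is easy to check numerically. The sketch below compares parameter counts for a full embedding, its low-rank factorization, and a TT-matrix parameterization; the vocabulary size, factorizations, ranks, and function names are illustrative choices of ours, not the paper's experimental settings.

```python
# Parameter counts for the three embedding parameterizations discussed above.
def full_params(I, J):
    return I * J

def low_rank_params(I, J, R):
    # E = U V^T with U of size I x R and V of size J x R
    return (I + J) * R

def tt_params(I_factors, J_factors, ranks):
    # ranks = (R_0, R_1, ..., R_N) with R_0 = R_N = 1;
    # TT-core k has shape R_{k-1} x I_k x J_k x R_k
    return sum(r0 * i * j * r1
               for r0, i, j, r1 in zip(ranks[:-1], I_factors, J_factors, ranks[1:]))

I, J = 25000, 512
print(full_params(I, J))                                    # 12800000
print(low_rank_params(I, J, 64))                            # gain bounded by J/R = 8
print(tt_params((40, 25, 25), (8, 8, 8), (1, 64, 64, 1)))   # 852480
```

Even with a fairly large TT-rank of 64, the TT parameterization stores roughly 15x fewer weights than the dense matrix, while the low-rank factorization at the same rank saves less than 8x.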

Softmax bottleneck perspective
We hypothesize that such tensorized embeddings are not only superior in terms of compression, but also better theoretically justified for use in NLP tasks than embedding layers based on matrix factorization. Our analysis is based on the softmax bottleneck theory (Yang et al., 2017b) and on the fact that modern NLP architectures typically use the same weights for both embedding and softmax layers (Press and Wolf, 2016; Inan et al., 2016).
This theory models a natural language as a collection of pairs (c, P*(X|c)) of a context and its conditional next-token distribution, and considers parametric language models with a softmax function operating on a context vector h_c and a word embedding x_i to define the conditional distribution P_θ(x|c). Given the number of context vectors N, the number of tokens M, and the dimensionality of word embeddings d, the following three matrices are defined: H_θ ∈ R^{N×d}, W_θ ∈ R^{M×d}, and A ∈ R^{N×M}. Their rows correspond to context vectors, word embeddings, and log-probabilities of the true data distribution, respectively. Such a language model attempts to approximate A (up to an addition of constant matrices corresponding to a degree of freedom in the softmax) in the form

A ≈ H_θ W_θ^⊤. (1)

Note that the rank of H_θ W_θ^⊤ is bounded by d, while the matrix A is presumed to be a high-rank matrix (Yang et al., 2017b), which provides an upper bound on the expressivity of such models. Now, suppose that the matrix W_θ is additionally factorized as W_θ = U_θ V_θ^⊤ with some rank R. Then the rank of the right-hand side of Equation (1) is bounded by R, which further reduces the expressivity of such models. Contrary to this, we show that tensorized embeddings do not reduce expressivity in the softmax bottleneck sense: while the embedding matrix is compressed, it still has full matrix rank. We provide a rigorous statement in Section 4.4 and verify the benefits of tensorized embeddings over low-rank factorized ones empirically in Section 5.
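The two rank bounds above can be illustrated numerically; the matrix sizes below are arbitrary toy values of ours, not quantities from the paper.

```python
import numpy as np

# Softmax-bottleneck rank bounds: logits H W^T have rank at most d, and
# factorizing W further tightens the bound to R.
rng = np.random.default_rng(0)
N_ctx, M_tok, d, R = 50, 40, 10, 4

H = rng.normal(size=(N_ctx, d))            # context vectors
W = rng.normal(size=(M_tok, d))            # word embeddings
logits = H @ W.T                           # rank bounded by d
print(np.linalg.matrix_rank(logits))       # 10

U = rng.normal(size=(M_tok, R))            # low-rank factorization W = U V^T
V = rng.normal(size=(d, R))
logits_lr = H @ (U @ V.T).T                # rank now bounded by R
print(np.linalg.matrix_rank(logits_lr))    # 4
```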

Tensor Train embedding
In this section, we briefly introduce the necessary notation and present the algorithm for training the TT-embedding layer. Hereinafter, by an N-way tensor X we mean a multidimensional array X ∈ R^{I_1×I_2×⋯×I_N} with elements X(i_1, . . . , i_N), where 0 ≤ i_k < I_k.

Tensor Train decomposition
A tensor X is said to be represented in the Tensor Train (TT) format (Oseledets, 2011) if each of its elements can be computed as

X(i_1, i_2, . . . , i_N) = Σ_{r_0, . . . , r_N} G^(1)(r_0, i_1, r_1) G^(2)(r_1, i_2, r_2) ⋯ G^(N)(r_{N−1}, i_N, r_N),

where the 3-dimensional tensors G^(k) ∈ R^{R_{k−1}×I_k×R_k} are called TT-cores, and R_0 = R_N = 1 by definition. The minimal values of {R_k} for which the TT-decomposition exists are called TT-ranks. Note that since R_0 = R_N = 1, the element X(i_1, i_2, . . . , i_N) is effectively the product of 2 vectors and N − 2 matrices:

X(i_1, . . . , i_N) = G^(1)[:, i_1, :] G^(2)[:, i_2, :] ⋯ G^(N)[:, i_N, :],

where G^(k)[:, i_k, :] stands for the slice (a subset of a tensor with some indices fixed) of the corresponding TT-core G^(k).
The number of degrees of freedom in such a decomposition is Σ_{k=1}^{N} R_{k−1} I_k R_k. Thus, in the case of small ranks, the total number of parameters required to store a tensor in the TT-representation is significantly smaller than the ∏_{k=1}^{N} I_k parameters required to store the full tensor of the corresponding size. This observation makes the application of the TT-decomposition appealing in many problems dealing with extremely large tensors.
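A minimal numerical check of the TT format, using toy shapes and ranks of our choosing: one element reconstructed from the chain of core slices agrees with the fully materialized tensor, and the core parameter count is compared against the dense tensor.

```python
import numpy as np

# Toy TT representation of a 4 x 5 x 6 tensor with ranks (1, 3, 2, 1).
rng = np.random.default_rng(1)
shape = (4, 5, 6)                       # I_1, I_2, I_3
ranks = (1, 3, 2, 1)                    # R_0, ..., R_3 with R_0 = R_3 = 1
cores = [rng.normal(size=(ranks[k], shape[k], ranks[k + 1]))
         for k in range(len(shape))]    # core k: R_{k-1} x I_k x R_k

def tt_element(cores, index):
    # product of a row vector, N-2 matrices, and a column vector
    v = np.ones((1, 1))
    for core, i in zip(cores, index):
        v = v @ core[:, i, :]
    return v.item()

# Materialize the full tensor for comparison (only feasible at toy sizes).
full = np.einsum('aib,bjc,ckd->aijkd', *cores).reshape(shape)
print(abs(tt_element(cores, (2, 3, 1)) - full[2, 3, 1]))  # ~0
print(sum(c.size for c in cores), full.size)              # 54 vs 120 parameters
```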

TT-matrix
Figure 1: Construction of the TT-matrix from the standard embedding matrix. Blue color depicts how a single element in the initial matrix is transformed into the product of the highlighted vectors and matrices in the TT-cores.

Let X ∈ R^{I×J} be a matrix. Given two arbitrary factorizations of its dimensions into natural numbers, I = ∏_{k=1}^{N} I_k and J = ∏_{k=1}^{N} J_k,
we can reshape and transpose this matrix into an N-way tensor X ∈ R^{I_1 J_1 × I_2 J_2 × ⋯ × I_N J_N} and then apply the TT-decomposition to it, resulting in a more compact representation. More concretely, define the bijections I(i) = (i_1, . . . , i_N) and J(j) = (j_1, . . . , j_N) that map the row and column indices i and j of the matrix X to N-dimensional vector-indices such that 0 ≤ i_k < I_k, 0 ≤ j_k < J_k, ∀k = 1, . . . , N. From the matrix X we form an N-way tensor X whose k-th dimension is of length I_k J_k and is indexed by the tuple (i_k, j_k). This tensor is then represented in the TT-format:

X((i_1, j_1), . . . , (i_N, j_N)) = G^(1)[:, (i_1, j_1), :] ⋯ G^(N)[:, (i_N, j_N), :]. (2)

Such a representation of a matrix in the TT-format is called a TT-matrix (Oseledets, 2010; Novikov et al., 2015) and is also known as a Matrix Product Operator (Pirvu et al., 2010) in the physics literature. The factorizations (I_1, I_2, . . . , I_N) × (J_1, J_2, . . . , J_N) will be referred to as the shape of the TT-matrix, or TT-shapes. The construction of the TT-matrix from a standard matrix is visualized in Figure 1 for a tensor of order 3. Note that in this case the TT-cores are in fact 4th-order tensors, as the indices are given by tuples (i_k, j_k), but all the operations defined for tensors in the TT-format naturally extend to TT-matrices.

TT-embedding
By TT-embedding, we mean a layer with trainable parameters (TT-cores) represented as a TT-matrix E of the underlying tensor shape (I_1, I_2, . . . , I_N) × (J_1, J_2, . . . , J_N), which can be transformed into a valid embedding layer E ∈ R^{I×J}, with I = ∏_{k=1}^{N} I_k and J = ∏_{k=1}^{N} J_k. To specify the shapes of the TT-cores, one also has to provide the TT-ranks, which are treated as hyperparameters of the layer and explicitly define the total compression ratio.
In order to compute the embedding for a particular word indexed by i in the vocabulary, we first map the row index i into the N-dimensional vector-index (i_1, . . . , i_N), and then calculate the components of the embedding with formula (2). Note that the computation of all its components amounts to selecting particular slices in the TT-cores (slices of shapes 1 × J_1 × R_1, R_1 × J_2 × R_2, and so on) and performing a sequence of matrix multiplications, which is executed efficiently in modern linear algebra packages, such as BLAS. Pseudocode for the procedure of computing the mapping i → (i_1, . . . , i_N) is given in Appendix A.
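The lookup procedure above can be sketched as follows. All shapes, ranks, and function names are illustrative toy values of ours; a real layer would hold the cores as trainable parameters rather than random arrays.

```python
import numpy as np

# TT-embedding row lookup: map the flat row index to a multi-index, take one
# slice per TT-core, and chain small matrix multiplications.
rng = np.random.default_rng(2)
I_facts, J_facts = (4, 3, 2), (2, 3, 2)          # I = 24, J = 12
ranks = (1, 3, 3, 1)
cores = [rng.normal(size=(ranks[k], I_facts[k], J_facts[k], ranks[k + 1]))
         for k in range(3)]                       # core k: R_{k-1} x I_k x J_k x R_k

def row_multi_index(i, facts):
    # mixed-radix digits of i, with i_1 the most significant
    idx = []
    for f in reversed(facts):
        i, r = divmod(i, f)
        idx.append(r)
    return tuple(reversed(idx))

def tt_embedding_row(cores, i, I_facts):
    multi = row_multi_index(i, I_facts)
    v = np.ones((1, 1))                           # (product of J_k so far, R_k)
    for core, ik in zip(cores, multi):
        s = core[:, ik]                           # slice: R_{k-1} x J_k x R_k
        v = np.einsum('ar,rjs->ajs', v, s).reshape(-1, s.shape[2])
    return v[:, 0]                                # the J-dimensional embedding

row = tt_embedding_row(cores, 17, I_facts)
print(row.shape)                                  # (12,)
```

Each lookup touches only Σ_k R_{k−1} J_k R_k numbers per core slice, never the full I × J matrix.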
In order to construct a TT-embedding layer for a vocabulary of size I and embedding dimension J, and to train a model with such a layer, one has to perform the following steps.
• Initialize the set of parameters of the embedding Θ = {G^(k)}_{k=1}^{N}. Concrete initialization scenarios are discussed further in the text.
• The computed embeddings can be followed by any standard layer, such as an LSTM (Hochreiter and Schmidhuber, 1997) or self-attention (Vaswani et al., 2017), and trained with backpropagation, since they differentiably depend on the parameters Θ.
TT-embedding implies a specific structure on the order of tokens in the vocabulary (the order of rows in the embedding matrix), and determining the optimal order is an appealing problem to solve. However, we leave this problem for future work and use the order produced by the standard tokenizer (sorted by frequency) in our current experiments.
We also experimented with a more general form of TT-decomposition, namely Tensor Ring (TR) decomposition (Zhao et al., 2016;Wang et al., 2018). This decomposition by construction has the appealing property of being circular permutation invariant (and, thus, more robust with respect to the order of the tokens), which could have potentially provided an improvement over the TT-based models with simple frequency based ordering. However, despite having stronger generalization abilities, TR might require more intricate optimization procedure (Section 2.5 in Grasedyck et al. (2013)), and we did not observe the benefits of using TR instead of TT in our experiments (Appendix C).
Initialization The standard way to initialize an embedding matrix E ∈ R^{I×J} is via, e.g., the Glorot initializer (Glorot and Bengio, 2010), which initializes each element as E(i, j) ∼ N(0, 2/(I + J)). For the TT-embedding, we can only initialize the TT-cores, and the distribution of the elements of the resulting matrix E is rather non-trivial. However, it is easy to verify that if we initialize each TT-core element as G^(k)(r_{k−1}, i_k, r_k) ∼ N(0, 1), the resulting distribution of the matrix elements E(i, j) has the properties E[E(i, j)] = 0 and Var[E(i, j)] = ∏_{k=1}^{N−1} R_k. Capitalizing on this observation, in order to obtain the desired variance Var[E(i, j)] = σ² while keeping E[E(i, j)] = 0, we can simply initialize each TT-core as

G^(k)(r_{k−1}, i_k, r_k) ∼ N(0, (σ² / ∏_{k=1}^{N−1} R_k)^{1/N}). (3)

The resulting distribution is not Gaussian; however, it approaches the Gaussian distribution with the increase of the TT-rank (Figure 2). In our experiments, we used the modified Glorot initializer implemented by formula (3), which greatly improved performance, as opposed to initializing the TT-cores simply via a standard normal distribution. It is also possible to initialize the TT-embedding layer by converting a learned embedding matrix into the TT-format using the TT-SVD algorithm (Oseledets, 2011); however, this approach requires the pretrained embedding matrix and performs worse in practice (Garipov et al., 2016).

Figure 2: Distribution of matrix elements of the TT-matrix of shape (5, 5, 5, 5) × (5, 5, 5, 5) initialized by formula (3) with σ = 1. As the TT-rank increases, the resulting distribution approaches the Gaussian N(0, 1).
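The variance calculation behind this initialization can be sketched as follows. If every core element is drawn i.i.d. from N(0, s²), each entry of the materialized matrix E is a sum of ∏ R_k products of N core elements, so Var[E(i, j)] = ∏ R_k · s^{2N}; solving for s gives the scaling below. The shapes are toy values and `tt_core_std` is our hypothetical helper name.

```python
import numpy as np

# Per-core standard deviation that yields a target entry variance sigma2
# for the materialized TT-matrix.
def tt_core_std(sigma2, ranks, N):
    prod_r = np.prod(ranks[1:-1])          # interior ranks R_1 ... R_{N-1}
    return (sigma2 / prod_r) ** (1.0 / (2 * N))

N = 3
ranks = (1, 16, 16, 1)
sigma2 = 2.0 / (512 + 64)                  # Glorot-style target variance (toy I, J)
s = tt_core_std(sigma2, ranks, N)

# Deterministic sanity check of the derivation: prod(R_k) * s^(2N) == sigma2
print(np.prod(ranks[1:-1]) * s ** (2 * N), sigma2)
```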
Hyperparameter selection TT-embedding introduces two additional structure-specific hyperparameters, namely TT-shapes and TT-ranks.
TT-embedding does not require the vocabulary size I to be represented exactly as the product of factors I_1, . . . , I_N; in fact, any factorization with ∏_{k=1}^{N} I_k = Ĩ ≥ I will suffice. However, in order to achieve the highest possible compression ratio for a fixed value of Ĩ, the factors {I_k}_{k=1}^{N} should be as close to each other as possible (Novikov et al., 2015; Yang et al., 2017a). Our implementation includes a simple automated procedure for selecting a good set of values ({I_k}_{k=1}^{N}, {J_k}_{k=1}^{N}) during TT-embedding initialization. The factors J_1, . . . , J_N are defined by the embedding dimensionality J, which can easily be chosen to support a good factorization, e.g., 512 = 8 × 8 × 8 or 480 = 6 × 5 × 4 × 4.
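One simple heuristic for choosing near-equal vocabulary factors is sketched below: start from ceil(I^{1/N}) for every factor, which guarantees the product covers the vocabulary (rows beyond I are simply never indexed), then greedily shrink factors while coverage holds. This is our illustrative sketch of such an automated procedure, not the paper's exact implementation.

```python
import math

# Pick N near-equal factors whose product is at least the vocabulary size I.
def near_equal_factors(I, N):
    base = math.ceil(I ** (1.0 / N))       # base^N >= I by construction
    factors = [base] * N
    for k in range(N):
        # shrink factor k while the product still covers I
        while factors[k] > 1 and math.prod(
                factors[:k] + [factors[k] - 1] + factors[k + 1:]) >= I:
            factors[k] -= 1
    return factors

print(near_equal_factors(25000, 3))        # product >= 25000, factors near 29
print(near_equal_factors(512, 3))          # [8, 8, 8]
```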
The values of the TT-ranks directly define the compression ratio, so choosing them too small or too large will result in either a significant performance drop or little reduction in the number of parameters. In our experiments, we set all TT-ranks to 16 for problems with small vocabularies and to 64–192 for problems with larger vocabularies, which resulted in a good trade-off between the compression ratio and the metric of interest.

Expressivity of TT-embedding
Recall that in Section 3 we argued that one advantage of TT-embeddings is the property of being full-rank matrices despite providing significant data compression. Let us now formalize this statement.
For fixed I = ∏_{k=1}^{N} I_k, J = ∏_{k=1}^{N} J_k, and a set of ranks R = (R_1, R_2, . . . , R_{N−1}), we consider M_R, the set of all tensors represented in the TT-matrix format such that for any X ∈ M_R we have TT-rank(X) ≤ R, entry-wise. Let X denote the ordinary matrix of size I × J obtained from the TT-matrix X by the inverse of the procedure described in Section 4.2 (application of formulas from Section 4.1, followed by transposing and reshaping). We show that the following result holds true.
Theorem 1. For all X ∈ M_R besides a set of measure zero, rank X = min(I, J), where the ordinary matrix rank is assumed.
See Appendix B for a proof. This theorem states that for almost all TT-embeddings (i.e., all but a negligible set), the corresponding standard embedding matrix is full-rank. Thus, using the same matrix in the softmax layer, we can achieve significant compression without hitting the softmax bottleneck, as opposed to the low-rank matrix factorization.
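Theorem 1 is easy to probe empirically with toy shapes of our choosing: a randomly initialized TT-matrix, once materialized, turns out to be full-rank even though it stores far fewer than I × J parameters.

```python
import numpy as np

# Materialize a random TT-matrix and check its ordinary matrix rank.
rng = np.random.default_rng(3)
I_facts, J_facts = (4, 4, 4), (2, 2, 2)      # I = 64, J = 8
ranks = (1, 2, 2, 1)
cores = [rng.normal(size=(ranks[k], I_facts[k], J_facts[k], ranks[k + 1]))
         for k in range(3)]

# Contract the cores and regroup (row indices, column indices).
E = np.einsum('aijb,bklc,cmnd->aikmjlnd', *cores).reshape(64, 8)
n_params = sum(c.size for c in cores)
print(n_params, 64 * 8)                      # 64 parameters vs 512 dense weights
print(np.linalg.matrix_rank(E))              # min(I, J) = 8: no extra rank loss
```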

Experiments
Code We have implemented TT-embeddings described in Section 4 in Python using PyTorch (Paszke et al., 2019). The code is available at the anonymous repository https://github.com/ttembedding/tt-embeddings.
Experimental setup We tested our approach on several popular NLP tasks:
• Sentiment analysis - as a starting point in our experiments, we test TT-embeddings on the rather simple task of predicting sentiment.
• Neural Machine Translation (NMT) - to verify the applicability of TT-embeddings to more practical problems, we test them on the more challenging task of machine translation.
• Language Modeling (LM) - then, we evaluate TT-embeddings on the language modeling task in the case of an extremely large vocabulary.
• Click Through Rate (CTR) prediction - finally, we show that TT-embeddings can be applied to binary classification with categorical features of significant cardinality.
To demonstrate the generality and wide applicability of the proposed approach, we tested it on various architectures, such as MLPs (CTR), LSTMs (sentiment analysis), and Transformers (NMT, LM). The baselines we compare with are:
1. A standard embedding layer parametrized by a matrix E ∈ R^{I×J}, with the baseline compression ratio of 1.
2. A low-rank factorized embedding layer parametrized by two matrices U ∈ R^{I×D} and V ∈ R^{J×D}, such that the corresponding embedding matrix is E = UV^⊤. The compression ratio in this case is (I × J) / ((I + J) × D) ≈ J/D.
Note that the Transformers in LM and NMT use the same weight matrix for their embedding and softmax layers (Press and Wolf, 2016; Inan et al., 2016), which already significantly reduces model size. Untying the weights and tensorizing only the embedding layer would lead to an increase in the number of parameters instead of compression. In our experiments, we use two separate TT-decompositions of the same shape for the embedding and softmax layers and report the compression ratios as (|V| × d_model) / (2 × TT-params).

Sentiment analysis
For this experiment, we used the IMDB dataset (Maas et al., 2011) with two categories, and the Stanford Sentiment Treebank (SST) (Socher et al., 2013) with five categories. We took the 25000 most frequent words for the IMDB dataset and 17200 for SST, embedded them into a J-dimensional space using either a standard embedding or a TT-embedding layer, and performed classification using a standard bidirectional two-layer LSTM with hidden size h = 128 and dropout rate P_drop = 0.5.
Our findings are summarized in Table 1. We observe that the models with heavily compressed embedding layers can perform equally well or even better than the full uncompressed models. This suggests that learning individual independent embeddings for each particular word is superfluous, as the expressive power of the LSTM is sufficient to make use of these intertwined, yet more compact embeddings. Moreover, the slightly better test accuracy of the compressed models in certain cases (e.g., for the rather small SST dataset) suggests that imposing a specific tensorial low-rank structure on the embedding matrix can be viewed as a special form of regularization, thus potentially improving model generalization.

Neural machine translation

Our results are summarized in Table 2. We observe that even in this rather challenging task, both embedding and softmax layers can be compressed significantly, at the cost of a small drop in the BLEU score. However, as the compression factor increases, the performance deteriorates rapidly. Compared to sentiment analysis, NMT is a much more complex task which benefits more from additional capacity (in the form of a more powerful RNN or more Transformer blocks) than from regularization (Bahdanau et al., 2014; Vaswani et al., 2017; Wu et al., 2019), which may explain why we did not manage to improve the model by regularizing its embedding layers with TT-embedding.
Compared to the low-rank factorization of the embedding layer, the BLEU score of the Transformer with TT-embedding is higher and degrades much more slowly as the TT-rank decreases. We hypothesize that this is due to the TT-embedding remaining full-rank while compressed, in line with the softmax bottleneck analysis in Section 3.

TT-embeddings induce an 8% training iteration time overhead compared to the baseline Transformer-big, due to our current implementation heavily relying on the slow torch.einsum function, while the standard embedding and softmax layers make use of fast and highly optimized Tensor Cores for mixed-precision training. We expect a dedicated CUDA kernel to be much more efficient.

Language modeling
We took Transformer-XL (Dai et al., 2019), an open-source state-of-the-art language modeling architecture at the time of this writing, and replaced its embedding and softmax layers with TT-factorizations. Then, we tested different model configurations on WikiText-103 (Merity et al., 2016) and report the results in Table 3. For the full list of hyperparameters, see Appendix D.
Compared to sentiment analysis and NMT, we were not able to achieve such high compression ratios for the embedding and softmax layers in LM. However, in our case of an extremely large vocabulary (≈ 270000 words), even a moderate 3.8× compression allowed us to save 100M weights at the cost of a ∼1.5 perplexity drop. Note that TT-embeddings also outperform the low-rank factorization, achieving a better trade-off between compression and performance.

Click Through Rate prediction
Among other applications of the TT-embedding layer, we chose to focus on CTR prediction, a popular task in digital advertising (He et al., 2014). We consider the open dataset provided by Criteo for the Kaggle Display Advertising Challenge (Criteo Labs, 2014), which consists of 39 categorical features and 45.8M samples and is binary-labeled according to whether the user clicked on the given advertisement. Unique values of the categorical features are bijectively mapped into integers. To reduce the memory footprint, if the size of the corresponding vocabulary is immense (e.g., the cardinality of some features in this dataset is of order 10^6), these integers are further hashed by taking the modulus with respect to some fixed number, such as 10^5. However, due to the strong compression properties of TT-embeddings, this is not necessary for our approach, and we consider both the full and the hashed datasets in our experiments.
CTR with the baseline algorithm The task at hand can be treated as a binary classification problem. As a baseline algorithm, we consider a neural network with the following architecture. First, each of the categorical features is passed through a separate embedding layer with embedding size J. After that, the embedded features are concatenated and passed through 4 fully-connected layers of 1024 neurons each with ReLU activation functions. In all experiments, we used the Adam optimizer with a learning rate of 0.0005. Since many input features have a large number of unique values (e.g., 10131227), and storing the corresponding embedding matrices would be costly, we employ the hashing procedure mentioned earlier.
CTR with TT-embeddings We substitute the embedding layers with TT-embedding layers. Beyond that, we leave the overall structure of the neural network unchanged, with the same parameters as in the baseline approach. Table 4 presents the results of these experiments.

Discussion and future work
We propose a novel embedding layer, the TT-embedding, for compressing huge lookup tables used for encoding categorical features of significant cardinality, such as the index of a token in natural language processing tasks. The proposed approach, based on the TT-decomposition, experimentally proved to be effective, as it heavily decreases the number of training parameters at the cost of a small deterioration in performance. In addition, our method can be easily integrated into any deep learning framework and trained via backpropagation, while capitalizing on reduced memory requirements and an increased training batch size.

Our experimental results suggest several appealing directions for future work. First of all, TT-embeddings impose a concrete tensorial low-rank structure on the embedding matrix, which was shown to improve the generalization ability of the networks by acting as a regularizer. The properties and conditions of applicability of this regularizer are a subject for more rigorous analysis. Secondly, unlike the standard embedding, we can introduce nonlinearity into the TT-cores to improve their expressive power (Khrulkov et al., 2019). Additionally, it is important to understand how the order of tokens in the vocabulary affects the properties of networks with TT-embedding. We hypothesize that there exists an optimal order of tokens which better exploits the particular structure of the TT-embedding and leads to a boost in performance and/or compression ratio. Finally, the idea of applying higher-order tensor decompositions to reduce the number of parameters in neural nets is complementary to more traditional methods such as pruning (Han et al., 2015) and quantization (Hubara et al., 2017; Xu et al., 2018). Thus, it would be interesting to make a thorough comparison of all these methods and investigate whether their combination may lead to even stronger compression.

A Multiindex construction
Algorithm 1 The algorithm implementing the bijection I(i) as described in Section 4.2.
Require: I - vocabulary size, {I_k}_{k=1}^{N} - an arbitrary factorization of I, i - index of the target word in the vocabulary.
Returns: (i_1, . . . , i_N) - the multi-index of the word.

Algorithm 2 The algorithm implementing the bijection (i_1, . . . , i_N) → i, inverse to I(i).
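The two bijections can be sketched in plain Python as a mixed-radix conversion; the function names below are ours, not from the paper's implementation.

```python
# Conversion between a flat row index i and the multi-index (i_1, ..., i_N),
# for an arbitrary factorization {I_k} of the vocabulary size.
def to_multi_index(i, factors):
    """i -> (i_1, ..., i_N), with i_1 the most significant digit."""
    idx = []
    for f in reversed(factors):
        i, r = divmod(i, f)
        idx.append(r)
    return tuple(reversed(idx))

def from_multi_index(multi, factors):
    """(i_1, ..., i_N) -> i, the inverse bijection."""
    i = 0
    for d, f in zip(multi, factors):
        i = i * f + d
    return i

factors = (40, 25, 25)                   # 40 * 25 * 25 = 25000
print(to_multi_index(17041, factors))    # (27, 6, 16)
print(from_multi_index((27, 6, 16), factors))  # 17041
```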
B Proof of Theorem 1

Recall that for fixed I = ∏_{k=1}^{N} I_k, J = ∏_{k=1}^{N} J_k, and a set of ranks R = (R_1, R_2, . . . , R_{N−1}), we defined M_R, the set of all tensors represented in the TT-matrix format such that for any X ∈ M_R we have TT-rank(X) ≤ R, entry-wise. Let X denote the ordinary matrix of size I × J obtained from the TT-matrix X by the inverse of the procedure described in Section 4.2 (application of formulas from Section 4.1, followed by transposing and reshaping). Our analysis is based on the fact that M_R forms an irreducible algebraic set (Buczyńska et al., 2015; Hartshorne, 2013). Concretely, we will use the fact that for an irreducible algebraic set A, any algebraic subset B either has measure zero or coincides with A. We start with a simple lemma.

Lemma 1. The set B = {X ∈ M_R : rank X < min(I, J)} is an algebraic subset of M_R.

Proof. We need to show that B is cut out by polynomial equations on M_R. This readily follows from the facts that mat(·) is a linear mapping, and that an upper bound on matrix rank can be specified by requiring all minors of a specific size to vanish (which is a polynomial constraint).
We now show that B is in fact a proper subset of M_R, i.e., B ⊊ M_R.
Lemma 2. For any M R there exists X ∈ M R with rank X = min(I, J).
Proof. We provide a concrete example of such a tensor. Define the collection of TT-cores G^(k)(r_{k−1}, (i_k, j_k), r_k) = δ_{i_k j_k} δ_{r_{k−1} 1} δ_{r_k 1}, with δ_{ij} denoting the Kronecker delta symbol. It is easy to verify that the matrix X of the tensor X specified by this collection of cores takes a very simple form: X[i, j] = δ_{ij}, which clearly is of maximal rank.
Using Lemmas 1 and 2 and based on previous discussion on properties of algebraic sets we conclude that the following theorem holds.
Theorem 1. For all X ∈ M R besides a set of measure zero rank X = min(I, J), where the ordinary matrix rank is assumed.

C Tensor Ring Embedding
Tensor Ring (TR) decomposition is a generalization of the TT-decomposition in which the first and the last cores are also 3-dimensional tensors, corresponding to R_0 = R_N > 1. Formally, a tensor X is said to be represented in the TR format (Zhao et al., 2016) if each of its elements can be computed as

X(i_1, . . . , i_N) = Σ_{r_0, . . . , r_{N−1}} G^(1)(r_0, i_1, r_1) G^(2)(r_1, i_2, r_2) ⋯ G^(N)(r_{N−1}, i_N, r_0).

Similar to TT, we can define the TR-matrix (see Figure 3) and the corresponding TR-embedding layer.

Figure 3: Construction of the TR-matrix from the standard embedding matrix. Blue color depicts how a single element in the initial matrix is transformed into the product of the highlighted matrices. In contrast to the TT-embedding, the matrix trace operator is applied to the final matrix, resulting in a scalar (highlighted element).
While our results (Table 5 and Table 6) suggest that the TT-embedding shows a better compression-performance trade-off than its TR counterpart, much more experimentation is needed to properly compare the two approaches (for example, we see that TR is a promising direction for future work, as it outperforms TT on the SST-2 benchmark). However, such analysis is computationally heavy and goes beyond the scope of this paper.

D Hyperparameters

Table 7 and Table 8 contain the full lists of hyperparameters we used for training the Transformer models for neural machine translation and language modeling, respectively.