Transformation of Dense and Sparse Text Representations

Sparsity is regarded as a desirable property of representations, especially in terms of explanation. However, its usage has been limited due to the gap with dense representations. Most recent progress in NLP research is based on dense representations, so the desirable properties of sparsity cannot be leveraged. Inspired by the Fourier Transformation, in this paper we propose a novel Semantic Transformation method to bridge the dense and sparse spaces, which can help NLP research shift from dense spaces to sparse spaces or use both spaces jointly. Experiments on text classification tasks and a natural language inference task show that the proposed Semantic Transformation is effective.


Introduction
Many studies have shown that sparsity is a desirable property of representations, especially in terms of explanation (Fyshe et al., 2014; Faruqui and Dyer, 2015). In this sense, sparse representation may hold the key to solving the explainability problem of deep neural networks. Apart from interpretability, sparse representation can also improve the usability of word vectors as features (Guo et al., 2014; Chang et al., 2018). Several tasks have benefited from sparse representations.
There are two key limitations in existing studies of sparse representations. First, no study has connected dense and sparse spaces well, which leaves the two types of representations relatively independent and unable to reinforce each other. Second, limited work has been done on generating representations of sentences or phrases in the sparse space from sparse word embeddings.
Inspired by the Fourier Transformation, this paper proposes a novel method called Semantic Transformation (ST) to address these problems, as shown in Figure 1. With the help of ST, dense and sparse spaces are connected and no longer isolated. The proposed transformation consists of two key components, namely Semantic Forward Transformation (SFT) and Semantic Backward Transformation (SBT) (see Section 2). SFT transforms a dense representation into a sparse representation. That is, we can transform any learned dense features into sparse representations and endow the model with the properties that sparsity possesses. Sparse representations can also be transformed back to dense representations through SBT; before doing so, we can perform different operations in the sparse space to achieve different goals.
Another key innovation of this paper is a new approach for achieving sparseness. Conventionally, penalties are used to achieve sparseness (Sun et al., 2016; Ng et al., 2011; Subramanian et al., 2018). However, they suffer from initialization sensitivity and uncontrollable optimization. In this paper, we propose to achieve sparseness through a novel activation function, which provides an effective solution (see Section 2.1). Experimental results show that the proposed activation function works very well.
In this paper, we also explore a combination method that combines word representations into sentence representations directly in the sparse space. Additionally, the proposed transformations and combination method can be parallelized to enable efficient computation.
In summary, this paper makes the following contributions:
• It proposes a novel semantic transformation method which effectively connects dense and sparse spaces.
• It proposes to use a new activation function to achieve sparseness, which, to the best of our knowledge, has not been used before. The function works very well.
• It proposes a combination method that can encode sentences in the sparse space directly.
• The proposed methods have been evaluated using text classification and natural language inference tasks with promising results. Since the proposed transformations avoid large scale matrix multiplications in the combination procedure, they are also efficient.

Semantic Transformation
In this section, we first briefly describe the composition of Semantic Transformation (ST), and then elaborate on each component. The proposed ST has three operations:
1. SFT (Semantic Forward Transformation). It takes a dense representation as input and transforms it into a higher dimensional sparse space.
2. SBT (Semantic Backward Transformation). It is the inverse of SFT, transforming representations from the sparse space back to the dense space.
3. SCSS (Semantic Combination in the Sparse Space). It computes the sentence representation using its component word representations in the sparse space.

Semantic Forward Transformation
SFT aims to discover the latent semantic aspects in a dense representation of word x and put them in a higher dimensional sparse representation y.
We assume $M$ is the number of latent semantic aspects, and each latent semantic aspect is represented by a vector, i.e., $b_m \in \mathbb{R}^d$ for the $m$-th base. We define all the latent semantic aspects as the bases of semantemes in the real world, denoted by $B = \{b_1, \ldots, b_M\} \in \mathbb{R}^{d \times M}$. Given $B$, the function of SFT is to estimate the semantic distribution of the given dense representation over $B$.
The reasons for allowing both positive and negative values in a sparse representation are that 1) negative values can represent "negative semantemes"; and 2) we can eliminate some meanings (positive values) through simple operations between words, i.e., addition. Note that a negative value representing the "negative semantics" of a given aspect does not mean that two words with opposite meanings have exactly corresponding positive and negative sparse representations. In the sparse space, we use the composition of semantemes to denote word meanings. This is in line with the way humans use words; e.g., the meaning of "not bad" can be obtained by adding the sparse representation of "not" to the sparse representation of "bad". In this sense, the meaning of "not bad" is a composition of several semantemes.

Formulation of SFT: We adopt a multilayer perceptron (MLP) 4 integrated with the base $B$ to build an SFT. We first use an MLP $f(\cdot)$ to learn deep features of the dense representation $x$, and then use the features to compute the sparse distribution over the semantic bases. In the $i$-th layer of $f(\cdot)$, $\sigma$ is the activation function and $w_i$ is the parameter of the $i$-th layer, denoted by $\mu$; $p_{i-1}$ is the output of the $(i-1)$-th layer and $p_0 = x$. We denote the output of the last layer of $f(\cdot)$ as $p$ and then integrate it with $B$ to compute the distribution over the semantic bases, where $w_f$ is a trainable parameter and $S(\cdot)$ is a specially designed activation function used to control the sparseness of the semantic distribution (discussed below). To sum up, SFT maps $x$ to $y$ by applying $f(\cdot)$, integrating the result with $B$ through $w_f$, and passing it through $S(\cdot)$.
Sparse Activation: Sparsity is enforced through penalties in most existing studies, such as the $\ell_1$ regularizer (Sun et al., 2016), the average sparsity penalty (Ng et al., 2011), and the partial sparsity penalty (Subramanian et al., 2018). We call these penalty enforcing methods; they push the sparse representation close to either 0 or 1. However, such penalties suffer from the initialization sensitivity problem, as the penalties contain an initial interface which significantly influences the distribution of the learned sparse representation. To overcome the problem, we propose to use an activation function instead. The activation curve of the proposed function $S(\cdot)$ is shown in Figure 2(a). $S(\cdot)$ has two hyper-parameters, $\beta$ and $\gamma$, which control the sparsity of the output. We set $\beta = 1$ and $\gamma = 2$ in our experiments.
4 Our approach is not limited to using a multilayer perceptron (MLP). Other techniques, e.g., CNNs, may also be used.
Clearly, from Figure 2(a), we can see that a large range of inputs of $S(\cdot)$ is mapped to 0, while the positions around $\pm\gamma/\beta$ get high responses. Integrated with an objective loss function (which depends on the specific task, e.g., cross entropy for classification), SFT learns to give the relevant aspects/semantemes predictions around $\pm\gamma/\beta$. With this activation function, we can learn sparse representations through the original objective function, without relying on enforced penalties. The experimental results show that this activation function works very well on many datasets.
$S(\cdot)$ is non-linear and differentiable. Its derivative is expressed using the sign function $\mathrm{Sign}(\cdot)$, with $\mathrm{Sign}(0) = 0$, and is easy to compute.
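To make the described shape concrete, the following minimal sketch implements an illustrative stand-in for $S(\cdot)$. It is not the formula used in this paper: the function name `sparse_activation` and the specific odd "bump" form are assumptions chosen only to reproduce the qualitative behaviour stated above, namely that inputs near 0 and inputs of large magnitude are mapped close to 0, while inputs around $\pm\gamma/\beta$ get high responses.

```python
import math


def sparse_activation(x: float, beta: float = 1.0, gamma: float = 2.0) -> float:
    """Illustrative stand-in for S(.), NOT the paper's formula.

    An odd bump function: close to 0 for inputs near 0 and for inputs of
    large magnitude, and equal to +1 / -1 at x = +gamma/beta / -gamma/beta.
    """
    scaled = beta * x / gamma  # equals +1 / -1 at x = +gamma/beta / -gamma/beta
    return scaled * math.exp(-((beta * abs(x) - gamma) ** 2))


def sparsify(values, beta: float = 1.0, gamma: float = 2.0):
    """Apply the stand-in activation element-wise to a list of pre-activations."""
    return [sparse_activation(v, beta, gamma) for v in values]
```

With the settings $\beta = 1$ and $\gamma = 2$, this stand-in returns exactly 1 at $x = 2$ and $-1$ at $x = -2$, and values close to 0 for small or very large $|x|$, so most components of an output vector are near zero without any explicit sparsity penalty.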

Semantic Backward Transformation
SBT is the inverse transformation of SFT, which transforms a sparse representation back to a dense representation. A straightforward way to achieve SBT is to use the sparse representation to compute a weighted sum over the base $B$. To increase the fitting ability of SBT, similar to SFT, we adopt an MLP $F(\cdot)$ to learn a deep dense representation.
In the formulation of SBT, tanh is the Tanh activation function and $w_b$ is a trainable parameter; $F(\cdot)$ is an MLP with its own trainable parameters.
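The "straightforward" variant of SBT described above, a weighted sum over the bases, can be sketched as follows. This is a simplified illustration under stated assumptions: $w_b$ is treated as a scalar, the MLP $F(\cdot)$ is omitted, and the function names are hypothetical.

```python
import math


def weighted_sum_over_bases(bases, y):
    """Return sum_m y_m * b_m for M base vectors of dimension d.

    bases: list of M lists, each of length d (the bases b_1..b_M).
    y:     list of M sparse coefficients.
    """
    d = len(bases[0])
    dense = [0.0] * d
    for b_m, y_m in zip(bases, y):
        if y_m == 0.0:
            continue  # zero coefficients contribute nothing (the sparse case)
        for k in range(d):
            dense[k] += y_m * b_m[k]
    return dense


def backward_sketch(bases, y, w_b: float = 1.0):
    """Assumed simplification of SBT: tanh of the scaled weighted sum.

    The paper additionally applies an MLP F(.); it is left out here.
    """
    return [math.tanh(w_b * v) for v in weighted_sum_over_bases(bases, y)]
```

Because only the non-zero components of the sparse representation contribute to the sum, the cost of this step grows with the number of active semantemes rather than with $M$.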

Semantic Combination in Sparse Space
This section proposes a Semantic Elimination (SE) method to perform semantic combination in the sparse space. The main idea of SE is to use the negative values in the representation of one word to eliminate another word's semantics. This is also one of the reasons for defining negative values in the sparse representation. In this scenario, the sparse representation has two functions: (1) using positive values (positive semantemes) to denote which semantic meanings a word has and (2) using negative values (negative semantemes) to eliminate the semantemes that should not be present in the word. Below, we detail SE.
Because a word's semantemes usually change with the nearby words, or just the preceding word in a sentence, we propose, given a sentence, to use the $i$-th word's negative values to eliminate the $(i+1)$-th word's positive values (semantemes). We call this elimination method Preceding Elimination (PE). PE must be followed by a nonlinear activation function so that the overall operation is not linear. Note that the activation function must pass through the origin (0, 0) in order to preserve the balance of positive and negative values. We therefore designed a dedicated activation function, described next. After PE, we add the sparse representations of all words in the sentence together to obtain the final sparse representation of the sentence.
We designed an activation function, called 'leaky' (its curve is shown in Figure 2(b)), to (1) decrease the small values of a sparse representation in order to prevent the system from producing new semantemes that should not exist; and (2) make SE sensitive to word order, since the non-linearity of the activation function makes the overall SE operation, which mixes linear and non-linear operations, non-commutative. Note that 'leaky' is applied to the sparse representations of words after preceding elimination. In the formulation of SE, $s_t$ is the sparse representation of the sub-sentence from position 1 to position $t$ produced by semantic combination in the sparse space. Accordingly, $s_T$ denotes the sparse representation of a sentence of length $T$.
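The combination procedure described above can be sketched as follows. Several details are assumptions made for illustration, not the paper's formulation: elimination is implemented as adding the preceding word's negative value to a positive component and clamping at zero, and `leaky_stand_in` is a hypothetical odd function that passes through the origin and shrinks small values, standing in for the 'leaky' activation of Figure 2(b).

```python
def leaky_stand_in(x: float) -> float:
    """Hypothetical stand-in for 'leaky': odd, passes through (0, 0),
    and shrinks values with magnitude below 1."""
    return x * abs(x) if abs(x) < 1.0 else x


def preceding_elimination(current, previous):
    """Use the previous word's negative values to cancel the current
    word's positive values, component by component (assumed form)."""
    result = []
    for cur, prev in zip(current, previous):
        if cur > 0.0 and prev < 0.0:
            result.append(max(cur + prev, 0.0))  # cancel, but not below zero
        else:
            result.append(cur)
    return result


def combine_sentence(word_reps):
    """Accumulate s_t over the words of a sentence; returns s_T."""
    s = [0.0] * len(word_reps[0])
    previous = None
    for y in word_reps:
        eliminated = y if previous is None else preceding_elimination(y, previous)
        s = [acc + leaky_stand_in(v) for acc, v in zip(s, eliminated)]
        previous = y
    return s
```

For two toy word vectors `a = [-1.0, 0.5]` and `b = [1.0, 0.5]`, `combine_sentence([a, b])` gives `[-1.0, 0.5]` while `combine_sentence([b, a])` gives `[0.0, 0.5]`, which illustrates that the result depends on word order.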

Objective Function
Overall, given a batch of data $D$, our model is trained to minimize an objective function (Eq. 8) that combines four terms. PL(D) denotes the prediction loss over the dataset, which depends on the task that the model is applied to; ML(D) denotes the margin loss, which enlarges the margin of distances between sparse representations with different meanings; BL(D) is a regularization term used to constrain the norm of the bases; RL_o(D) denotes the reconstruction loss, which is used for model simulation (see below) and is therefore optional. Note that when applying our method only PL(D) is necessary; ML(D) and BL(D) can be used to improve the model's performance. Next, we discuss these loss functions.
Prediction Loss (PL): PL(D) is the training loss of the application task. For example, in our case, this loss is the cross entropy for supervised classification.
Margin Loss (ML): ML(D) is designed to enlarge the margin of distances between sparse representations with different meanings. We need ML to help training because we found that the margin of the sparse representations learned by optimizing PL alone is not clear or significant enough to separate positive and negative semantemes, which is undesirable for explanation. We therefore explore a new method for learning clear sparse representations, called Margin Loss, which pushes sparse representations with different meanings far from each other.
In the classification scenario, we use the class labels as supervision to group the samples in a batch by class, and then average the sparse representations of the instances in each class to represent the class. Formally, let $y_{c_i}$ be the averaged representation of the $i$-th class. ML(D) is then defined in terms of the cosine similarity 7 between class representations, where $Y_c = \{y_{c_1}, \ldots, y_{c_N}\}$ and $N$ is the number of classes. $\odot$ denotes the Hadamard product, and $W \in \mathbb{R}^{N \times N}$ is a hyper-parameter used to control the updating direction and degree. $W_{ij}$ is set to $-1$ if $i = j$, and $1$ otherwise. This ensures a large margin between different classes by minimizing their inner product. Note that in some scenarios, especially sentiment classification, the distance between different classes belonging to the same positive (or negative) sentiment (e.g., strong and weak positive/negative classes) should not be enlarged much.
In this case, we develop an exponential decay function to set $W$ intuitively, where $\tau$ is the half-life, which we set to $(N - 1)/2$.
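The basic form of the margin loss described above (with $W_{ij} = -1$ on the diagonal and $1$ elsewhere) can be sketched as follows. The aggregation used here, a plain sum over all entries of $W \odot$ (pairwise cosine similarities), is an assumption made for illustration; the exponential-decay variant of $W$ is not shown, and the function names are hypothetical.

```python
import math


def class_means(sparse_reps, labels, num_classes):
    """Average the sparse representations of the batch instances in each class."""
    dim = len(sparse_reps[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for rep, label in zip(sparse_reps, labels):
        counts[label] += 1
        for k, value in enumerate(rep):
            sums[label][k] += value
    # max(count, 1) leaves an all-zero mean for a class absent from the batch.
    return [[v / max(counts[c], 1) for v in sums[c]] for c in range(num_classes)]


def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def margin_loss(means):
    """Assumed aggregation: sum over i, j of W_ij * cos(y_ci, y_cj)."""
    n = len(means)
    total = 0.0
    for i in range(n):
        for j in range(n):
            w_ij = -1.0 if i == j else 1.0
            total += w_ij * cosine(means[i], means[j])
    return total
```

Under this sketch, two classes with orthogonal mean representations give a loss of -2, whereas two classes with identical means give 0, so minimizing the loss pushes the class representations apart.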
Base Regularization (BL): Recall that in the proposed semantic forward transformation, the base collection $B$ is the key to obtaining the semantic distribution (semantic representation) of the given dense representation. Clearly, this is a projection procedure. We argue that a larger projection does not ensure a better prediction, because representations with a large norm usually get a large projection, a point that conventional prediction methods ignore. The proposed Sparse Activation eliminates this problem by giving large projections small responses. Similarly, inconsistent lengths of the bases in $B$ cause different output (response) priors. To tackle this problem, we propose a base regularization that constrains the length of each base $b_m$, the $m$-th base in $B$, to be equal to 1.
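One common way to express such a unit-length constraint is a squared penalty on the deviation of each base's Euclidean norm from 1. The sketch below uses that form as an assumption; the exact penalty used in this paper may differ, and the function name is hypothetical.

```python
import math


def base_regularization(bases):
    """Sum over bases of (||b_m||_2 - 1)^2 (assumed form of the penalty).

    bases: list of M base vectors, each a list of d floats.
    """
    total = 0.0
    for b_m in bases:
        norm = math.sqrt(sum(v * v for v in b_m))
        total += (norm - 1.0) ** 2
    return total
```

The penalty is 0 when every base has unit length and grows as any base becomes longer or shorter, which removes the length-dependent response prior discussed above.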
7 Note that $y_{c_i}$ is not a sparse representation, as it is the average of many sparse representations. Cosine similarity is appropriate for Margin Loss.
Reconstruction Loss (RL_o): The proposed ST can easily transform between dense and sparse spaces, and learn sentence representations in the sparse space. ST can therefore provide a sentence with both dense and sparse representations. One question that may be asked is whether the dense representations produced by ST through backward transformation can be used in place of dense representations learned directly by models in the dense space, e.g., LSTM. To answer it, we propose a reconstruction loss that minimizes the construction error between the outputs of ST and LSTM. Another purpose of RL_o(D) is to keep the meanings of the same word or sentence/phrase consistent across the different spaces. We denote the representations of a sentence and its phrases produced by LSTM as $X$, and $T$ is the length of the sentence. $x$, $y$, and $s_i$ have the same meanings as defined before, and $X'_i$ denotes the dense representation constructed from the sparse representation $s_i$. This loss helps transform representations from one space to another while keeping the semantic information consistent. The last term of the loss helps learn representations similar to those of LSTM.

Experiments
We evaluate the proposed method on one natural language inference dataset and four text classification datasets. These tasks act as good quality checks for the learned representations. The code is implemented with PyTorch and is available online. The five datasets are SNLI, MR, SST1, SST2 and TREC; detailed training/dev/test splits are shown in Table 2:
• SNLI (Bowman et al., 2015): a collection of human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. This is the natural language inference dataset, which is also solved via classification.
• MR v1.0: Movie reviews with one sentence
• SST1: an extension of MR but with fine-grained labels: very positive, positive, neutral, negative, very negative.
• SST2: same as SST1 but with neutral reviews removed and only using positive and negative labels.
• TREC: question samples, where each question is classified into one of 6 question types: about person, location, numeric information, etc.
Table 1: Average accuracy over all tasks. Y and X' are the representations used for making predictions (Y is the sparse representation; X' is the backward transformation of Y). Helper loss refers to ML or BL. Note that only the experiments using X' as the representation for prediction have RL_o. RL_o is not used when using Y as the prediction feature.
For our model, we adopt an MLP with 1 hidden layer (300 units) for the forward transformation and an MLP with 2 hidden layers (300 units) for the backward transformation. We set the length of the semantic base to 1000.
Training details: We adopt uniform settings for all baselines and our model:
1) Adam optimizer for parameter updating with a learning rate of 1e-4; trainable embeddings of size 300.
2) An MLP with 1 hidden layer as the classifier. For a fair comparison, the hidden unit size is set to 300 for LSTM, CNN, Transformer and Capsule. For our model, it is set to 64 when we use the sparse representation for prediction and remains 300 when we use the back-transformed representations as the prediction features.
3) SNLI is the task of identifying the relationship between two given sentences. For each model, we first use it to encode the two sentences separately, and then concatenate the two sentence representations for the final prediction.
4) We report the average accuracy over 10 runs of the experiment on the test data. For each run, the maximum accuracy before early stopping is taken as the result of that run.

Results and Analysis
Table 1 shows the prediction accuracy of our model and the baselines. Table 3 gives the prediction run time. From Tables 1 and 3, we can make the following observations:
• The proposed Semantic Transformation (ST) approach significantly outperforms LSTM on three datasets (SST1, SST2 and MR) and achieves results comparable to LSTM on SNLI and TREC. ST also markedly outperforms Transformer and Capsule on all five datasets, and outperforms CNN on four out of five datasets. We therefore conclude that ST is an effective method for learning sentence representations in both dense and sparse spaces.
) performs much worse than the proposed sparse activation method, which indicates the effectiveness of the proposed method. ST ‡ (including ST ‡ [X'] and ST ‡ [Y]) shows that the proposed sparse activation plays an important role in our system and is very effective. The analysis below shows that the proposed sparse activation method ensures good sparseness of the representation. The relatively worse results of ST ‡ (including ST ‡ [X'] and ST ‡ [Y]) also confirm the effectiveness of the helper losses.
• In terms of efficiency, Table 3 shows that ST is 2-3 times faster than LSTM. ST is also markedly faster than Capsule and Transformer on all datasets. CNN is known as the fastest model, and our method achieves speeds comparable to CNN.
In summary, considering that our work is only a first attempt, it performs quite well compared with the highly researched and optimized LSTM, CNN, Capsule and Transformer models. We expect that future work can further optimize our method.
Sparsity Analysis: Figure 3 shows the sparsity of the word sparse representations on the 5 datasets. Sparsity is evaluated using a Sparse Evaluation (SE) function. We propose this measure because previous methods were not designed for sparse representations with both positive and negative values. The measure is based on the function $(\sin(\pi y))^2$, which on the interval $[-1, 1]$ has only three minimum points, $-1$, $0$, and $1$, and is therefore suitable for measuring how concentrated the components of sparse representations are. Figure 3 shows a clear decline of SPLoss, which indicates a high degree of concentration. Table 4 also gives statistics about the distributions of the sparse representations. We can see that 'zero' (V < 0.05) takes up a large portion of the sparse representations, which is desirable. We can conclude that the learned sparse representations are indeed sparse.
Accuracy of Transformation: When we introduced RL_o, we asked whether ST can construct the outputs of LSTM. Here, we analyze the transformation accuracy of the proposed method and give a positive answer to that question. From Table 1, we can see that ST[X'] achieves results very similar to those of LSTM, i.e., the dense representation generated by ST through backward transformation can achieve results very similar to those of LSTM. Further, we propose a measure, named Construction Accuracy Metric (CAM), to evaluate the accuracy of the transformation (results are shown in Figure 4). In CAM, $X_{ij}$ is the original dense representation of a sub-sentence (generated by LSTM) and $X'_{ij}$ is the backward transformation result of its sparse representation; $C$ denotes the test set, and $J$ is the length of the sentence. This function measures how far $X'$ is from $X$: CAM rises as the distance between $X$ and $X'$ increases. Figure 4 shows that the difference between $X$ and $X'$ is only about 5%. Therefore, we can conclude that our model can construct the outputs of LSTM well.
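The idea behind the sparsity measure in the Sparsity Analysis above can be illustrated with a short sketch. It assumes that the measure averages $(\sin(\pi y))^2$ over the components of a representation; the exact normalization used in this paper is not reproduced here. The helper `zero_fraction` mirrors the 'zero' statistic of Table 4 under the assumption that the threshold applies to the absolute value. Both function names are hypothetical.

```python
import math


def concentration_score(y):
    """Mean of sin^2(pi * y_k) over the components of y (assumed aggregation).

    The score is near 0 when every component is close to -1, 0, or 1,
    and reaches 1 when every component sits at +0.5 or -0.5.
    """
    return sum(math.sin(math.pi * v) ** 2 for v in y) / len(y)


def zero_fraction(y, threshold: float = 0.05):
    """Fraction of components whose absolute value is below the threshold."""
    return sum(1 for v in y if abs(v) < threshold) / len(y)
```

A representation such as `[0.0, 1.0, -1.0, 0.0]` scores close to 0, while `[0.5, -0.5]` scores 1, so a declining score during training indicates that components are concentrating around $-1$, $0$, and $1$.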
Interpretability Analysis: Interpretability is one of the most desirable properties of sparse representations. Figure 5 shows the average sparse representation of five classes (computed on the test set of SST1) with different sentiment polarities (-2, -1, 0, 1, 2). Positive numbers refer to positive sentiment, and negative numbers refer to negative sentiment. In order to clearly visualize the differences in the learned representations over the five classes, we sort the bases in ascending order of the sparse representation values of the +2 (very positive) class.
From Figure 5, we can see that there is a clear color difference between sentiment polarity class +2 and class -2. We can also see a similar phenomenon for class +1 and class -1, but it is less pronounced as their polarities are more similar. These observations demonstrate that the same bases obtain opposite values for classes of opposite sentiments. The bases that generate distinct responses for classes with different sentiment polarities can be regarded as primary sentiment bases, as they clearly indicate the semantic differences between the classes. In other words, the primary sentiment bases can be interpreted as sentiment bases. For example, the bases that give positive responses to positive classes but negative responses to negative classes are the positive sentiment bases, which directly indicate the sentiment polarities.
Compared with the positive and negative classes, the neutral class shows relatively mixed responses. That means the neutral class has semantemes similar to those of both positive and negative classes, which suggests that the neutral class is more difficult to identify.

Related Work
Several sparse models have been proposed to produce sparse embeddings. For example, some previous works trained word embeddings with sparse or non-negative constraints (Murphy et al., 2012; Luo et al., 2015). Using linguistically inspired dimensions (Faruqui et al., 2015) is another way to increase sparsity and interpretability. SPINE (SParse Interpretable Neural Embeddings) (Subramanian et al., 2018), a variant of the denoising k-sparse autoencoder, can generate efficient and interpretable distributed word representations. Our method is different from these approaches. We not only construct sparse representations but also transform between dense and sparse spaces. We also combine word sparse representations to produce sentence representations. Some recent studies tried to achieve sparsity in novel ways (Park et al., 2017). We also proposed a novel method in this paper and experimentally verified its effectiveness.

Conclusion and Future Works
This paper proposed a novel method to transform representations between dense and sparse spaces, and a technique to combine semantics in the sparse space. It also proposed and experimentally verified a new activation function that can be used to achieve sparseness. Natural language inference and text classification tasks were used to evaluate the proposed transformations with promising results. Based on this study, many other interesting directions can be pursued in the future, e.g.:
(1) As discussed in the paper, the proposed method can construct the output of LSTM well. One direction for future work is to apply ST to language modeling. The results could then be used in many downstream tasks such as machine translation and dialogue systems.
(2) With the help of ST, we can investigate style transfer on similar tasks in the sparse space by direct semantic reversing. We can also use ST to filter out noise or undesirable information.
(3) Based on sparse representations, we can also explore semantic pattern recognition and transformation.

Figure 1: Transformations between dense and sparse spaces. SFT and SBT denote the forward and backward transformations, respectively.

Figure 3: Sparsity evaluation of sparse word representations (the legend is explained below)

Figure 4: Evaluation of the construction of X.

Table 2: Summary statistics for the datasets after tokenization. c denotes the number of target classes.

Table 3: Average running time over all test sets (minutes)