Improving Disentangled Text Representation Learning with Information-Theoretic Guidance

Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.


Introduction
Disentangled representation learning (DRL), which maps different aspects of data into distinct and independent low-dimensional latent vector spaces, has attracted considerable attention for making deep learning models more interpretable. Through a series of operations such as selecting, combining, and switching, the learned disentangled representations can be utilized for downstream tasks, such as domain adaptation, style transfer (Lee et al., 2018), conditional generation (Denton et al., 2017; Burgess et al., 2018), and few-shot learning (Kumar Verma et al., 2018). Although widely used in various domains, such as images (Tran et al., 2017; Lee et al., 2018), videos (Yingzhen and Mandt, 2018; Hsieh et al., 2018), and speech (Chou et al., 2018; Zhou et al., 2019), many challenges in DRL have received limited exploration in natural language processing (John et al., 2019).
* This work was conducted while the first author was doing an internship at NEC Labs America.
To disentangle various attributes of text, two distinct types of embeddings are typically considered: the style embedding and the content embedding (John et al., 2019). The content embedding is designed to encapsulate the semantic meaning of a sentence. In contrast, the style embedding should represent desired attributes, such as the sentiment of a review, or the personality associated with a post. Ideally, a disentangled-text-representation model should learn representative embeddings for both style and content.
To accomplish this, several strategies have been introduced. Shen et al. (2017) proposed to learn a semantically-meaningful content embedding space by matching the content embedding from two different style domains. However, their method requires predefined style domains, and thus cannot automatically infer style information from unlabeled text. Hu et al. (2017) and Lample et al. (2019) utilized one-hot vectors as style-related features (instead of inferring the style embeddings from the original data). These models are not applicable when new data comes from an unseen style class. John et al. (2019) proposed an encoder-decoder model in combination with an adversarial training objective to infer both style and content embeddings from the original data. However, their adversarial training framework requires manually-processed supervised information for content embeddings (e.g., reconstructing sentences with manually-chosen sentiment-related words removed). Further, there is no theoretical guarantee for the quality of disentanglement.
In this paper, we introduce a novel Information-theoretic Disentangled Embedding Learning method (IDEL) for text, based on guidance from information theory. Inspired by Variation of Information (VI), we introduce a novel information-theoretic objective to measure how well the learned representations are disentangled. Specifically, our IDEL reduces the dependency between style and content embeddings by minimizing a sample-based mutual information upper bound. Furthermore, the mutual information between latent embeddings and the input data is also maximized to ensure the representativeness of the latent embeddings (i.e., style and content embeddings). The contributions of this paper are summarized as follows:
• A principled framework is introduced to learn disentangled representations of natural language. By minimizing a novel VI-based DRL objective, our model not only explicitly reduces the correlation between style and content embeddings, but also simultaneously preserves the sentence information in the latent spaces.
• A general sample-based mutual information upper bound is derived to facilitate the minimization of our VI-based objective. With this new upper bound, the dependency of style and content embeddings can be decreased effectively and stably.
• The proposed model is evaluated empirically relative to other disentangled representation learning methods. Our model exhibits competitive results in several real-world applications.

Mutual Information Variational Bounds
Mutual information (MI) is a key concept in information theory, for measuring the dependence between two random variables. Given two random variables x and y, their MI is defined as

$$I(x; y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right], \tag{1}$$

where p(x, y) is the joint distribution of the random variables, with p(x) and p(y) representing the respective marginal distributions. In disentangled representation learning, a common goal is to minimize the MI between different types of embeddings (Poole et al., 2019). However, the exact MI value is difficult to calculate in practice, because in most cases the integral in Eq. (1) is intractable. To address this problem, various MI estimation methods have been introduced (Chen et al., 2016; Belghazi et al., 2018; Poole et al., 2019). One of the commonly used estimation approaches is the Barber-Agakov lower bound (Barber and Agakov, 2003). By introducing a variational distribution q(x|y), one may derive

$$I(x; y) \geq H(x) + \mathbb{E}_{p(x, y)}\left[\log q(x|y)\right], \tag{2}$$

where $H(x) = \mathbb{E}_{p(x)}[-\log p(x)]$ is the entropy of variable x.

Figure 1: The green and purple circles represent the entropy of x and y, respectively. The intersection (blue region) is the mutual information between x and y. The symmetric difference of the two circles (green and purple regions) is VI(x; y).
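For small discrete variables, the MI definition and the Barber-Agakov bound can both be evaluated exactly, which makes the relationship between them easy to check. The sketch below is illustrative only: the 2x2 joint table and the variational distribution `rough_q` are made-up values, not quantities from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """I(x; y) for a joint probability table joint[i][j] = p(x=i, y=j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def ba_lower_bound(joint, q):
    """H(x) + E_{p(x,y)}[log q(x|y)], with q[i][j] = q(x=i | y=j)."""
    px = [sum(row) for row in joint]
    expected_log_q = sum(
        pxy * math.log(q[i][j])
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )
    return entropy(px) + expected_log_q

# Made-up joint distribution of two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
py = [sum(col) for col in zip(*joint)]
# The true posterior p(x|y) makes the bound tight.
exact_posterior = [[joint[i][j] / py[j] for j in range(2)] for i in range(2)]
# Any other valid conditional distribution gives a looser bound.
rough_q = [[0.7, 0.3],
           [0.3, 0.7]]

mi = mutual_information(joint)
bound_tight = ba_lower_bound(joint, exact_posterior)
bound_loose = ba_lower_bound(joint, rough_q)
```

The bound equals the MI when q(x|y) matches the true posterior and falls below it otherwise, which is why maximizing the bound over q is a sensible surrogate for maximizing MI.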

Variation of Information
In information theory, Variation of Information (VI, also called Shared Information Distance) is a measure of independence between two random variables. The mathematical definition of VI between random variables x and y is

$$\mathrm{VI}(x; y) = H(x) + H(y) - 2I(x; y), \tag{3}$$

where H(x) and H(y) are entropies of x and y, respectively (shown in Figure 1). Kraskov et al. (2005) show that VI is a well-defined metric, which satisfies the triangle inequality:

$$\mathrm{VI}(x; y) + \mathrm{VI}(y; z) \geq \mathrm{VI}(x; z), \tag{4}$$

for any random variables x, y and z. Additionally, VI(x; y) = 0 indicates x and y are the same variable (Meilă, 2007). From Eq. (3), the VI distance has a close relation to mutual information: if the mutual information is a measure of "dependence" between two variables, then the VI distance is a measure of "independence" between them.
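The two extreme cases of VI can be checked numerically on discrete variables. This is a minimal sketch using the identity VI(x; y) = H(x) + H(y) − 2I(x; y); the two joint tables are toy examples chosen for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def variation_of_information(joint):
    """VI(x; y) = H(x) + H(y) - 2 I(x; y) for joint[i][j] = p(x=i, y=j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    hxy = entropy([p for row in joint for p in row])
    mi = entropy(px) + entropy(py) - hxy  # I(x; y) = H(x) + H(y) - H(x, y)
    return entropy(px) + entropy(py) - 2 * mi

identical = [[0.5, 0.0], [0.0, 0.5]]        # y is a copy of x
independent = [[0.25, 0.25], [0.25, 0.25]]  # two independent fair bits

vi_identical = variation_of_information(identical)
vi_independent = variation_of_information(independent)
```

Identical variables give a VI of zero, while independent fair bits give the largest possible value for this case, H(x) + H(y) = 2 log 2.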

Method
Consider data $\{(x_i, y_i)\}_{i=1}^{N}$, where each x_i is a sentence drawn from a distribution p(x), and y_i is the label indicating the style of x_i. The goal is to encode each sentence x_i into its corresponding style embedding s_i and content embedding c_i with an encoder q_θ(s, c|x):

$$s_i, c_i \mid x_i \sim q_\theta(s, c \mid x_i).$$

The collection of style embeddings $\{s_i\}_{i=1}^{N}$ can be regarded as samples drawn from a variable s in the style embedding space, while the collection of content embeddings $\{c_i\}_{i=1}^{N}$ are samples from a variable c in the content embedding space. In practice, the dimension of the content embedding is typically higher than that of the style embedding, considering that the content usually contains more information than the style (John et al., 2019).
We first give an intuitive introduction to our proposed VI-based objective, then in Section 3.1 we provide the theoretical justification for it. To disentangle the style and content embedding, we try to minimize the mutual information I(s; c) between s and c. Meanwhile, we maximize I(c; x) to ensure that the content embedding c sufficiently encapsulates information from the sentence x. The embedding s is expected to contain rich style information. Therefore, the mutual information I(s; y) should be maximized. Thus, our overall disentangled representation learning objective is: L_Dis = I(s; c) − I(c; x) − I(s; y).

Theoretical Justification of the Objective
The objective L_Dis has a strong connection with the independence measurement in information theory. As described in Section 2.2, Variation of Information (VI) is a well-defined metric of independence between variables. Applying the triangle inequality from Eq. (4) to s, c and x, we have VI(s; x) + VI(x; c) ≥ VI(s; c). Equality occurs if and only if the information from variable x is totally separated into two independent variables s and c, which is an ideal scenario for disentangling sentence x into its corresponding style embedding s and content embedding c. Therefore, the difference between VI(s; x) + VI(x; c) and VI(s; c) represents the degree of disentanglement. Hence we introduce a measurement:

$$D(x; s, c) = \mathrm{VI}(s; x) + \mathrm{VI}(x; c) - \mathrm{VI}(c; s),$$

which, by Eq. (3), equals 2H(x) + 2[I(s; c) − I(x; c) − I(x; s)]. Since H(x) is a constant associated with the data, we only need to focus on I(s; c) − I(x; c) − I(x; s).
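The reduction of the measurement D(x; s, c) to three MI terms follows from substituting Eq. (3) into each VI term. The expansion below spells out that intermediate step; it restates the argument and adds no new assumption.

```latex
\begin{aligned}
D(x; s, c)
  &= \mathrm{VI}(s; x) + \mathrm{VI}(x; c) - \mathrm{VI}(c; s) \\
  &= \bigl[H(s) + H(x) - 2I(s; x)\bigr]
   + \bigl[H(x) + H(c) - 2I(x; c)\bigr]
   - \bigl[H(c) + H(s) - 2I(c; s)\bigr] \\
  &= 2H(x) + 2\bigl[I(s; c) - I(x; c) - I(x; s)\bigr].
\end{aligned}
```

The entropies H(s) and H(c) cancel, so only the data entropy H(x) and the three MI terms remain.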
The measurement D(x; s, c) is symmetric in style s and content c, giving rise to the problem that without any inductive bias in supervision, the disentangled representation could be meaningless (as observed by Locatello et al. (2019)). Therefore, we add inductive biases by utilizing the style label y as supervised information for style embedding s. Noting that s → x → y is a Markov chain, we have I(s; x) ≥ I(s; y) based on the MI data-processing inequality (Cover and Thomas, 2012). Then we convert the minimization of I(s; c) − I(x; c) − I(x; s) into the minimization of the upper bound I(s; c) − I(x; c) − I(y; s), which further leads to our objective L_Dis.
However, minimizing the exact value of mutual information in the objective L_Dis causes numerical instabilities, especially when the dimension of the latent embeddings is large (Chen et al., 2016). Therefore, we provide several MI estimations for the objective terms I(s; c), I(x; c) and I(s; y) in the following two sections.

MI Variational Lower Bound
To maximize I(x; c) and I(s; y), we derive two variational lower bounds. For I(x; c), we introduce a variational decoder q_φ(x|c) to reconstruct the sentence x from the content embedding c. Leveraging the MI variational lower bound from Eq. (2), we have I(x; c) ≥ H(x) + E_{p(x,c)}[log q_φ(x|c)]. Similarly, for I(s; y), another variational lower bound can be obtained as: I(s; y) ≥ H(y) + E_{p(y,s)}[log q_ψ(y|s)], where q_ψ(y|s) is a classifier mapping the style embedding s to its corresponding style label y. Based on these two lower bounds, L_Dis has an upper bound:

$$\mathcal{L}_{\text{Dis}} \leq I(s; c) - \bigl[H(x) + \mathbb{E}_{p(x, c)}[\log q_\phi(x|c)]\bigr] - \bigl[H(y) + \mathbb{E}_{p(y, s)}[\log q_\psi(y|s)]\bigr]. \tag{6}$$

Noting that both H(x) and H(y) are constants from the data, we only need to minimize:

$$\bar{\mathcal{L}}_{\text{Dis}} = I(s; c) - \mathbb{E}_{p(x, c)}[\log q_\phi(x|c)] - \mathbb{E}_{p(y, s)}[\log q_\psi(y|s)].$$

As an intuitive explanation of $\bar{\mathcal{L}}_{\text{Dis}}$, the style embedding s and content embedding c are expected to be independent by minimizing mutual information I(s; c), while they also need to be representative: the style embedding s is encouraged to give a better prediction of style label y by maximizing E_{p(y,s)}[log q_ψ(y|s)]; the content embedding should maximize the log-likelihood E_{p(x,c)}[log q_φ(x|c)] to contain sufficient information from sentence x.

MI Sample-based Upper Bound
To estimate I(s; c), we propose a novel sample-based upper bound. Assume we have M latent embedding pairs $\{(s_j, c_j)\}_{j=1}^{M}$ drawn from p(s, c). As shown in Theorem 3.1, we derive an upper bound of mutual information based on the samples. A detailed proof is provided in the Supplementary Material.
Theorem 3.1. If $(s_j, c_j) \sim p(s, c)$, $j = 1, \dots, M$, then

$$I(s; c) \leq \mathbb{E}\Bigl[\frac{1}{M} \sum_{j=1}^{M} R_j\Bigr] =: \hat{I}(s; c), \tag{8}$$

where $R_j = \log p(s_j|c_j) - \frac{1}{M} \sum_{k=1}^{M} \log p(s_j|c_k)$.
Based on Theorem 3.1, given embedding samples $\{(s_j, c_j)\}_{j=1}^{M}$, we can minimize $\frac{1}{M} \sum_{j=1}^{M} R_j$ as an unbiased estimation of the upper bound Î(s; c). The calculation of R_j requires the conditional distribution p(s|c), whose closed form is unknown. Therefore, we use a variational network p_σ(s|c) to approximate p(s|c) with embedding samples.
To implement the upper bound in Eq. (8), we first feed M sentences {x_j} into encoder q_θ(s, c|x) to obtain embedding pairs {(s_j, c_j)}. Then, we train the variational distribution p_σ(s|c) by maximizing the log-likelihood $L(\sigma) = \frac{1}{M} \sum_{j=1}^{M} \log p_\sigma(s_j|c_j)$. After the training of p_σ(s|c) is finished, we calculate R_j for each embedding pair (s_j, c_j). Finally, the gradient for $\frac{1}{M} \sum_{j=1}^{M} R_j$ is calculated and back-propagated to encoder q_θ(s, c|x). We apply the reparameterization trick (Kingma and Welling, 2013) to ensure the gradient back-propagates through the sampled embeddings (s_j, c_j). When the encoder weights are updated, the distribution q_θ(s, c|x) changes, which in turn changes the conditional distribution p(s|c). Therefore, we need to update the approximation network p_σ(s|c) again. Consequently, the encoder network q_θ(s, c|x) and the approximation network p_σ(s|c) are updated alternately during training.
In each training step, the above algorithm requires M pairs of embedding samples $\{(s_j, c_j)\}_{j=1}^{M}$ and the calculation of all conditional distributions p_σ(s_j|c_k). This leads to O(M^2) computational complexity. To accelerate the training, we further approximate the term $\frac{1}{M} \sum_{k=1}^{M} \log p_\sigma(s_j|c_k)$ in R_j by $\log p_\sigma(s_j|c_{k'})$, where k' is sampled uniformly from the indices {1, ..., M}. This stochastic sampling not only leads to an unbiased estimation R̂_j of R_j, but also improves the model robustness (as shown in Algorithm 1).
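The difference between the O(M^2) estimator and its stochastic O(M) variant can be seen in a few lines of plain Python. The sketch below is a toy under stated assumptions: embeddings are one-dimensional, and the hypothetical function `log_cond` (a fixed linear-Gaussian density) stands in for the learned network p_σ(s|c); none of these choices come from the paper.

```python
import math
import random

def log_gaussian(s, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at s."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (s - mean) ** 2 / (2 * std ** 2)

def log_cond(s, c, weight=0.8, std=0.6):
    """Hypothetical stand-in for log p_sigma(s | c): a linear-Gaussian model."""
    return log_gaussian(s, weight * c, std)

def mi_upper_bound(pairs):
    """(1/M) sum_j R_j with the full inner average; O(M^2) evaluations."""
    m = len(pairs)
    total = 0.0
    for s_j, c_j in pairs:
        positive = log_cond(s_j, c_j)
        negative = sum(log_cond(s_j, c_k) for _, c_k in pairs) / m
        total += positive - negative
    return total / m

def mi_upper_bound_stochastic(pairs, rng):
    """Same estimator with one uniformly sampled index k' per j; O(M) cost."""
    m = len(pairs)
    total = 0.0
    for s_j, c_j in pairs:
        _, c_k = pairs[rng.randrange(m)]
        total += log_cond(s_j, c_j) - log_cond(s_j, c_k)
    return total / m

rng = random.Random(0)
# Dependent pairs: s is a noisy linear function of c.
dependent = []
for _ in range(500):
    c = rng.gauss(0.0, 1.0)
    dependent.append((0.8 * c + rng.gauss(0.0, 0.6), c))
# Independent pairs: s is drawn without looking at c.
independent = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]

bound_dependent = mi_upper_bound(dependent)
bound_independent = mi_upper_bound(independent)
bound_stochastic = mi_upper_bound_stochastic(dependent, rng)
```

In this toy setting the estimate is clearly positive for the dependent pairs and close to zero for the independent ones, and the stochastic version tracks the full estimator while evaluating only one negative pair per sample.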
Symmetrically, we can also derive an MI upper bound based on the conditional distribution p(c|s). However, the dimension of c is much higher than the dimension of s, which indicates that the neural approximation to p(c|s) would have worse performance compared with the approximation to p(s|c). Alternatively, the lower-dimensional distribution p(s|c) used in our model is relatively easy to approximate with neural networks.

Encoder-Decoder Framework
One important downstream task for disentangled representation learning (DRL) is conditional generation. Our MI-based text DRL method can be also embedded into an Encoder-Decoder generative model and trained end-to-end.
Since the proposed DRL encoder q_θ(s, c|x) is a stochastic neural network, a natural extension is to add a decoder to build a variational autoencoder (VAE) (Kingma and Welling, 2013). Therefore, we introduce another decoder network p_γ(x|s, c) that generates a new sentence based on the given style s and content c. A prior distribution p(s, c) = p(s)p(c), as the product of two multivariate unit-variance Gaussians, is used to regularize the posterior distribution q_θ(s, c|x) by KL-divergence minimization. Meanwhile, the log-likelihood term for text reconstruction should be maximized. The objective for VAE is:

$$\mathcal{L}_{\text{VAE}} = \mathrm{KL}\bigl(q_\theta(s, c|x) \,\|\, p(s, c)\bigr) - \mathbb{E}_{q_\theta(s, c|x)}[\log p_\gamma(x|s, c)].$$

We combine the VAE objective and our MI-based disentanglement term to form an end-to-end learning framework (as shown in Figure 2). The total loss is

$$\mathcal{L}_{\text{total}} = \beta \bigl[\hat{I}(s; c) - \mathbb{E}_{p(x, c)}[\log q_\phi(x|c)] - \mathbb{E}_{p(y, s)}[\log q_\psi(y|s)]\bigr] + \mathcal{L}_{\text{VAE}},$$

where β is a hyper-parameter that re-weights the disentanglement and VAE terms.

Figure 2: The style embedding s goes through a classifier q_ψ(y|s) to predict the style label y; the content embedding c is used to reconstruct x. An auxiliary network p_σ(s|c) helps disentangle the style and content embeddings. The decoder p_γ(x|s, c) generates sentences based on the combination of s and c.
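The KL regularizer toward a unit-variance Gaussian prior has a well-known closed form when the posterior is a diagonal Gaussian. The sketch below assumes such a posterior, factorized over style and content (an assumption, since the encoder's output distribution is not specified here), and combines the pieces with a weight `beta` on the disentanglement term. All numeric inputs are placeholders.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(lv) + mu ** 2 - 1.0 - lv
        for mu, lv in zip(mean, log_var)
    )

def total_loss(recon_nll, kl_style, kl_content, dis_loss, beta):
    """VAE terms plus the beta-weighted disentanglement term (assumed form)."""
    return recon_nll + kl_style + kl_content + beta * dis_loss

# A posterior equal to the prior costs nothing; shifting one mean by 1 costs 0.5.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

# Placeholder values for the reconstruction and disentanglement terms.
loss = total_loss(recon_nll=2.0, kl_style=kl_shifted, kl_content=kl_zero,
                  dis_loss=0.4, beta=0.5)
```

Because the prior factorizes as p(s)p(c), the KL term splits into a style part and a content part under the factorized-posterior assumption, which is why the two are passed separately.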

Mutual Information Estimation
Mutual information (MI) is a fundamental measurement of the dependence between two random variables. MI has been applied to a wide range of tasks in machine learning, including generative modeling (Chen et al., 2016), the information bottleneck (Tishby et al., 2000), and domain adaptation (Gholami et al., 2020). In our proposed method, we utilize MI to measure the dependence between content and style embedding. By minimizing the MI, the learned content and style representations are explicitly disentangled. However, the exact value of MI is hard to calculate, especially for high-dimensional embedding vectors (Poole et al., 2019). To approximate MI, most previous work focuses on lower-bound estimations (Chen et al., 2016;Belghazi et al., 2018;Poole et al., 2019), which are not applicable to MI minimization tasks. Poole et al. (2019) propose a leave-one-out upper bound of MI; however it is not numerically stable in practice. Inspired by these observations, we introduce a novel MI upper bound for disentangled representation learning, which stably minimizes the correlation between content and style embedding in a principled manner.

Datasets
We conduct experiments to evaluate our models on the following real-world datasets:
Yelp Reviews: The Yelp dataset contains online service reviews with associated rating scores. We follow the pre-processing from Shen et al. (2017) for a fair comparison. The resulting dataset includes 250,000 positive review sentences and 350,000 negative review sentences.
Personality Captioning: The Personality Captioning dataset (Shuster et al., 2019) collects captions of images which are written according to 215 different personality traits. These traits can be divided into three categories: positive, neutral, and negative. We select sentences from positive and negative classes for evaluation.

Experimental Setup
We build the sentence encoder q_θ(s, c|x) with a one-layer bi-directional LSTM plus a multi-head attention mechanism. The style classifier q_ψ(y|s) is parameterized by a single fully-connected network with the softmax activation. The content-based decoder q_φ(x|c) is a one-layer uni-directional LSTM appended with a linear layer with vocabulary size output, outputting the predicted probability of the next words. The conditional distribution approximation p_σ(s|c) is represented by a two-layer fully-connected network with ReLU activation. The generator p_γ(x|s, c) is built by a two-layer uni-directional LSTM plus a linear projection with output dimension equal to the vocabulary size, providing the next-word prediction based on previous sentence information and the current word.
We initialize and fix our word embeddings by the 300-dimensional pre-trained GloVe vectors (Pennington et al., 2014). The style embedding dimension is set to 32 and the content embedding dimension is 512. We use a standard multivariate normal distribution as the prior of the latent spaces. We train the model with the Adam optimizer (Kingma and Ba, 2014) with initial learning rate of 5 × 10^{-5}. The batch size is equal to 128.

Embedding Disentanglement Quality
We first examine the disentangling quality of learned latent embeddings, primarily studying the latent spaces of IDEL on the Yelp dataset.
Latent Space Visualization: We randomly select 1,000 sentences from the Yelp testing set and visualize their latent embeddings in Figure 3, via t-SNE plots (van der Maaten and Hinton, 2008). The blue and red points respectively represent the positive and negative sentences. The left side of the figure shows the style embedding space, which is well separated into two parts with different colors. It supports the claim that our model learns a semantically meaningful style embedding space. The right side of the figure is the content embedding space, which cannot be distinguished by the style labels (different colors). The lack of difference in the pattern of content embedding also provides evidence that our content embeddings have little correlation with the style labels.
For an ablation study, we train another IDEL model under the same setup, while removing our MI upper bound Î(s; c). We call this model IDEL⁻ in the following experiments. We encode the same sentences used in Figure 3, and display the corresponding embeddings in Figure 4. Compared with results from the original IDEL, the style embedding space (left in Figure 4) is not separated in a clean manner. On the other hand, the positive and negative embeddings become distinguishable in the content embedding space. The difference between Figures 3 and 4 indicates the disentangling effectiveness of our MI upper bound Î(s; c).

Label-Embedding Correlation:
Besides visualization, we also numerically analyze the correlation between latent embeddings and style labels. Inspired by the statistical two-sample test (Gretton et al., 2012), we use the sample-based divergence between the positive embedding distribution p(c|y = 1) and the negative embedding distribution p(c|y = 0) as a measurement of label-embedding correlation. We consider four divergences: Mean Absolute Deviation (MAD) (Geary, 1935), Energy Distance (ED) (Sejdinovic et al., 2013), Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), and Wasserstein distance (WD) (Ramdas et al., 2017). For a fair comparison, we re-implement previous text embedding methods and set their content embedding dimension to 512 and the style embedding dimension to 32 (if applicable). Details about the divergences and embedding processing are shown in the Supplementary Material.
From Table 2, the proposed IDEL achieves the lowest divergences between positive and negative content embeddings compared with Ctrl-Gen (Hu et al., 2017), CAAE (Shen et al., 2017), ARAE (Zhao et al., 2018) [...] only prior method that infers the text style embeddings. Table 3 shows a larger distribution gap between positive and negative style embeddings with IDEL than with DRLST, which demonstrates the proposed IDEL has better style information expression in the style embedding space. The comparison between IDEL and IDEL⁻ supports the effectiveness of our MI upper bound minimization.

Embedding Representation Quality
To show the representation ability of IDEL, we conduct experiments on two text-generation tasks: style transfer and conditional generation. For style transfer, we encode two sentences into a disentangled representation, and then combine the style embedding from one sentence and the content embedding from another to generate a new sentence via the generator p_γ(x|s, c). For conditional generation, we set one of the style or content embeddings to be fixed and sample the other part from the latent prior distribution, and then use the combination to generate text. Since most previous work only embedded the content information, for fair comparison, we mainly focus on fixing style and sampling content embeddings under the conditional generation setup.
To measure generation quality for both tasks, we test the following metrics (more specific description is provided in the Supplementary Material).
Style Preservation: Following previous work (Hu et al., 2017;Shen et al., 2017;John et al., 2019), we pre-train a style classifier and use it to test whether a generated sentence can be categorized into the correct target style class.
Content Preservation: For style transfer, we measure whether a generation preserves the content information from the original sentence by the self-BLEU score (Zhang et al., 2019, 2020). The self-BLEU is calculated between one original sentence and its style-transferred sentence.
Generation Quality: To measure the generation quality, we calculate the corpus-level BLEU score (Papineni et al., 2002) between a generated sentence and the testing data corpus.
Geometric Mean: We use the geometric mean (GM) (John et al., 2019) of the above metrics to obtain an overall evaluation metric of the representativeness of DRL models.
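As a small illustration of why a geometric mean is a reasonable summary here, the sketch below computes it for two sets of hypothetical metric values (placeholders, not results from the paper) with the same arithmetic mean.

```python
import math

def geometric_mean(scores):
    """Geometric mean of strictly positive metric values on a shared scale."""
    if any(score <= 0 for score in scores):
        raise ValueError("geometric mean needs strictly positive scores")
    return math.exp(sum(math.log(score) for score in scores) / len(scores))

# Hypothetical (style accuracy, content score, generation score) triples.
balanced = geometric_mean([50.0, 50.0, 50.0])
unbalanced = geometric_mean([90.0, 50.0, 10.0])
```

Both triples have the same arithmetic mean, but the unbalanced one gets a lower geometric mean, so a model cannot score well overall by excelling on one metric while failing another.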
We compare our IDEL with previous state-of-the-art methods on Yelp and Personality Captioning datasets, as shown in Table 1. The references to the other models are mentioned in Section 5.3. Note that the original BackTranslation (BT) method (Lample et al., 2019) is an auto-encoder framework that is not able to do conditional generation. To compare with BT fairly, we add a standard Gaussian prior in its latent space to make it a variational auto-encoder model.
From the results in Table 1, ARAE performs well on the conditional generation. Compared to ARAE, our model performance is slightly lower on content preservation (BLEU). In contrast, the style classification score of IDEL has a large margin above that of ARAE. The BackTranslation (BT) has a better performance on style transfer tasks, especially on the Yelp dataset. Our IDEL has a lower style classification accuracy (ACC) than BT on the style transfer task. However, IDEL achieves high BLEU on style transfer, which leads to a high overall GM score on the Personality-Captioning dataset. On the Yelp dataset, IDEL also has a competitive GM score compared with BT. The experiments show a clear trade-off between style preservation and content preservation, in which our IDEL learns more representative disentangled representation and leads to a better balance.
Besides the automatic evaluation metrics mentioned above, we further test our disentangled representation effectiveness by human evaluation. Due to the limitation of manual effort, we only evaluate the style transfer performance on the Yelp dataset. The generated sentences are manually evaluated on style accuracy (SA), content preservation (CP), and sentence fluency (SF). The CP and SF scores are between 0 and 5. Details are provided in the Supplementary Material. Our method achieves better style and content preservation, with a little performance sacrifice on sentence fluency.

Content Source | Style Source | Transferred Result
I enjoy it thoroughly! | never before had a bad experience at the habit until tonight. | I dislike it thoroughly.
quality is just so so. | never before had a bad experience at the habit until tonight. | quality is so bad.
I am so grateful. | never before had a bad experience at the habit until tonight. | I am so disgusted.
never before had a bad experience at the habit until tonight. | I am so grateful. | never had a service that was enjoyable experience tonight.
never before had a bad experience at the habit until tonight. | quality is just so so. | never had a unimpressed experience until tonight.
never before had a bad experience at the habit until tonight. | quality of food is fantastic. | never had awesome routine until tonight.
I am so disappointed with palm today. | we were both so impressed. | I am so impressed with palm again.
I am so disappointed with palm today. | quality of food is fantastic. | I am good with palm today.
I am so disappointed with palm today. | never before had a bad experience at the habit until tonight. | I am so disgusted with palm today.

[...] renders the sentence with more detailed style information (e.g., the degree of the sentiment).
In addition, we conduct an ablation study to test the influence of different objective terms in our model. We re-train the model with different training loss combinations while keeping all other setups the same. In Table 1, IDEL surpasses IDEL⁻ (without MI upper bound minimization) by a large margin, demonstrating the effectiveness of our proposed MI upper bound. The vanilla VAE has the best generation quality. However, its transfer style accuracy is only slightly better than a random guess.
When adding I(s; y), the ACC score significantly improves, but the content preservation (S-BLEU) becomes worse. When adding I(c; x), the content information is well preserved, while the ACC even decreases. By gradually adding MI terms, the model performance becomes more balanced on all the metrics, with the overall GM monotonically increasing. Additionally, we test the influence of the stochastic calculation of R_j in Algorithm 1 (IDEL) by comparing it with the closed form from Theorem 3.1 (IDEL*). The stochastic IDEL not only accelerates the training but also gains a performance improvement relative to IDEL*.

Conclusions
We have proposed a novel information-theoretic disentangled text representation learning framework. Following the theoretical guidance from information theory, our method separates the textual information into independent spaces, constituting style and content representations. A sample-based mutual information upper bound is derived to help reduce the dependence between embedding spaces. Concurrently, the original text information is well preserved by maximizing the mutual information between input sentences and latent representations. In experiments, we introduce several two-sample test statistics to measure label-embedding correlation. The proposed model achieves competitive performance compared with previous methods on both conditional generation and style transfer. For future work, our model can be extended to disentangled representation learning with non-categorical style labels, and applied to zero-shot style transfer with newly-coming unseen styles.
which is what we claim in Theorem 3.1.
The inequality is based on the fact that the KL-divergence is always non-negative. The lower bound for I(s; y) can also be derived in a similar way.

B Sample-based Embedding Divergences
In this section we introduce the implementation details of the label-embedding correlation calculation. As mentioned in Section 5.4, the distribution divergence between p(c|y = 0) and p(c|y = 1) measures the correlation between content embeddings and style labels. Assume $c^0_1, \dots, c^0_{N_0} \sim p(c|y = 0)$ and $c^1_1, \dots, c^1_{N_1} \sim p(c|y = 1)$; then the four metrics MAD, ED, WD, and MMD are calculated based on the two groups of samples. With a ground distance d(·, ·), the implementation of the above four metrics is as follows: [...] where K(·, ·) is a kernel function. Here we choose K(·, ·) from the RBF kernel family with bandwidth w = 1.
For style embeddings, the calculations take the same form as in the above equations. The style embeddings and content embeddings have different dimensions, which makes the ground metric d(·, ·) inconsistent between the two spaces. Therefore, instead of using Euclidean distance, we use the cosine distance as the ground metric.
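To make the two-sample idea concrete, the sketch below computes a standard biased (V-statistic) estimate of squared MMD between two groups of embeddings. It is not necessarily the exact estimator used in the experiments: the kernel form exp(−d²/(2w²)) applied to the cosine distance is an assumption, and the 2-D vectors are toy stand-ins for samples from p(c|y = 0) and p(c|y = 1).

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rbf_kernel(u, v, bandwidth=1.0):
    """RBF kernel on the ground distance; this parameterization is an assumption."""
    return math.exp(-cosine_distance(u, v) ** 2 / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf_kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(group0, group1):
    """Biased (V-statistic) estimate of squared MMD between two sample groups."""
    return (mean_kernel(group0, group0) + mean_kernel(group1, group1)
            - 2.0 * mean_kernel(group0, group1))

# Toy embeddings: the two groups point in opposite directions.
negative = [[1.0, 0.1], [0.9, 0.2], [1.0, -0.1]]
positive = [[-1.0, 0.1], [-0.9, -0.2], [-1.0, 0.0]]

mmd_separated = mmd_squared(negative, positive)
mmd_same = mmd_squared(negative, negative)
```

A group compared with itself gives zero, while well-separated groups give a large value; for content embeddings a small divergence is the desired outcome, and for style embeddings a large one.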

C Detailed Experimental Setups
We set the dimension of the style embedding to be smaller than that of the content embedding, because the content carries more information than the style of sentences. The hyper-parameter β in our loss function re-weights the two objectives of disentanglement and autoencoding. In practice, we vary it from 0 to 1 with step 0.1 during the first 10 training epochs. At the beginning of the training, the output latent embeddings are not representative enough. Therefore, we choose a small weight on the disentanglement term to avoid obstructing the learning of representative embeddings. After the latent embeddings are sufficiently trained to reconstruct the input sentences successfully, we slowly enlarge β for the disentanglement. After β reaches 1, we fix it until all the training epochs are finished.
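The warm-up of β can be written as a one-line schedule. The sketch below assumes 0-indexed epochs and one step of 0.1 per epoch, which is one plausible reading of the description; the function name `beta_schedule` is hypothetical.

```python
def beta_schedule(epoch, warmup_epochs=10):
    """Linear warm-up of beta from 0 to 1, then constant (0-indexed epochs assumed)."""
    if epoch >= warmup_epochs:
        return 1.0
    return epoch / warmup_epochs

# beta for the first 13 epochs: rises by 0.1 per epoch, then stays at 1.
schedule = [beta_schedule(epoch) for epoch in range(13)]
```

Starting at zero lets the autoencoding terms shape the embeddings first, and the disentanglement pressure is then phased in gradually.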

D Details in Representation Quality Evaluation
For style preservation, we pretrain a style classifier on each dataset. The style classifier is built by a one-layer LSTM appended with a multi-head attention layer. The number of attention heads is set to 6. The classifiers reach 95% prediction accuracy on Yelp and 93% prediction accuracy on Personality-Captioning. We input transferred sentences into the classifier and test whether the predicted style label is the same as the target style label. For human evaluation, we transferred 1000 sentences with randomly selected style labels. After the transfer, we ask 10 human annotators to judge the style label, content preservation, and sentence fluency. The style label is 0 or 1, representing the positive or negative sentiment of the given sentence. The content preservation and the sentence fluency are scored between 0 and 5. To make the style accuracy compatible with the other two scores, we scale it into the range [0, 5]. If the scores from two annotators differ by more than 2, the scores are not recorded. In this way, we ensure the evaluation criteria of annotators are similar.
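The scaling and disagreement filter described above can be sketched as follows. The ratings are hypothetical, and averaging the retained scores is an assumption about how the final numbers are aggregated.

```python
def scale_style_accuracy(accuracy):
    """Map a style accuracy in [0, 1] onto the [0, 5] scale used by CP and SF."""
    return 5.0 * accuracy

def keep_rating(score_a, score_b, max_gap=2):
    """Keep a pair of annotator scores only if they differ by at most max_gap."""
    return abs(score_a - score_b) <= max_gap

# Hypothetical (annotator A, annotator B) content-preservation scores.
ratings = [(4, 5), (1, 4), (3, 3), (0, 2)]
kept = [pair for pair in ratings if keep_rating(*pair)]

# Assumed aggregation: plain average over all retained scores.
average = sum(a + b for a, b in kept) / (2 * len(kept))
```

Only the pair (1, 4) is dropped here, since its scores differ by more than 2.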