APo-VAE: Text Generation in Hyperbolic Space

Natural language often exhibits inherent hierarchical structure ingrained with complex syntax and semantics. However, most state-of-the-art deep generative models learn embeddings only in Euclidean vector space, without accounting for this structural property of language. In this paper, we investigate text generation in a hyperbolic latent space to learn continuous hierarchical representations. An Adversarial Poincaré Variational Autoencoder (APo-VAE) is presented, where both the prior and variational posterior of latent variables are defined over a Poincaré ball via wrapped normal distributions. By adopting the primal-dual formulation of the Kullback-Leibler divergence, an adversarial learning procedure is introduced to enable robust model training. Extensive experiments in language modeling, unaligned style transfer, and dialog-response generation demonstrate the effectiveness of the proposed APo-VAE over VAEs with Euclidean latent space, thanks to its superior capability in capturing latent language hierarchies in hyperbolic space.


Introduction
The Variational Autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) is a generative model widely applied to language-generation tasks, which propagates latent codes drawn from a simple prior through a decoder to manifest data samples. The generative model is augmented by an inference network, which feeds observed data samples through an encoder to yield a distribution on the corresponding latent codes. Since natural language often manifests a latent hierarchical structure, it is desirable for the latent code in a VAE to reflect such inherent language structure, so that the generated text can be more natural and expressive. An example of language structure is illustrated in Figure 1, where sentences are organized into a tree structure. The root node corresponds to simple sentences (e.g., "Yes"), while nodes on outer leaves represent sentences with more complex syntactic structure and richer, more specific semantic meaning (e.g., "The food in the restaurant is awesome").

Figure 1: Illustration of the latent hierarchy in natural language. Each tree node is a latent code of its corresponding sentence.

* Work was done when the author interned at Microsoft.
In existing VAE-based generative models, such structures are not explicitly considered. The latent code often employs a simple Gaussian prior, and the posterior is approximated as a Gaussian with diagonal covariance matrix. Such embeddings assume Euclidean structure, which is inadequate in capturing geometric structure illustrated in Figure 1. While some variants have been proposed to enrich the prior distributions (Xu and Durrett, 2018;Wang et al., 2019a,b;Shi et al., 2019), there is no evidence that structural information in language can be recovered effectively by the model.
Hyperbolic geometry has recently emerged as an effective approach for representation learning on data with hierarchical structure (Mathieu et al., 2019; Nickel and Kiela, 2017). Informally, hyperbolic space can be considered as a continuous map of trees. For example, a Poincaré disk (a hyperbolic space with two dimensions) can represent any tree with arbitrarily low distortion (De Sa et al., 2018; Sarkar, 2011). In Euclidean space, however, it is difficult to learn such structural representations even with infinitely many dimensions (Linial et al., 1995).
Motivated by these observations, we propose Adversarial Poincaré Variational Autoencoder (APo-VAE), a text embedding and generation model based on hyperbolic representations, where the latent code is encouraged to capture the underlying tree-like structure in language. Such latent structure provides more control of the generated sentences, i.e., an increase of sentence complexity and diversity can be achieved along some trajectory from a root to its children. In practice, we define both the prior and the variational posterior of the latent code over a Poincaré ball, via the use of a wrapped normal distribution (Nagano et al., 2019).
To obtain more stable model training and learn more flexible representations of the latent code, we exploit the primal-dual formulation of the Kullback-Leibler (KL) divergence based on Fenchel duality (Rockafellar et al., 1966) to adversarially optimize the variational bound. Unlike the primal form that relies on Monte Carlo approximation (Mathieu et al., 2019), our dual formulation bypasses the need for tractable posterior likelihoods via the introduction of an auxiliary dual function.
We apply the proposed approach to language modeling, unaligned style transfer, and dialog-response generation. For language modeling, in order to enhance the distribution complexity of the prior, we adopt an additional "variational mixture of posteriors" prior (VampPrior) design (Tomczak and Welling, 2018) for the wrapped normal distribution. Specifically, VampPrior uses a mixture distribution with components given by variational posteriors, coupling the parameters of the prior and variational posterior. For unaligned style transfer, we add a sentiment classifier to our model, and disentangle content and sentiment information via adversarial training (Zhao et al., 2017a). For dialog-response generation, a conditional variant of APo-VAE is designed to take the dialog context into account. Experiments further show that the proposed model effectively avoids posterior collapse (Bowman et al., 2016), a major obstacle to efficient VAE learning on text data, in which the encoder learns an approximate posterior similar to the prior and the decoder tends to ignore the latent code during generation. We hypothesize that this is due to the use of a more informative prior in hyperbolic space that enhances the complexity of the latent representation, which aligns well with previous work advocating better prior design (Tomczak and Welling, 2018; Wang et al., 2019a).
Our main contributions are summarized as follows. (i) We present Adversarial Poincaré Variational Autoencoder (APo-VAE), a novel approach to text embedding and generation based on hyperbolic latent representations. (ii) In addition to the use of a wrapped normal distribution, an adversarial learning procedure and a VampPrior design are incorporated for robust model training. (iii) Experiments on language modeling, unaligned style transfer, and dialog-response generation benchmarks demonstrate the superiority of the proposed approach compared to Euclidean VAEs, as it benefits from capturing informative latent hierarchies in natural language.

Variational Autoencoder
Consider a corpus of sentences X = {x_i}, where each x_i = [x_{i,1}, ..., x_{i,T_i}] is a sequence of tokens of length T_i. Our goal is to learn p_θ(x) that best models the observed sentences, so that the expected log-likelihood L(θ) = E_{x∼X}[log p_θ(x)] is maximized. The variational autoencoder (VAE) (Kingma and Welling, 2013; Chen et al., 2018b) considers a latent-variable model p_θ(x, z) to represent sentences, with an auxiliary encoder that draws samples of latent code z from the conditional density q_φ(z|x), known as the approximate posterior. Given a latent code z, the decoder samples a sentence from the conditional density p_θ(x|z) = ∏_t p(x_t|x_{<t}, z), where the "decoding" pass takes an auto-regressive form. Together with the prior p(z), the model is given by the joint p_θ(x, z) = p_θ(x|z)p(z). The VAE leverages the approximate posterior to derive an evidence lower bound (ELBO) to the (intractable) marginal log-likelihood log p_θ(x) = log ∫ p_θ(x, z) dz:

log p_θ(x) ≥ ELBO(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x, z) − log q_φ(z|x)],

where (θ, φ) are jointly optimized during training, and the gap is given by the decomposition

log p_θ(x) − ELBO(θ, φ; x) = D_KL(q_φ(z|x) ‖ p_θ(z|x)),

where D_KL denotes the Kullback-Leibler divergence. Alternatively, the ELBO can be written as

ELBO(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z)),

where the first (conditional likelihood) and second (KL) terms respectively characterize reconstruction and generalization capabilities. Intuitively, a good model is expected to strike a balance between good reconstruction and generalization. In most cases, both the prior and variational posterior are assumed to be Gaussian for computational convenience. However, such over-simplified assumptions may not be ideal for capturing the intrinsic characteristics of data with unique geometrical structure, such as natural language.
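As a concrete illustration of the second form of the ELBO, the sketch below computes the closed-form KL regularizer for a generic diagonal-Gaussian VAE in Euclidean space and combines it with a reconstruction estimate. This is an illustrative sketch with hypothetical function names, not the paper's hyperbolic model.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    # the regularization term in the second form of the ELBO.
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def elbo(recon_loglik, mu, logvar):
    # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)); the reconstruction
    # term is passed in as a Monte Carlo estimate.
    return recon_loglik - gaussian_kl(mu, logvar)
```

Note that when the posterior matches the prior (mu = 0, logvar = 0) the KL term vanishes, which is exactly the posterior-collapse regime discussed later.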

Hyperbolic Space
Riemannian manifolds can provide a more powerful and meaningful embedding space for complex data with highly non-Euclidean structure that cannot be effectively captured in vectorial form (e.g., social networks, biology and computer graphics). Of particular interest is hyperbolic space, where (i) the relatively simple geometry allows tractable computations, and (ii) distances grow exponentially in finite dimensions, naturally embedding rich hierarchical structure in a compact form.

Riemannian Geometry. An n-dimensional Riemannian manifold M^n is a set of points locally similar to a linear space R^n. At each point x of the manifold M^n, we can define a real vector space T_x M^n tangent to x, along with an associated metric tensor g_x(·,·) : T_x M^n × T_x M^n → R, which is an inner product on T_x M^n. Intuitively, a Riemannian manifold behaves like a vector space only in its infinitesimal neighborhood, allowing the generalization of common notions like angle, straight line and distance to a smooth manifold. For each tangent space T_x M^n, there exists a one-to-one map exp_x(v) : T_x M^n → M^n from an ε-ball at the origin of T_x M^n to a neighborhood of x on M^n, called the exponential map. We refer to the inverse of the exponential map as the logarithm map, denoted log_x(y) : M^n → T_x M^n. In addition, a parallel transport P_{x→x'} : T_x M^n → T_{x'} M^n intuitively transports tangent vectors along a "straight" line between x and x', so that they remain "parallel." This is the basic machinery that allows us to generalize distributions and computations to hyperbolic space, as detailed in later sections.
Poincaré Ball Model. Hyperbolic geometry is a non-Euclidean geometry with constant negative curvature. As a classical model of hyperbolic space, an n-dimensional Poincaré ball with curvature parameter c ≥ 0 (i.e., radius 1/√c) can be denoted as B^n_c := {z ∈ R^n : c‖z‖² < 1}, with metric tensor g^c_z = (λ^c_z)² g^E, where λ^c_z = 2/(1 − c‖z‖²) and g^E denotes the regular Euclidean metric tensor. Intuitively, as z moves closer to the boundary, the hyperbolic distance between z and a nearby z' diverges at the rate 1/(1 − c‖z‖²) → ∞. This implies significant representation capacity, as very dissimilar objects can be encoded on a compact domain. Note that as c → 0, the model recovers the Euclidean space R^n, i.e., the lack of hierarchy. In comparison, a larger c implies a stronger hierarchical organization.

Mathematical Operations. We review the closed-form mathematical operations that enable differentiable training for hyperbolic-space models, namely the hyperbolic algebra (vector addition) and tangent-space computations (exponential/logarithm maps and parallel transport). The hyperbolic algebra is formulated under the framework of gyrovector spaces (Ungar, 2008), with the addition of two points z, z' ∈ B^n_c given by the Möbius addition:

z ⊕_c z' = [(1 + 2c⟨z, z'⟩ + c‖z'‖²) z + (1 − c‖z‖²) z'] / [1 + 2c⟨z, z'⟩ + c²‖z‖²‖z'‖²].
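The Möbius addition can be implemented in a few lines. The sketch below is illustrative (NumPy, single vectors rather than batches) and checks that the operation reduces to ordinary vector addition as c → 0, matching the remark that B^n_c degenerates to R^n:

```python
import numpy as np

def mobius_add(z, zp, c):
    # Moebius (gyrovector) addition of z and zp on the Poincare ball B^n_c.
    zz = np.dot(z, zp)
    nz, nzp = np.dot(z, z), np.dot(zp, zp)
    num = (1 + 2 * c * zz + c * nzp) * z + (1 - c * nz) * zp
    den = 1 + 2 * c * zz + c ** 2 * nz * nzp
    return num / den
```

The origin acts as the identity element on both sides, so the ball's center plays the role that the zero vector plays in Euclidean space.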
For any point μ ∈ B^n_c, the exponential map and the logarithm map are given, for u ≠ 0 and y ≠ μ, by

exp^c_μ(u) = μ ⊕_c ( tanh(√c λ^c_μ ‖u‖ / 2) u / (√c ‖u‖) ),
log^c_μ(y) = (2 / (√c λ^c_μ)) tanh⁻¹(√c ‖−μ ⊕_c y‖) (−μ ⊕_c y) / ‖−μ ⊕_c y‖.

Figure 2: Overview of APo-VAE, where x = [x_1, ..., x_T] is the text sequence and s_k = [s_{k,1}, ..., s_{k,T}] is a pseudo-input. The posterior (blue) is obtained by (7), and the VampPrior (red) is obtained by (12).

Adversarial Poincaré VAE
We first introduce our hyperbolic encoder and decoder, and how to apply reparametrization. We then provide detailed descriptions on model implementation, explaining how the primal-dual form of KL divergence can help stabilize training. Finally, we describe how to adopt VampPrior (Tomczak and Welling, 2018) to enhance performance. A summary of our model scheme is provided in Figure 2.

Flexible Wrapped Distribution Encoder
We begin by generalizing the standard normal distribution to the Poincaré ball. While there are a few competing definitions of the hyperbolic normal, we choose the wrapped normal as our prior and variational posterior, largely due to its flexibility for more expressive generalization. A wrapped normal distribution N_{B^n_c}(μ, Σ) is defined as follows: (i) sample a vector v from N(0, Σ); (ii) parallel-transport v to u := P^c_{0→μ}(v); and (iii) use the exponential map to project u back to z := exp^c_μ(u). Putting these together, a latent sample has the reparametrizable form

z = exp^c_μ(P^c_{0→μ}(v)), v ∼ N(0, Σ). (7)

For approximate posteriors, (μ, Σ) depends on x.
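Putting the three steps of (7) together, a minimal sampling routine might look as follows. This is an illustrative NumPy sketch, not the paper's implementation; it assumes the closed-form parallel transport from the origin P^c_{0→μ}(v) = (λ^c_0 / λ^c_μ) v with λ^c_0 = 2, together with the Möbius addition and exponential map reviewed earlier:

```python
import numpy as np

def mobius_add(z, zp, c):
    # Moebius addition on the Poincare ball B^n_c.
    zz = np.dot(z, zp)
    nz, nzp = np.dot(z, z), np.dot(zp, zp)
    num = (1 + 2 * c * zz + c * nzp) * z + (1 - c * nz) * zp
    den = 1 + 2 * c * zz + c ** 2 * nz * nzp
    return num / den

def sample_wrapped_normal(mu, sigma, c, rng):
    # (i) sample in the tangent space at the origin
    v = rng.normal(0.0, sigma, size=mu.shape)
    # (ii) parallel-transport v from 0 to mu: scale by lambda_0 / lambda_mu
    lam_mu = 2.0 / (1.0 - c * np.dot(mu, mu))
    u = (2.0 / lam_mu) * v
    # (iii) exponential map at mu lands the sample back on the ball
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:
        return mu
    g = np.tanh(np.sqrt(c) * lam_mu * norm_u / 2.0) * u / (np.sqrt(c) * norm_u)
    return mobius_add(mu, g, c)
```

Because tanh is bounded by 1 and Möbius addition is closed on the ball, every sample satisfies c‖z‖² < 1 by construction.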
We further generalize the (restrictive) hyperbolic wrapped normal by acknowledging that under the implicit-VAE framework, one does not need the approximate posterior q_φ(z|x) to be analytically tractable. This allows us to replace the tangent-space sampling step v ∼ N(0, Σ) in (7) with a more flexible implicit distribution from which we draw samples as v := G(x, ξ; φ_1) for ξ ∼ N(0, I). Note that now μ := F(x; φ_2) can be regarded as a deterministic displacement vector that anchors embeddings to the correct semantic neighborhood, allowing the stochastic v to focus only on modeling the local uncertainty of the semantic embedding. The synergy between the deterministic and stochastic parts enables efficient representation learning relative to existing alternatives. For simplicity, we denote the encoder neural network as EncNet_φ, which contains G and F, with φ = {φ_1, φ_2}.

Poincaré Decoder
To build a geometry-aware decoder for a hyperbolic latent code, we follow prior work and use a generalized linear function analogously defined in hyperbolic space. A Euclidean linear function takes the form f_{a,b}(z) = ⟨a, z − b⟩ = sign(⟨a, z − b⟩) ‖a‖ d_E(z, H_{a,b}), where H_{a,b} = {z' ∈ R^n : ⟨a, z' − b⟩ = 0} is a hyperplane passing through b with a as the normal direction, and d_E(z, H) is the distance between z and the hyperplane H. The counterpart in the Poincaré ball analogously writes

f^c_{a,b}(z) = sign(⟨a, log^c_b(z)⟩_b) ‖a‖_b d_c(z, H^c_{a,b}), (8)

where H^c_{a,b} and d_c(z, H^c_{a,b}) are the gyroplane and the distance between z and the gyroplane, respectively. Specifically, we use the hyperbolic linear function in (8) to extract features from the Poincaré embedding z. The feature f^c_{a,b}(z) is the input to the RNN decoder. We denote the combined network of f^c_{a,b} and the RNN decoder as DecNet_θ, where the parameters θ contain a and b.
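The gyroplane distance in (8) admits a closed form in the hyperbolic-neural-network literature, which the sketch below assumes: d_c(z, H^c_{a,b}) = (1/√c) sinh⁻¹( 2√c |⟨−b ⊕_c z, a⟩| / ((1 − c‖−b ⊕_c z‖²) ‖a‖) ). This is an illustrative NumPy sketch, not the paper's implementation; as c → 0 it recovers (twice) the Euclidean point-to-hyperplane distance, consistent with the factor λ^c_0 = 2 at the origin.

```python
import numpy as np

def mobius_add(x, y, c):
    # Moebius addition on the Poincare ball B^n_c.
    xy = np.dot(x, y)
    nx, ny = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * ny) * x + (1 - c * nx) * y
    den = 1 + 2 * c * xy + c ** 2 * nx * ny
    return num / den

def gyroplane_distance(z, a, b, c):
    # Distance from z to the gyroplane through b with normal direction a,
    # the hyperbolic analogue of point-to-hyperplane distance used in (8).
    d = mobius_add(-b, z, c)
    num = 2 * np.sqrt(c) * abs(np.dot(d, a))
    den = (1 - c * np.dot(d, d)) * np.linalg.norm(a)
    return np.arcsinh(num / den) / np.sqrt(c)
```

A point lying on the gyroplane (e.g., z = b) has distance zero, so the feature in (8) changes sign exactly as the embedding crosses the gyroplane.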

Implementing APo-VAE
While it is straightforward to compute the ELBO (3) via Monte Carlo estimates using the explicit wrapped normal density (Mathieu et al., 2019), we empirically observe that: (i) the normal assumption restricts the expressiveness of the model, and (ii) the wrapped normal likelihood makes training unstable. Therefore, we appeal to a primal-dual view of VAE training to overcome such difficulties (Rockafellar et al., 1966; Tao et al., 2019). Specifically, the KL term in (3) can be reformulated as

D_KL(q_φ(z|x) ‖ p(z)) = max_ψ { E_{q_φ(z|x)}[ν_ψ(x, z)] − E_{p(z)}[exp(ν_ψ(x, z))] + 1 },

where ν_ψ(x, z) is the (auxiliary) dual function (i.e., a neural network) with parameters ψ. The primal-dual view of the KL term enhances the approximation ability, while remaining computationally tractable. Meanwhile, since the density function in the original KL term in (3) is replaced by the dual function ν_ψ(x, z), we avoid direct computation of the probability density function of the wrapped normal distribution.
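The dual representation can be sanity-checked numerically: plugging the optimal dual function ν*(x, z) = log q(z) − log p(z) into the right-hand side recovers the true KL. The toy check below is illustrative only, substituting 1-D Gaussians for the wrapped normal and the known optimum for a trained ν_ψ; here KL(N(1,1) ‖ N(0,1)) = 0.5 analytically.

```python
import numpy as np

def log_normal(z, mu, var):
    # Log-density of a univariate Gaussian N(mu, var).
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

mu_q = 1.0  # q = N(1, 1), p = N(0, 1); analytic KL(q || p) = 0.5

def nu_star(z):
    # Optimal dual function: the log-density ratio log q(z) - log p(z).
    return log_normal(z, mu_q, 1.0) - log_normal(z, 0.0, 1.0)

rng = np.random.default_rng(0)
zq = rng.normal(mu_q, 1.0, 200000)  # samples from q
zp = rng.normal(0.0, 1.0, 200000)   # samples from p
# Dual estimate: E_q[nu] - E_p[exp(nu)] + 1
kl_dual = nu_star(zq).mean() - np.exp(nu_star(zp)).mean() + 1.0
```

In APo-VAE the optimum is of course unknown, so ν_ψ is trained adversarially toward it rather than plugged in analytically.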
To train our proposed APo-VAE with the primal-dual form of the VAE objective, we follow the training schemes of coupled variational Bayes (CVB) and implicit VAE, which optimize the objective adversarially. Specifically, we update ψ in the dual function ν_ψ(x, z) to maximize

E_{x∼X}[ E_{q_φ(z|x)}[ν_ψ(x, z)] − E_{p(z)}[exp(ν_ψ(x, z))] ], (10)

where E_{x∼X}[·] denotes the expectation over the empirical distribution on observations. Accordingly, parameters θ and φ are updated to maximize

E_{x∼X}[ E_{q_φ(z|x)}[log p_θ(x|z) − ν_ψ(x, z)] ], (11)

so that the dual term is maximized in (10) while it is minimized in (11), i.e., adversarial learning. In other words, one can consider the dual function as a discriminative network that distinguishes between the prior z ∼ p(z) and the variational posterior z ∼ q_φ(z|x), both of which are paired with the input data x ∼ X.

Algorithm 1 Training procedure of APo-VAE.
for each minibatch do
    # Sampling in the hyperbolic space.
    ...
    Map u_m to z_m = exp^c_{μ_m}(u_m) by (5).
    # Update the dual function and the pseudo-input.
    ...
    Update θ and φ by gradient ascent on (11).
end for

Data-driven Prior
While the use of a standard normal prior is a simple choice in Euclidean space, we argue that it induces bias in the hyperbolic setup. Natural sentences carry specific meaning, so it is unrealistic for the bulk of the probability mass to concentrate at the center (this holds in low dimensions; in high dimensions, the mass concentrates near the surface of a sphere, which may partly explain why cosine similarity works favorably compared with Euclidean distance in NLP applications).
To reduce the induced bias from a pre-fixed prior, we adopt the VampPrior framework (Tomczak and Welling, 2018), in which the prior is a mixture of variational posteriors conditioned on learnable pseudo-data points. Specifically, we consider the prior as a learnable distribution given by

p_δ(z) = (1/K) Σ_{k=1}^K q_φ(z|s_k), (12)

where q_φ is the learned approximate posterior, and we call the parameters δ := {s_k}_{k=1}^K pseudo-inputs. Intuitively, p_δ(z) seeks to match the aggregated posterior q_φ(z) = E_{x∼X}[q_φ(z|x)].

Related Work

Hyperbolic Space Representation Learning. There has been a recent surge of interest in representation learning in hyperbolic space, largely due to its exceptional effectiveness in modeling data with underlying graphical structure (Chamberlain et al., 2017), such as relation nets (Nickel and Kiela, 2017). In the context of NLP, hyperbolic geometry has been considered for word embeddings (Tifrea et al., 2018). A popular vehicle for hyperbolic representation learning is the autoencoder (AE) framework (Grattarola et al., 2019; Ovinnikov, 2019), where the decoders are built to efficiently exploit the hyperbolic geometry. Closest to our APo-VAE are the hyperbolic VAEs of Mathieu et al. (2019) and Nagano et al. (2019), where wrapped normal distributions have also been used. Drawing power from the dual form of the KL, the proposed APo-VAE features an implicit posterior and a data-driven prior, showing improved training stability.

Experiments
We evaluate the proposed model on three tasks: (i) language modeling, (ii) unaligned style transfer, and (iii) dialog-response generation, with quantitative results, human evaluation and qualitative analysis.

Experimental Setup
Datasets. We use three datasets for language modeling: Penn Treebank (PTB) (Marcus et al., 1993), Yahoo and Yelp corpora (Yang et al., 2017). PTB contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths. Yahoo and Yelp are much larger datasets, each containing 100k sentences with greater average length.
For unaligned style transfer, we use the Yelp restaurant reviews dataset (Shen et al., 2017), which is obtained by pre-processing the Yelp dataset, i.e., sentences are shortened for more feasible sentence level sentiment analysis. Overall, the dataset includes 350k positive and 250k negative reviews (based on user rating).
Following prior work, we use the Switchboard (Godfrey and Holliman, 1997) dataset for dialogue-response generation. Switchboard contains 2.4k two-sided telephone conversations, manually transcribed and aligned. We split the data into training, validation and test sets following the protocol described in Zhao et al. (2017b).
Evaluation Metrics. To benchmark language-modeling performance, we report the ELBO and perplexity (PPL) of APo-VAE and baselines. To verify that the proposed APo-VAE is more resistant to posterior collapse, we also report the KL divergence D_KL(q_φ(z|x) ‖ p(z)) and the mutual information (MI) between z and x (He et al., 2019). The number of active units (AU) of the latent code is also reported, where the activity of a latent dimension z is measured as A_z = Cov_x(E_{z∼q_φ(z|x)}[z]), and the dimension is defined as active if A_z > 0.01.
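The AU statistic can be computed directly from posterior means. The sketch below is illustrative (hypothetical function name, one posterior-mean vector per sentence assumed):

```python
import numpy as np

def active_units(post_means, threshold=0.01):
    # post_means: (num_sentences, latent_dim) array holding
    # E_{z ~ q_phi(z|x)}[z] for each observation x.
    # A_z is the per-dimension variance of these means across the data;
    # a dimension counts as active if A_z exceeds the threshold.
    a_z = np.var(post_means, axis=0)
    return int(np.sum(a_z > threshold)), a_z
```

A collapsed dimension's posterior mean is (nearly) constant in x, so its A_z is close to zero and the unit is not counted.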
To evaluate our model on unaligned style transfer, we consider the transfer accuracy from one sentiment to another, the BLEU scores between original and transferred sentences, the reconstruction perplexity of original sentences, and the reverse perplexity (RPPL) based on a language model from the transferred sentences.
For dialogue-response generation, we adopt the evaluation metrics used in previous studies (Zhao et al., 2017b), including BLEU (Papineni et al., 2002), BOW embedding metrics, and intra/inter-dist values. The first two metrics assess the relevance of the generated responses, and the third evaluates diversity.
Model Implementation. For language modeling, we adopt LSTMs for both the encoder and decoder, with the dimension of the latent code set to 32. Following Mathieu et al. (2019), the hyper-parameter c is set to 0.7. For unaligned style transfer, we extend our model with a sentiment classifier as described above. For dialogue-response generation, we modify APo-VAE following the conditional-VAE framework (Zhao et al., 2017b). Specifically, an extra input of context embedding s is supplied to the model (i.e., p_θ(x, z|s), q_φ(z|x, s)). The prior p(z|s) is a wrapped normal conditioned on the context embedding, learned together with the posterior.

Experimental Results
Language Modeling. We compare APo-VAE with Euclidean VAE baselines, the Poincaré VAE (Mathieu et al., 2019), and iVAE. On all three datasets, the proposed model achieves lower negative ELBO and PPL than the other models, demonstrating its strong ability to model sequential text data. Meanwhile, the larger KL term and higher mutual information (between z and x) of the APo-VAE model indicate its robustness in handling posterior collapse. In addition, the introduction of a data-driven prior (denoted APo-VAE+VP) further boosts performance, especially on negative ELBO and PPL.
Visualization. To verify our hypothesis that the proposed model is capable of learning latent tree structure in text data, we visualize the two-dimensional projection of the learned latent codes in Figure 3. For visualization, we randomly draw 5k samples from PTB-test, and encode them into the latent space using the APo-VAE encoder. We color-code each sentence based on its length (i.e., blue for long sentences and red for short sentences). Note that only a small portion of the data (< 10%) have a length longer than 32, and human inspection verified that most of them contain multiple sub-sentences; we exclude these samples from our analysis. As shown in Figure 3, longer sentences (dark blue) tend to occupy the outer rim of the Poincaré ball, while the shorter ones (dark red) are concentrated in the inner area. We also select some long sample sentences (dark blue), and manually shorten them to create several variants of different lengths (ranging from 6 to 27), which are related in a hierarchical manner based on human judgement. We visualize their latent codes projected by the trained APo-VAE. The resulting plot is consistent with a hierarchical structure: as a sentence becomes more specific, its embedding moves outward. We also decode from the neighbours of these latent codes; the outputs (see the Appendix) demonstrate a similar hierarchical structure.
Unaligned Style Transfer. Table 3 shows the results on the Yelp restaurant reviews dataset. APo-VAE achieves BLEU scores more than 1 point higher than iVAE, capturing a more informative and structured feature space. Comparable performance is achieved on the other evaluation metrics.
Dialogue Response Generation. Results on SwitchBoard are summarized in Table 2. Our proposed model generates comparable or better responses than the baseline models in terms of both relevance (BLEU and BOW) and diversity (intra/inter-dist). APo-VAE improves the average recall from 0.427 (by iVAE) to 0.438, while significantly enhancing generation diversity (e.g., from 0.692 to 0.792 for intra-dist-2).
Human Evaluation. We further perform human evaluation via Amazon Mechanical Turk. We asked the turkers to compare generated responses from two models, and assess each model's informativeness, relevance to the dialog context (coherence), and diversity. We use 500 randomly sampled contexts from the test set, each assessed by three judges. In order to evaluate diversity, 5 responses are generated for each dialog context. For quality control, only workers with a lifetime task approval rating greater than 98% were allowed to participate in our study. Table 4 summarizes the human evaluation results. The responses generated by our model are clearly preferred by the judges compared with the other competing methods.

Conclusions
We present APo-VAE, a novel model for text generation in hyperbolic space. Our model can learn latent hierarchies in natural language via the use of wrapped normals for the prior. A primal-dual view of KL divergence is adopted for robust model training. Extensive experiments on language modeling, text style transfer, and dialog response generation demonstrate the superiority of the model. For future work, we plan to combine APo-VAE with the currently prevailing large-scale pre-trained language models.

A1 Basics of Riemannian Geometry (Extended)
This section provides additional coverage of the basic mathematical concepts used in APo-VAE; for more detailed mathematical expositions, readers are referred to standard references on Riemannian geometry. A real, smooth n-dimensional manifold M is a set of points locally similar to a linear space R^n. At each point x of the manifold M there is a real vector space of the same dimensionality as M, called the tangent space at x: T_x M. Intuitively, it contains all the possible directions in which one can tangentially pass through x. For each point x there is also a metric tensor g_x(·,·) : T_x M × T_x M → R, which defines an inner product on the associated tangent space T_x M. More specifically, given a coordinate system, the inner product is given in the quadratic form g_x(u, v) = u^T G_x v, where by slight abuse of notation u, v ∈ R^n are vector representations of the tangent vectors w.r.t. the coordinate system and G_x ∈ R^{n×n} is a positive-definite matrix. Collectively, (M, g) defines a Riemannian manifold.
(a) Parallel transport a tangent vector (arrow) from one location (black) to another (orange).
(b) Map the transported tangent vector (orange) to a point (green) in the hyperbolic space by using the exponential map. Figure A1: Visualization of different mathematical operations in a hyperbolic space, that are used to define the wrapped distribution.
The metric tensor is used to generalize notions such as distance and volume from Euclidean space to the Riemannian manifold. Given a curve γ(t) : [0, 1] → M, its length is given by

L(γ) = ∫_0^1 √( g_{γ(t)}(γ'(t), γ'(t)) ) dt.

Figure A2: Mapping a Gaussian distribution (red) and an implicit distribution (blue) to the hyperbolic space.
The concept of straight lines can then be generalized to geodesics, the shortest paths between pairs of points x, y on the manifold: γ*(x, y) = arg min_γ L(γ) such that γ(0) = x and γ(1) = y, with γ traveling at constant speed. The concept of moving along a straight line with constant speed defines the exponential map: for v ∈ T_x M there is a unique geodesic γ satisfying γ(0) = x with initial tangent vector v, and the corresponding exponential map is defined by exp_x(v) = γ(1). We call the inverse of the exponential map the logarithm map, log_x = exp_x^{-1} : M → T_x M, mapping from the manifold to the tangent space. The Poincaré ball model is geodesically complete, in the sense that exp_x is well-defined on the full tangent space T_x M.

A2 Additional Related Work
VAE with Adversarial Learning. One of the first approaches to apply adversarial learning to the VAE is Adversarial Variational Bayes (AVB) (Mescheder et al., 2017; Pu et al., 2017). Motivated by the Generative Adversarial Network (GAN) (Goodfellow et al., 2014), AVB introduces an auxiliary discriminator that transforms the maximum-likelihood problem into a two-player game. Similarly, the Adversarial Autoencoder (AAE) uses adversarial learning to match the aggregated posterior with the prior. Building on this, Coupled Variational Bayes (CVB) connects the primal-dual view of the ELBO with adversarial learning, where the discriminator takes both the data sample and the latent code as input. This approach is also adopted in the implicit VAE for text generation; the prior used in the implicit VAE, however, remains a fixed Euclidean Gaussian.

Evaluation Metric Details. BLEU (Papineni et al., 2002) is used to measure the amount of n-gram overlap between a generated response and the reference. Specifically, BLEU scores for n < 4 are computed; the average and the maximum scores are taken as n-gram precision and n-gram recall, respectively. In addition, the BOW embedding metric is used to measure the cosine similarity between bag-of-word embeddings of the response and the reference. Three variants of the cosine similarity are considered: (i) computed by greedily matching words in the two utterances; (ii) between the averaged embeddings of the two utterances; and (iii) between the largest extreme values among the embeddings of the two utterances. We also follow Gu et al. and use the distinct metric to measure the diversity of generated text. Dist-n is the ratio of unique n-grams over all n-grams in the generated sentences. Intra-dist and inter-dist are the average distinct values within each generated sample and among all generated samples, respectively.
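The Dist-n ratio described above is straightforward to compute. A minimal sketch (illustrative; simple whitespace tokenization assumed):

```python
def dist_n(sentences, n):
    # Dist-n: ratio of unique n-grams to all n-grams in the generated text.
    ngrams = []
    for s in sentences:
        toks = s.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)
```

Intra-dist averages this value over the responses generated for a single context, while inter-dist computes it over the pooled responses for all contexts.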
(a) Hyperbolic latent space for Yahoo.
(b) Hyperbolic latent space for Yelp. Figure A4: Visualization of the hyperbolic latent space of 5,000 randomly sampled sentences from different datasets.

A3.2 Additional Implementation Details
For language modeling, we adopt LSTMs for both the encoder and decoder, with 256 hidden units for PTB, and 1024 hidden units for Yahoo and Yelp. The dimension of the latent code is set to 32. Following Mathieu et al. (2019), the hyper-parameter c is set to 0.7. We set the vocabulary size to 10,000 for PTB, and 20,000 for both Yahoo and Yelp. The word-embedding size is 256 for PTB, and 512 for Yahoo and Yelp. For dialogue-response generation, we use the GRU (Cho et al., 2014) with 300 hidden units in each direction for both the response encoder and the context encoder, and 300 hidden units for the decoder. The latent code z has a dimension of 200. The size of the vocabulary is set to 10,000, and the word-embedding size is 200, initialized by GloVe (Pennington et al., 2014).

Table A1: Input sentences (left) and sentences decoded from neighbors of their latent codes (right).
Input: the national cancer institute ban smoking | Sample: the unk and drug administration were talking
Input: the national cancer institute warns citizens to avoid smoking cigarette | Sample: the unk and drug administration officials also are used to unk
Input: the national cancer institute claims that smoking cigarette too often would increase the chance of getting lung cancer | Sample: the u.s. and drug administration officials say they are n't unk to be used by the government
Input: the national cancer institute also projected that overall u.s. mortality rates from lung cancer should begin to drop in several years if cigarette smoking continues to decrease | Sample: the u.s. and drug administration officials unk they are looking for ways to <unk> their own accounts for some of their assets to be sold by some companies

A3.3 Additional Results
For language modeling, we plot the hyperbolic latent space for Yahoo and Yelp in Figure A4. To demonstrate the hierarchical structure in the generated sentences (i.e., the decoder), we choose 4 sentences (from short to long) with a clear hierarchy, listed on the left-hand side of Table A1. These sentences are encoded into hyperbolic space using a well-trained APo-VAE. We then decode from a randomly selected neighbor of each of the 4 latent codes. The output sentences are shown on the right-hand side of Table A1, demonstrating a similar hierarchy to the input sentences. Moreover, we directly measure generation quality using PPL and reverse PPL, shown in Table A3; APo-VAE achieves consistently better performance. For dialog-response generation, we include additional results on the DailyDialog dataset, which contains 13k daily conversations of an English learner in daily life. We also provide examples of generated responses along with their corresponding dialog contexts in Table A5. Samples generated by APo-VAE are more relevant to the contexts than those of the baseline models.