HyperText: Endowing FastText with Hyperbolic Geometry

Natural language data exhibit tree-like hierarchical structures, such as the hypernym–hyponym hierarchy in WordNet. FastText, a state-of-the-art text classifier based on a shallow neural network in Euclidean space, may not represent such hierarchies precisely given its limited representation capacity. Considering that hyperbolic space is naturally suited to modeling tree-like hierarchical data, we propose a new model, HyperText, for efficient text classification, obtained by endowing FastText with hyperbolic geometry. Empirically, we show that HyperText outperforms FastText on a range of text classification tasks with substantially fewer parameters.


Introduction
FastText (Joulin et al., 2016) is a simple and efficient neural network for text classification, which achieves performance comparable to that of many deep models such as char-CNN (Zhang et al., 2015) and VDCNN (Conneau et al., 2016), at a much lower computational cost in training and inference. However, natural language data exhibit tree-like hierarchies in several respects (Dhingra et al., 2018), such as the taxonomy of WordNet. In Euclidean space, the representation capacity of a model is strictly bounded by the number of parameters. Thus, conventional shallow neural networks (e.g., FastText) may not represent tree-like hierarchies efficiently given limited model complexity.
Fortunately, hyperbolic space is naturally suited to modeling tree-like hierarchical data. Theoretically, hyperbolic space can be viewed as a continuous analogue of trees, and it can embed trees with arbitrarily low distortion (Krioukov et al., 2010). Experimentally, Nickel and Kiela (2017) first used the Poincaré ball model to embed hierarchical data into hyperbolic space and achieved promising results on learning word embeddings in WordNet. Inspired by their work, we propose HyperText for text classification by endowing FastText with hyperbolic geometry. We base our method on the Poincaré ball model of hyperbolic space. Specifically, we exploit Poincaré ball embeddings of words and ngrams to capture the latent hierarchies in natural language sentences. Further, we use the Einstein midpoint (Gulcehre et al., 2018) as the pooling method to emphasize semantically specific words, which usually carry more information but occur less frequently than general ones (Dhingra et al., 2018). Finally, we employ the Möbius linear transformation (Ganea et al., 2018) as the hyperbolic classifier. We evaluate our approach on text classification using ten standard datasets and observe that HyperText outperforms FastText on eight of them. Besides, HyperText is much more parameter-efficient: across different tasks, only 17%∼50% of FastText's parameters are needed for HyperText to achieve comparable performance. Meanwhile, the computational cost of our model increases moderately (2.6x in inference time) over FastText.

Overview
Figure 1 illustrates the connections and distinctions between FastText and HyperText. The differences in model architecture are threefold. First, the input tokens in HyperText are embedded using hyperbolic geometry, specifically the Poincaré ball model, instead of Euclidean geometry. Second, the Einstein midpoint is adopted in the pooling layer so as to emphasize semantically specific words. Last, the Möbius linear transformation is chosen as the prediction layer. In addition, Riemannian optimization is employed in training HyperText.

Poincaré Embedding Layer
There are several models of hyperbolic space, such as the Poincaré ball model, the hyperboloid model and the Klein model, which offer different affordances for computation. In HyperText, we choose the Poincaré ball model to embed the input words and ngrams so as to better exploit the latent hierarchical structure in text. The Poincaré ball model of hyperbolic space corresponds to the Riemannian manifold $(\mathbb{B}^d, g_x)$, where $\mathbb{B}^d = \{x \in \mathbb{R}^d : \|x\| < 1\}$ is an open $d$-dimensional unit ball ($\|\cdot\|$ denotes the Euclidean norm) and $g_x$ is the Riemannian metric tensor:

$$g_x = \lambda_x^2 g^E, \quad \lambda_x = \frac{2}{1 - \|x\|^2},$$

where $\lambda_x$ is the conformal factor and $g^E = I_d$ denotes the Euclidean metric tensor. While performing back-propagation, we use Riemannian gradients to update the Poincaré embeddings. The Riemannian gradients are computed by rescaling the Euclidean gradients:

$$\nabla_R = \frac{(1 - \|x\|^2)^2}{4} \nabla_E.$$

Since ngrams retain sequence-order information, given a text sequence $S = \{w_i\}_{i=1}^{m}$, we embed all the words and ngrams into the Poincaré ball, denoted as $X = \{x_i\}_{i=1}^{k}$, where $x_i \in \mathbb{B}^d$.
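The gradient rescaling above is a one-line operation. As a minimal NumPy sketch (the function name `riemannian_grad` is ours, for illustration only):

```python
import numpy as np

def riemannian_grad(x, euclidean_grad):
    """Rescale a Euclidean gradient into the Riemannian gradient on the
    Poincare ball: grad_R = ((1 - ||x||^2)^2 / 4) * grad_E."""
    sq_norm = np.dot(x, x)  # ||x||^2, assumed < 1 inside the ball
    return ((1.0 - sq_norm) ** 2 / 4.0) * euclidean_grad
```

Note that the scaling factor shrinks toward zero near the boundary of the ball, so points embedded deep in the hierarchy receive smaller updates.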

Einstein midpoint Pooling Layer
Average pooling is a standard way to summarize features, as in FastText. In Euclidean space, the average pooling is

$$m = \frac{1}{k} \sum_{i=1}^{k} x_i.$$

To extend the average pooling operation to hyperbolic space, we adopt a weighted midpoint method called the Einstein midpoint (Gulcehre et al., 2018). In the $d$-dimensional Klein model $\mathbb{K}^d$, the Einstein midpoint takes the weighted average of embeddings:

$$m_K = \frac{\sum_{i=1}^{k} \gamma_i x_i^K}{\sum_{i=1}^{k} \gamma_i}, \quad \gamma_i = \frac{1}{\sqrt{1 - \|x_i^K\|^2}},$$

where $\gamma_i$ are the Lorentz factors. However, our embedding layer is based on the Poincaré model rather than the Klein model, which means we cannot directly compute the Einstein midpoints using the formula above. Nevertheless, the various models commonly used for hyperbolic geometry are isomorphic, so we can first project the input embeddings to the Klein model, execute the Einstein midpoint pooling, and then project the result back to the Poincaré model.
The transition formulas between the Poincaré and Klein models are as follows:

$$x_K = \frac{2 x_P}{1 + \|x_P\|^2}, \qquad m_P = \frac{m_K}{1 + \sqrt{1 - \|m_K\|^2}},$$

where $x_P$ and $x_K$ denote token embeddings in the Poincaré and Klein models respectively, and $m_P$ and $m_K$ are the Einstein midpoint pooling vectors in the Poincaré and Klein models. It should be noted that points near the boundary of the Poincaré ball get larger weights in the Einstein midpoint formula. These points (tokens) are regarded as more representative, providing salient information for the text classification task (Dhingra et al., 2018).
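The project–pool–project pipeline can be sketched in NumPy as follows (a minimal sketch under the stated formulas; function names are ours, not from the released implementation):

```python
import numpy as np

def poincare_to_klein(x):
    """Map points from the Poincare ball to the Klein model."""
    return 2.0 * x / (1.0 + np.sum(x * x, axis=-1, keepdims=True))

def klein_to_poincare(x):
    """Map points from the Klein model back to the Poincare ball."""
    sq = np.sum(x * x, axis=-1, keepdims=True)
    return x / (1.0 + np.sqrt(1.0 - sq))

def einstein_midpoint(xs_poincare):
    """Pool a (k, d) array of Poincare embeddings: project to the Klein
    model, take the Lorentz-factor-weighted average, project back."""
    xk = poincare_to_klein(xs_poincare)
    # Lorentz factors: larger for points near the boundary
    gamma = 1.0 / np.sqrt(1.0 - np.sum(xk * xk, axis=-1, keepdims=True))
    mk = (gamma * xk).sum(axis=0) / gamma.sum()
    return klein_to_poincare(mk)
```

As a sanity check, pooling identical points returns that point, and pooling two antipodal points returns the origin.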

Möbius Linear Layer
The Möbius linear transformation is an analogue of the linear mapping in Euclidean neural networks. We use the Möbius linear layer to combine the features output by the pooling layer and complete the classification task, which takes the form

$$z = M \otimes_c m_P \oplus_c b,$$

where $\otimes_c$ and $\oplus_c$ denote Möbius matrix multiplication and Möbius addition, defined as follows (Ganea et al., 2018):

$$M \otimes_c x = \frac{1}{\sqrt{c}} \tanh\!\left(\frac{\|Mx\|}{\|x\|} \tanh^{-1}\!\big(\sqrt{c}\,\|x\|\big)\right) \frac{Mx}{\|Mx\|},$$

$$x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}.$$

Here $M \in \mathbb{R}^{n \times d}$ denotes the weight matrix, $n$ denotes the number of classes, $b \in \mathbb{R}^n$ is the bias vector, and $c$ is a hyper-parameter that controls the curvature of the hyperbolic space. To obtain the categorical probabilities $\hat{y}$, a softmax layer is applied after the Möbius linear layer.
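The two Möbius operations above translate directly into code. A minimal NumPy sketch of the layer, following the formulas of Ganea et al. (2018) (function names are ours):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition of two points on the Poincare ball (curvature c)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def mobius_matvec(M, x, c=1.0):
    """Mobius matrix-vector multiplication on the Poincare ball."""
    Mx = M @ x
    x_norm, Mx_norm = np.linalg.norm(x), np.linalg.norm(Mx)
    if x_norm == 0 or Mx_norm == 0:
        return np.zeros_like(Mx)
    sc = np.sqrt(c)
    return np.tanh(Mx_norm / x_norm * np.arctanh(sc * x_norm)) * Mx / (sc * Mx_norm)

def mobius_linear(M, x, b, c=1.0):
    """The Mobius linear layer: z = (M (x) x) (+) b."""
    return mobius_add(mobius_matvec(M, x, c), b, c)
```

With the identity matrix and a zero bias, the layer reduces to the identity map, mirroring the behavior of a Euclidean linear layer.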

Model Optimization
This paper uses the cross-entropy loss for the multi-class classification task:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{n} y_{ij} \log \hat{y}_{ij},$$

where $N$ is the number of training examples and $y$ is the one-hot representation of the ground-truth labels. For training, we use the Riemannian optimizer (Bécigneul and Ganea, 2018), which is more accurate for hyperbolic models. We refer the reader to the original paper for more details.
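For intuition, a single parameter update on the Poincaré ball can be sketched as plain Riemannian SGD with a projection back into the ball. This is a simplified retraction-based variant for illustration, not the adaptive Riemannian optimizer of Bécigneul and Ganea (2018) used in the paper; the function name and the `eps` margin are our assumptions:

```python
import numpy as np

def rsgd_step(x, euclidean_grad, lr=0.05, eps=1e-5):
    """One simplified Riemannian SGD step on the Poincare ball:
    rescale the Euclidean gradient by the inverse metric, take a
    gradient step, and project back inside the open unit ball."""
    sq_norm = np.dot(x, x)
    riem_grad = ((1.0 - sq_norm) ** 2 / 4.0) * euclidean_grad
    x_new = x - lr * riem_grad
    norm = np.linalg.norm(x_new)
    if norm >= 1.0:  # clip to keep the point strictly inside the ball
        x_new = x_new / norm * (1.0 - eps)
    return x_new
```

The projection step is what keeps embeddings valid: however large the gradient, the updated point stays strictly inside the unit ball.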

Experimental setup
Datasets To make a comprehensive comparison with FastText, we use the same eight datasets as Joulin et al. (2016) in our experiments. We also add two Chinese text classification datasets from CLUE (Xu et al., 2020), which are presumably more challenging. We summarize the statistics of the datasets used in our experiments in Table 2.
Hyperparameters Following Joulin et al. (2016), we set the embedding dimension to 10 for the first eight datasets in Table 1. On the TNEWS and IFLYTEK datasets, we use 200-dimensional and 300-dimensional embeddings, respectively. The learning rate is selected on a validation set from {0.001, 0.05, 0.01, 0.015}. In addition, we use the PKUSEG tool (Luo et al., 2019) for Chinese word segmentation.

Comparison with FastText and deep models
The results of our experiments are displayed in Table 1. Our proposed HyperText model outperforms FastText on eight out of ten datasets, and the accuracy of HyperText is 0.7% higher than FastText on average. In addition, we observe that HyperText works significantly better than FastText on datasets with more label categories, such as Yah.A., TNEWS and IFLYTEK. This arguably confirms our hypothesis that HyperText can better model the hierarchical relationships of the underlying data and extract more discriminative features for classification. Moreover, HyperText outperforms DistilBERT (Sanh et al., 2019) and FastBERT (Liu et al., 2020), two distilled versions of BERT, and achieves performance comparable to the very deep convolutional network (VDCNN) (Conneau et al., 2016), which consists of 29 convolutional layers. Overall, HyperText attains better or comparable classification accuracy relative to these deep models while requiring several orders of magnitude less computation.
Embedding Dimension Since the input embeddings account for more than 90% of the model parameters, we investigate the impact of the input embedding dimension on classification accuracy.
The experimental results are presented in Figure 2. As we can see, on most tasks HyperText performs consistently better than FastText across various dimension settings. In particular, on the IFLYTEK and TNEWS datasets, HyperText with 50-dimensional embeddings outperforms FastText with 300- and 200-dimensional embeddings, respectively. On the other eight, less challenging, datasets, the experiments are conducted in low-dimensional settings, and HyperText generally requires fewer dimensions to achieve optimal performance. This verifies that, thanks to its ability to capture the internal structure of text, the hyperbolic model is more parameter-efficient than its Euclidean counterpart.
Computation in Inference FastText is well known for its fast inference. We compare FLOPs versus accuracy under different dimensions in Figure 3. Due to the additional nonlinear operations, HyperText generally requires more (4.5∼6.7x) computation than FastText at the same dimension. But since HyperText is more parameter-efficient, when constrained to the same level of FLOPs, HyperText mostly outperforms FastText in classification accuracy. Besides, the FLOPs of VDCNN are roughly $10^5$ times those of HyperText and FastText.
Ablation study We conduct an ablation study to determine the contribution of the different layers. The results on several datasets are presented in Table 3. Note that whenever we replace a hyperbolic

Related Work
Hyperbolic space can be regarded as a continuous version of a tree, which makes it a natural choice for representing hierarchical data (Nickel and Kiela, 2017, 2018; Sa et al., 2018). Hyperbolic geometry has also been applied to learning knowledge graph embeddings: Chami et al. (2019, 2020) use hyperbolic rotations and reflections to better model the rich variety of relations in knowledge graphs. Specifically, the authors use hyperbolic rotations to capture anti-symmetric relations and hyperbolic reflections to capture symmetric relations, and combine these operations via an attention mechanism, achieving significant improvements in low dimensions.
Hyperbolic geometry has also been applied to natural language data to exploit the latent hierarchies in word sequences (Tifrea et al., 2019). Recently, many hyperbolic-geometry-based deep neural networks (Gulcehre et al., 2018; Ganea et al., 2018) have achieved promising results, especially when the number of parameters is limited. There are further applications of hyperbolic geometry, such as question answering (Tay et al., 2018), recommendation systems (Chamberlain et al., 2019) and image embedding (Khrulkov et al., 2020).

Conclusion
We have shown that hyperbolic geometry can endow shallow neural networks with the ability to capture latent hierarchies in natural language. The empirical results indicate that HyperText consistently outperforms FastText on a variety of text classification tasks. Moreover, HyperText requires far fewer parameters to achieve performance on par with FastText, which suggests that neural networks in hyperbolic space have stronger representation capacity.

Figure 1: The architecture comparison of FastText (upper) and HyperText (lower).

Figure 3: FLOPs (×10³) vs accuracy (%) under different dimensions. The x-axis represents FLOPs, the y-axis accuracy; different points correspond to different embedding dimensions.

Table 1: Results of different models. *The results of DistilBERT are cited from Liu et al. (2020).

Figure 2: Accuracy vs embedding dimension. The x-axis represents the embedding dimension, and the y-axis the accuracy.