Joint Embedding of Words and Labels for Text Classification

Word embeddings are effective intermediate representations for capturing semantic regularities between words when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space as the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings, and enjoys a built-in ability to leverage alternative sources of information, in addition to input text sequences. Extensive results on several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.


Introduction
Text classification is a fundamental problem in natural language processing (NLP). The task is to annotate a given text sequence with one (or multiple) class label(s) describing its textual content. A key intermediate step is the text representation. Traditional methods represent text with hand-crafted features, such as sparse lexical features (e.g., n-grams) (Wang and Manning, 2012). Recently, neural models have been employed to learn text representations, including convolutional neural networks (CNNs) (Kalchbrenner et al., 2014; Zhang et al., 2017b; Shen et al., 2017) and recurrent neural networks (RNNs) based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Wang et al., 2018).
To further increase the representation flexibility of such models, attention mechanisms (Bahdanau et al., 2015) have been introduced as an integral part of models employed for text classification (Yang et al., 2016). The attention module is trained to capture the dependencies that make significant contributions to the task, regardless of the distance between the elements in the sequence. It can thus provide complementary information to the distance-aware dependencies modeled by RNNs/CNNs. The increased representation power of the attention mechanism comes with increased model complexity.
Alternatively, several recent studies show that the success of deep learning on text classification largely depends on the effectiveness of the word embeddings (Joulin et al., 2016; Wieting et al., 2016; Arora et al., 2017; Shen et al., 2018a). Particularly, Shen et al. (2018a) quantitatively show that word-embeddings-based text classification tasks can have a similar level of difficulty regardless of the employed models, using the concept of intrinsic dimension (Li et al., 2018). Thus, simple models are preferred. As the basic building blocks in neural-based NLP, word embeddings capture the similarities/regularities between words (Mikolov et al., 2013; Pennington et al., 2014). This idea has been extended to compute embeddings that capture the semantics of word sequences (e.g., phrases, sentences, paragraphs and documents) (Le and Mikolov, 2014; Kiros et al., 2015). These representations are built upon various types of compositions of word vectors, ranging from simple averaging to sophisticated architectures. Further, they suggest that simple models are efficient and interpretable, and have the potential to outperform sophisticated deep neural models.
It is therefore desirable to leverage the best of both lines of work: learning text representations that capture the dependencies that make significant contributions to the task, while maintaining low computational cost. For the task of text classification, labels play a central role in the final performance. A natural question to ask is how we can directly use label information in constructing the text-sequence representations.

Our Contribution
Our primary contribution is therefore to propose such a solution by making use of the label embedding framework: we propose the Label-Embedding Attentive Model (LEAM) to improve text classification. While there is an abundant literature in the NLP community on word embeddings (how to describe a word) for text representations, much less work has been devoted to label embeddings (how to describe a class). The proposed LEAM is implemented by jointly embedding the words and labels in the same latent space, and the text representations are constructed directly using the text-label compatibility.
Our label embedding framework has the following salutary properties: (i) Label-attentive text representation is informative for the downstream classification task, as it directly learns from a shared joint space, whereas traditional methods proceed in multiple steps by solving intermediate problems. (ii) The LEAM learning procedure only involves a series of basic algebraic operations, and hence it retains the interpretability of simple models, especially when the label description is available. (iii) Our attention mechanism (derived from the text-label compatibility) has fewer parameters and less computation than related methods, and thus is much cheaper in both training and testing, compared with sophisticated deep attention models. (iv) We perform extensive experiments on several text-classification tasks, demonstrating the effectiveness of our label-embedding attentive model, providing state-of-the-art results on benchmark datasets. (v) We further apply LEAM to predict medical codes from clinical text. As an interesting by-product, our attentive model can highlight the informative key words for prediction, which in practice can reduce a doctor's burden when reading clinical notes.

Related Work
Label embedding has been shown to be effective in various domains and tasks. In computer vision, there has been a vast amount of research on leveraging label embeddings for image classification (Akata et al., 2016), multimodal learning between images and text (Frome et al., 2013; Kiros et al., 2014), and text recognition in images (Rodriguez-Serrano et al., 2013). It is particularly successful on the task of zero-shot learning (Palatucci et al., 2009; Yogatama et al., 2015; Ma et al., 2016), where the label correlation captured in the embedding space can improve the prediction when some classes are unseen. In NLP, label embedding for text classification has been studied in the context of heterogeneous networks (Tang et al., 2015) and multitask learning (Zhang et al., 2017a), respectively. To the authors' knowledge, there is little research on investigating the effectiveness of label embeddings for designing efficient attention models, and how to jointly embed words and labels to make full use of label information for text classification has not been studied previously, representing a contribution of this paper.
For text representation, the current best-performing models usually consist of an encoder and a decoder connected through an attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2015), with successful applications to sentiment classification (Zhou et al., 2016), sentence pair modeling (Yin et al., 2016) and sentence summarization (Rush et al., 2015). Based on this success, more advanced attention models have been developed, including hierarchical attention networks (Yang et al., 2016), attention over attention (Cui et al., 2016), and multi-step attention (Gehring et al., 2017). The idea of attention is motivated by the observation that different words in the same context are differentially informative, and the same word may be differentially important in a different context. The realization of "context" varies across applications and model architectures. Typically, the context is chosen as the target task, and the attention is computed over the hidden layers of a CNN/RNN. Our attention model is directly built in the joint embedding space of words and labels, and the context is specified by the label embedding.
Several recent works (Vaswani et al., 2017; Shen et al., 2018b,c) have demonstrated that simple attention architectures alone can achieve state-of-the-art performance with less computational time, dispensing with recurrence and convolutions entirely. Our work is in the same direction, sharing the similar spirit of retaining model simplicity and interpretability. The major difference is that the aforementioned works focus on self-attention, which applies attention to each pair of word tokens from the text sequence. In this paper, we investigate the attention between words and labels, which is more directly related to the target task. Furthermore, the proposed LEAM has far fewer model parameters.

Preliminaries
Throughout this paper, we denote vectors as bold, lower-case letters, and matrices as bold, upper-case letters. We use ⊘ for element-wise division when applied to vectors or matrices. We use ◦ for function composition, and ∆_p for the set of one-hot vectors in dimension p.
Given a training set S = {(X_n, y_n)}_{n=1}^N of pair-wise data, where X ∈ X is the text sequence and y ∈ Y is its corresponding label. Specifically, y is a one-hot vector in the single-label problem and a binary vector in the multi-label problem, as defined later in Section 4.1. Our goal for text classification is to learn a function f : X → Y by minimizing an empirical risk of the form:

min_{f ∈ F} (1/N) Σ_{n=1}^{N} δ(y_n, f(X_n)),

where δ : Y × Y → R measures the loss incurred from predicting f(X) when the true label is y, and f belongs to the functional space F. In the evaluation stage, we use the 0/1 loss as the target loss: δ(y, z) = 0 if y = z, and 1 otherwise.
In the training stage, we consider surrogate losses commonly used for structured prediction in different problem setups (see Section 4.1 for details on the surrogate losses used in this paper). More specifically, an input sequence X of length L is composed of word tokens: X = {x_1, ..., x_L}. Each token x_l is a one-hot vector in the space ∆_D, where D is the dictionary size. Performing learning in ∆_D is computationally expensive and difficult. An elegant framework in NLP, initially proposed in (Mikolov et al., 2013; Le and Mikolov, 2014; Pennington et al., 2014; Kiros et al., 2015), allows learning to be performed concisely by mapping the words into an embedding space. The framework relies on a so-called word embedding: ∆_D → R^P, where P is the dimensionality of the embedding space. Therefore, the text sequence X is represented via the respective word embedding for each token: V = [v_1, ..., v_L], where v_l is the embedding of token x_l. A typical text classification method proceeds in three steps, end-to-end, by considering the function decomposition shown in Figure 1(a):
• f_0 : X → V, the text sequence is represented in its word-embedding form V, which is a matrix of P × L.
• f_1 : V → z, a compositional function f_1 aggregates word embeddings into a fixed-length vector representation z.
• f_2 : z → y, a classifier f_2 annotates the text representation z with a label.
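The three-step decomposition above can be sketched with plain NumPy. This is an illustrative toy (random embeddings, mean pooling for f_1, and a linear softmax classifier for f_2), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, L, K = 1000, 300, 12, 4   # vocab size, embedding dim, sequence length, classes

# f0: map tokens to word embeddings via a lookup table E of shape D x P
E = rng.normal(size=(D, P))
tokens = rng.integers(0, D, size=L)   # token ids stand in for one-hot vectors
V = E[tokens]                          # L x P word-embedding matrix

# f1: aggregate word embeddings into a fixed-length vector (mean pooling here)
z = V.mean(axis=0)                     # length-P sequence representation

# f2: linear classifier followed by a softmax over K labels
W2, b2 = rng.normal(size=(P, K)) * 0.01, np.zeros(K)
logits = z @ W2 + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # predicted class distribution
```

Replacing the mean pooling with the label-attentive weighting described in Section 3 is exactly the change LEAM makes to this pipeline.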
A vast amount of work has been devoted to devising the proper functions f_0 and f_1, i.e., how to represent a word or a word sequence, respectively. The success of NLP largely depends on the effectiveness of the word embeddings in f_0 (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014). They are often pre-trained offline on large corpora, then refined jointly via f_1 and f_2 for task-specific representations. Furthermore, the design of f_1 can be broadly cast into two categories. Popular deep learning models consider the mapping as a "black box," and have employed sophisticated CNN/RNN architectures to achieve state-of-the-art performance (Zhang et al., 2015; Yang et al., 2016). On the contrary, recent studies show that simple manipulations of the word embeddings, e.g., mean or max-pooling, can also provide surprisingly excellent performance (Joulin et al., 2016; Wieting et al., 2016; Arora et al., 2017; Shen et al., 2018a). Nevertheless, these methods only leverage the information from the input text sequence.

Model
By examining the three steps in the traditional pipeline of text classification, we note that the use of label information only occurs in the last step, when learning f_2, and its impact on learning the representations of words in f_0 or word sequences in f_1 is ignored or indirect. Hence, we propose a new pipeline by incorporating label information in every step, as shown in Figure 1(b):
• f_0: Besides embedding words, we also embed all the labels in the same space, which act as the "anchor points" of the classes to influence the refinement of word embeddings.
• f_1: The compositional function aggregates word embeddings into z, weighted by the compatibility between labels and words.
• f_2: The learning of f_2 remains the same, as it directly interacts with labels.
Under the proposed label embedding framework, we specifically describe a label-embedding attentive model.

Joint Embeddings of Words and Labels
We propose to embed both the words and the labels into a joint space, i.e., ∆_D → R^P and Y → R^P. The label embeddings are C = [c_1, ..., c_K], where K is the number of classes.
A simple way to measure the compatibility of label-word pairs is via cosine similarity:

G = (C^T V) ⊘ Ĝ,   (2)

where Ĝ is the normalization matrix of size K × L, with each element obtained as the product of the ℓ2 norms of the c-th label embedding and l-th word embedding: ĝ_kl = ‖c_k‖ ‖v_l‖.
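This compatibility matrix can be sketched in a few lines of NumPy; C and V below are random stand-ins for learned label and word embeddings, stored one column per class/token:

```python
import numpy as np

rng = np.random.default_rng(1)
P, L, K = 300, 12, 4
V = rng.normal(size=(P, L))   # word embeddings, one column per token
C = rng.normal(size=(P, K))   # label embeddings, one column per class

# Raw inner products between every label and every token: K x L
raw = C.T @ V

# Ghat holds products of the embedding norms, so G is cosine similarity
Ghat = np.linalg.norm(C, axis=0)[:, None] * np.linalg.norm(V, axis=0)[None, :]
G = raw / Ghat                # K x L, every entry in [-1, 1]
```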
To further capture the relative spatial information among consecutive words (i.e., phrases) and introduce non-linearity in the compatibility measure, we consider a generalization of (2). Specifically, for a text phrase of length 2r + 1 centered at l, the local matrix block G_{l−r:l+r} in G measures the label-to-token compatibility for the "label-phrase" pairs. To learn a higher-level compatibility score u_l between the l-th phrase and all labels, we have:

u_l = ReLU(G_{l−r:l+r} W_1 + b_1),

where W_1 ∈ R^{2r+1} and b_1 ∈ R^K are parameters to be learned, and u_l ∈ R^K. The largest compatibility value of the l-th phrase wrt the labels is collected via max-pooling: m_l = max(u_l). Together, m is a vector of length L. The compatibility/attention score for the entire text sequence is:

β = SoftMax(m),

where the l-th element of the SoftMax is β_l = exp(m_l) / Σ_{l'} exp(m_{l'}).
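A minimal NumPy sketch of this phrase-level attention follows, assuming ReLU as the non-linearity and zero-padding at the sequence boundaries; the helper name `leam_attention` and the padding choice are ours, not from the paper:

```python
import numpy as np

def leam_attention(G, W1, b1):
    """Phrase-level compatibility -> attention score beta (hypothetical helper)."""
    K, L = G.shape
    r = (W1.shape[0] - 1) // 2
    Gpad = np.pad(G, ((0, 0), (r, r)))           # zero-pad so every position has a window
    # u_l = ReLU(G_{l-r:l+r} W1 + b1): a 1-D convolution over token positions
    U = np.stack([np.maximum(Gpad[:, l:l + 2 * r + 1] @ W1 + b1, 0.0)
                  for l in range(L)], axis=1)    # K x L matrix of u_l columns
    m = U.max(axis=0)                            # largest compatibility per phrase, length L
    e = np.exp(m - m.max())
    return e / e.sum()                           # SoftMax over token positions

rng = np.random.default_rng(2)
K, L, r = 4, 12, 2
G = rng.uniform(-1, 1, size=(K, L))              # cosine compatibilities
beta = leam_attention(G, rng.normal(size=2 * r + 1), np.zeros(K))
```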
The text sequence representation can then be simply obtained via averaging the word embeddings, weighted by the label-based attention score:

z = Σ_{l=1}^{L} β_l v_l.

Relation to Predictive Text Embeddings Predictive Text Embeddings (PTE) (Tang et al., 2015) is the first method to leverage label embeddings to improve the learned word embeddings. We discuss three major differences between PTE and our LEAM.

Training Objective The proposed joint embedding framework is applicable to various text classification tasks. We consider two setups in this paper. For a learned text sequence representation z, the prediction is f_2(z), where f_2 is defined according to the specific task:
• Single-label problem: categorizes each text instance to precisely one of K classes, y ∈ ∆_K. The training objective is

min_θ (1/N) Σ_{n=1}^{N} CE(y_n, f_2(z_n)),   (7)

where CE(·, ·) is the cross entropy between two probability vectors.
• Multi-label problem: categorizes each text instance to a set of K target labels {y_k ∈ ∆_2 | k = 1, ..., K}; there is no constraint on how many of the classes the instance can be assigned to. The training objective averages the binary cross entropy over the K labels:

min_θ (1/NK) Σ_{n=1}^{N} Σ_{k=1}^{K} CE(y_nk, z_nk),   (8)

where z_nk is the k-th column of z_n.
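The attended representation and the two training objectives can be illustrated as follows; the attention weights, targets, and weight initialization here are toy values, and the losses are written out explicitly rather than taken from a library:

```python
import numpy as np

rng = np.random.default_rng(3)
P, L, K = 300, 12, 4
V = rng.normal(size=(P, L))            # word embeddings, one column per token
beta = np.full(L, 1.0 / L)             # attention scores (uniform here for brevity)
z = V @ beta                           # attended sequence representation, length P

W2, b2 = rng.normal(size=(K, P)) * 0.01, np.zeros(K)
logits = W2 @ z + b2

# Single-label setup: softmax + cross entropy against a one-hot target
y = np.eye(K)[1]
p = np.exp(logits - logits.max())
p /= p.sum()
ce = -np.sum(y * np.log(p + 1e-12))

# Multi-label setup: independent sigmoids + binary cross entropy over the K labels
y_multi = np.array([1, 0, 1, 0], dtype=float)
s = 1.0 / (1.0 + np.exp(-logits))
bce = -np.mean(y_multi * np.log(s + 1e-12)
               + (1 - y_multi) * np.log(1 - s + 1e-12))
```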
To summarize, the model parameters θ = {V, C, W_1, b_1, W_2, b_2} are trained end-to-end during learning. {W_1, b_1} and {W_2, b_2} are weights in f_1 and f_2, respectively, and are treated as standard neural network parameters. For the joint embeddings {V, C} in f_0, pre-trained word embeddings are used as initialization if available.

Learning & Testing with LEAM
Learning and Regularization The quality of the jointly learned embeddings is key to the model performance and interpretability. Ideally, we hope that each label embedding acts as the "anchor" point for its class: closer to the word/sequence representations in the same class, and farther from those in different classes. To best achieve this property, we regularize each learned label embedding c_k to be on its corresponding manifold. This is imposed by requiring that c_k be easily classified as the correct label y_k, via the penalty

(1/K) Σ_{k=1}^{K} CE(y_k, f_2(c_k)),

where f_2 is specified according to the problem in either (7) or (8). This regularization is added as a penalty to the main training objective in (7) or (8), and the default weighting hyperparameter is set to 1. It leads to meaningful interpretability of the learned label embeddings, as shown in the experiments.
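A sketch of this regularizer, assuming the single-label form of f_2 (linear softmax classifier) and cross entropy against each label's own one-hot target; the function name is hypothetical:

```python
import numpy as np

def label_reg_penalty(C, W2, b2):
    """Average cross-entropy loss from classifying each label embedding c_k
    as its own class k (a sketch of the regularizer; names hypothetical)."""
    K = C.shape[1]
    total = 0.0
    for k in range(K):
        logits = W2 @ C[:, k] + b2
        p = np.exp(logits - logits.max())
        p /= p.sum()
        total += -np.log(p[k] + 1e-12)   # CE against the one-hot target for class k
    return total / K

rng = np.random.default_rng(4)
P, K = 300, 4
C = rng.normal(size=(P, K))              # label embeddings, one column per class
penalty = label_reg_penalty(C, rng.normal(size=(K, P)) * 0.01, np.zeros(K))
# The penalty is added to the main objective with weight 1 by default.
```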
Interestingly, in text classification the class itself is often described as a set of E words {e_i, i = 1, ..., E}. These words are considered the most representative description of each class, and are highly distinctive across classes. For example, the Yahoo! Answers Topic dataset (Zhang et al., 2015) contains ten classes, most of which have two words that precisely describe their class-specific features, such as "Computers & Internet", "Business & Finance" and "Politics & Government". We use each label's corresponding pre-trained word embeddings as the initialization of the label embeddings. For datasets without representative class descriptions, one may initialize the label embeddings as random samples drawn from a standard Gaussian distribution.
Testing Both the learned word and label embeddings are available in the testing stage. We clarify that the label embeddings C of all class candidates Y are considered as the input in the testing stage; one should distinguish this from the use of the ground-truth label y in prediction. For a text sequence X, one may feed it through the proposed pipeline for prediction: (i) f_0: harvesting the word embeddings V; (ii) f_1: V interacts with C to obtain G, pooled as β, which further attends V to derive z; and (iii) f_2: assigning labels based on the task. To speed up testing, one may compute and store G offline, and avoid its online computational cost.

Model Complexity
We compare CNN, LSTM, Simple Word Embeddings-based Models (SWEM) (Shen et al., 2018a) and our LEAM wrt the number of parameters and computational speed. For the CNN, we assume the same size m for all filters. Specifically, h represents the dimension of the hidden units in the LSTM or the number of filters in the CNN; R denotes the number of blocks in Bi-BloSAN; P denotes the final sequence representation dimension. Similar to (Vaswani et al., 2017; Shen et al., 2018a), we examine the number of compositional parameters, computational complexity and sequential steps of the four methods.
As shown in Table 1, both the CNN and LSTM have a large number of compositional parameters. Since K ≪ m, h, the number of parameters in our model is much smaller than for the CNN and LSTM models. For computational complexity, our model is of almost the same order as the simplest SWEM model, and is smaller than the CNN or LSTM by a factor of mh/K or h/K, respectively.

Experimental Results
Setup We use 300-dimensional GloVe word embeddings (Pennington et al., 2014) as initialization for the word embeddings and label embeddings in our model. Out-of-vocabulary (OOV) words are initialized from a uniform distribution with range [−0.01, 0.01]. The final classifier is implemented as an MLP layer followed by a sigmoid or softmax function, depending on the specific task. We train our model's parameters with the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.001 and a minibatch size of 100. Dropout regularization (Srivastava et al., 2014) is employed on the final MLP layer, with dropout rate 0.5. The model is implemented using TensorFlow and is trained on a Titan X GPU.

Classification on Benchmark Datasets
We test our model on the same five standard benchmark datasets as in (Zhang et al., 2015). The summary statistics of the data are shown in Table 2, with content specified below:
• AGNews: Topic classification over four categories of Internet news articles (Del Corso et al., 2005), composed of titles plus descriptions, classified into: World, Entertainment, Sports and Business.
• Yelp Review Full: The dataset is obtained from the Yelp Dataset Challenge in 2015; the task is sentiment classification with polarity star labels ranging from 1 to 5.
• Yelp Review Polarity: The same set of text reviews from the Yelp Dataset Challenge in 2015, except that a coarser sentiment definition is considered: stars 1 and 2 are negative, and 4 and 5 are positive.
• Yahoo! Answers Topic: Topic classification over the ten largest main categories from Yahoo! Answers Comprehensive Questions and Answers version 1.0, including question title, question content and best answer.
Hyper-parameter Our method has one additional hyperparameter, the window size r that defines the length of the "phrase" used to construct the attention. A larger r captures long-term dependencies, while a smaller r enforces local dependency. We study its impact in Figure 2(c). The topic classification tasks generally require a larger r, while the sentiment classification tasks allow a relatively smaller r. One may safely choose r around 50 if not fine-tuning. We report the optimal results in Table 3.

Representational Ability
Label embeddings are highly meaningful To provide insight into the meaningfulness of the learned representations, we visualize the learned label embeddings in Figure 3.
Interpretability of attention Our attention score β can be used to highlight the most informative words wrt the downstream prediction task. We visualize two examples in Figure 4(a) for the Yahoo dataset. Darker yellow indicates more important words. The first text sequence is on the topic "Sports", and the second is on "Entertainment". The attention score correctly detects the key words with proper scores.

Applications to Clinical Text
To demonstrate the practical value of label embeddings, we apply LEAM for a real health care scenario: medical code prediction on the Electronic Health Records dataset.A given patient may have multiple diagnoses, and thus multi-label learning is required.
The MIMIC-III dataset contains text and structured records from a hospital intensive care unit. Each record includes a variety of narrative notes describing a patient's stay, including diagnoses and procedures. They are accompanied by a set of metadata codes from the International Classification of Diseases (ICD), which provide a standardized way of indicating diagnoses and procedures. To compare with previous work, we follow (Shi et al., 2017; Mullenbach et al., 2018), and preprocess a dataset consisting of the 50 most common labels. This results in 8,067 documents for training, 1,574 for validation, and 1,730 for testing.
To quantify the prediction performance, we follow (Mullenbach et al., 2018) and consider the micro-averaged and macro-averaged F1 and area under the ROC curve (AUC), as well as the precision at n (P@n). Micro-averaged values are calculated by treating each (text, code) pair as a separate prediction. Macro-averaged values are calculated by averaging metrics computed per label. P@n is the fraction of the n highest-scored labels that are present in the ground truth.
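P@n as described can be computed in a few lines; the scores and ground-truth vector below are made-up values for illustration:

```python
import numpy as np

def precision_at_n(scores, truth, n):
    """Fraction of the n highest-scored labels present in the ground truth."""
    top = np.argsort(scores)[::-1][:n]   # indices of the n highest scores
    return truth[top].mean()

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])   # predicted per-label scores
truth = np.array([1, 0, 0, 1, 1])              # binary ground-truth code indicators
p_at_3 = precision_at_n(scores, truth, 3)      # top-3 = labels 0, 2, 4 -> 2 of 3 correct
```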
The results are shown in Table 5. LEAM provides the best AUC score, and better F1 and P@5 values than all methods except the CNN. The CNN consistently outperforms the basic Bi-GRU architecture, and the logistic regression baseline performs worse than all deep learning architectures. We emphasize that the learned attention can be very useful for reducing a doctor's reading burden. As shown in Figure 4(b), the health-related words are highlighted.

Conclusions
In this work, we first investigate label embeddings for text representations, and propose the label-embedding attentive model. It embeds the words and labels in the same joint space, and measures the compatibility of word-label pairs to attend over the document representations. The learning framework is tested on several large standard datasets and a real clinical text application. Compared with previous methods, our LEAM algorithm requires much lower computational cost, and achieves comparable if not better performance relative to the state-of-the-art. The learned attention is highly interpretable: it highlights the most informative words in the text sequence for the downstream classification task.

Figure 1: Illustration of different schemes for document representations z. (a) Much work in NLP has been devoted to directly aggregating word embeddings V for z. (b) We focus on learning the label embedding C (how to embed class labels in a Euclidean space), and leveraging the "compatibility" G between embedded words and labels to derive the attention score β for an improved z. Note that ⊗ denotes the cosine similarity between C and V. In this figure, there are K = 2 classes.

Figure 2: Comprehensive study of LEAM, including convergence speed, performance vs. proportion of labeled data, and impact of the hyper-parameter r.
Specifically, we note that the text embedding in PTE is similar to a very special case of LEAM, in which our window size is r = 1 and the attention score β is uniform. As shown later in Figure 2(c) of the experimental results, LEAM can be significantly better than this PTE variant.

Table 1 :
Comparisons of CNN, LSTM, SWEM and our model architecture. Columns correspond to the number of compositional parameters, computational complexity and sequential operations. The code to reproduce the experimental results is at https://github.com/guoyinwang/LEAM

Table 2 :
Summary statistics of five datasets, including the number of classes, number of training samples and number of testing samples.

Table 3 :
Test Accuracy on document classification tasks, in percentage.We ran Bi-BloSAN using the authors' implementation; all other results are directly cited from the respective papers.

Table 4 :
Comparison of model size and speed.