Text Emotion Distribution Learning from Small Sample: A Meta-Learning Approach

Text emotion distribution learning (EDL) aims to develop models that can predict the intensity values of a sentence across a set of emotion categories. Existing methods based on supervised learning require a large amount of well-labelled training data, which is difficult to obtain due to inconsistent perception of fine-grained emotion intensity. In this paper, we propose a meta-learning approach to learn text emotion distributions from a small sample. Specifically, we propose to learn low-rank sentence embeddings by tensor decomposition to capture their contextual semantic similarity, and use K-nearest neighbors (KNNs) of each sentence in the embedding space to generate sample clusters. We then train a meta-learner that can adapt to new data with only a few training samples on the clusters, and further fit the meta-learner on KNNs of a testing sample for EDL. In this way, we effectively augment the learning ability of a model on the small sample. To demonstrate the performance, we compare the proposed approach with state-of-the-art EDL methods on a widely used EDL dataset: SemEval 2007 Task 14 (Strapparava and Mihalcea, 2007). Results show the superiority of our method on small-sample emotion distribution learning.


Introduction
Analyzing emotions in text automatically is an important topic (Yadollahi et al., 2017) with widely used applications such as classifying e-commerce product reviews (Rao et al., 2016) and developing emotionally intelligent chatbots for healthcare (Fadhil and Gabrielli, 2017), to name a few. Text emotion analysis aims to recognize writers' emotional states towards particular topics or subjects (Yadollahi et al., 2017), and has attracted considerable research effort in the last few decades (Canales and Martínez-Barco, 2014; Abdul-Mageed and Ungar, 2017; Yu et al., 2018; Zhang et al., 2018).
Existing works cast text emotion analysis into three types of tasks: single label learning (SLL), multi-label learning (MLL), and label distribution learning (LDL). In SLL, one particular emotion category (referred to as a label in the following) is predicted (Abdul-Mageed and Ungar, 2017), e.g., joyful, angry, or sad. However, due to the correlation among different emotions, one sentence can potentially contain multiple different emotions (Yu et al., 2018). Therefore, a more practical way is to assign multiple labels to a sentence (Yu et al., 2018), namely, MLL. Label distribution learning, which is called emotion distribution learning (EDL) in emotion mining (Zhou et al., 2016), goes a step further and assigns an intensity value to each emotion, which is necessary to encode fine-grained information, e.g., for comparing the strength of different emotions.
In general, there are three approaches to text emotion analysis: lexicon-based (Staiano and Guerini, 2014), learning-based (Zhang et al., 2018), and a combination of both (Mudinas et al., 2012). In the lexicon-based case, one collects an emotion lexicon corpus, e.g., WordNet-Affect (Strapparava and Valitutti, 2004) or SentiWordNet (Esuli and Sebastiani, 2006), and applies different counting methods to aggregate the occurrences of words associated with various emotions (Staiano and Guerini, 2014). Learning-based methods, in contrast, mostly frame emotion analysis as a supervised learning problem, which requires text data annotated with emotion labels and the extraction of proper sentence features. Learning-based approaches alone, or combined with lexicon-based methods, usually produce state-of-the-art (SOTA) performance (Mudinas et al., 2012; Zhang et al., 2018), and thus are widely used nowadays.
However, learning-based methods usually demand a large amount of annotated data to train models, which has become one of the performance bottlenecks. EDL in particular aims to decode the fine-grained composition and magnitude of emotions in text, the human perception of which can be highly subtle and personal (Volkova et al., 2010). It is difficult, if not impossible, to collect a large-scale emotion distribution dataset with consistent, clean human labels. Therefore, developing techniques to learn from a small sample is critical for the practicality of emotion distribution analysis.
In this paper, based on meta-learning (Vilalta and Drissi, 2002; Finn et al., 2017), we propose an efficient approach to learn text emotion distributions from a small sample. To make the most of a small labeled dataset, we propose to use the K-nearest semantically similar neighbors (KNNs) of each training sample to cluster the training data, and train on the clusters a meta-learner that can adapt to new testing data with only a few samples. We can then fit the meta-learner on the KNNs of each testing sample. Learning the semantic similarity of sentences usually requires a large amount of data (Le and Mikolov, 2014). We propose to learn low-rank embeddings of sentences by tensor decomposition to capture their contextual semantic similarity, which works well regardless of the size of documents (Hosseinimotlagh and Papalexakis, 2018). We evaluate the proposed approach on a widely used text emotion distribution dataset, SemEval 2007 Task 14 (Strapparava and Mihalcea, 2007), and show that it outperforms existing SOTA methods for small-sample EDL.
The contributions of this paper are: 1) we propose a novel meta-learning framework to learn text emotion distribution from a small sample; 2) we propose to learn low-rank embeddings of sentences by tensor decomposition to find semantically similar neighbors for training and adapting a meta-learner.

Related Works
We briefly review three related areas that motivate this work, including emotion distribution learning, text representation learning, and learning from a small sample.

Emotion Distribution Learning
Emotion distribution learning of text (Zhou et al., 2016; Zhang et al., 2018) is a recently proposed task that tries to predict the intensity values of a sentence across a set of emotion categories. Such information is important for understanding fine-grained emotion information (Zhou et al., 2016; Zhang et al., 2018). For example, one sentence can usually invoke multiple emotional states at different levels, and existing SLL and MLL approaches are inadequate for capturing such multi-label, multi-intensity information; LDL is more suitable for such scenarios. In general, LDL methods can be classified into three categories: problem transformation, algorithm adaptation, and specialized algorithms (Geng and Ji, 2013). PT-Bayes and PT-SVM are typical problem transformation methods that transform the LDL problem into an SLL problem, and use a Bayes classifier and an SVM to predict label distributions, respectively. AA-KNN and AA-BP are algorithm adaptation methods that extend K-nearest neighbors (KNNs) and back-propagation (BP) neural networks. SA-LDSVR, SA-IIS, SA-BFGS, and SA-CPNN are specialized LDL algorithms that directly parametrize and optimize LDL objectives (Geng and Ji, 2013). In terms of EDL, two different models have been proposed recently, a maximum entropy model (Zhou et al., 2016) and a convolutional neural network model (Zhang et al., 2018), and the latter is the SOTA.

Text Representation Learning
Learning numerical representations of text is usually the first step for learning-based emotion analysis, and such representations can be categorized at two levels: word/phrase and sentence/document. At the word/phrase level, for example, Mikolov et al. (2013) propose word2vec, which uses a three-layer neural network, i.e., an input layer, a projection layer, and an output layer, to learn word vectors from large corpora in an unsupervised manner. The projection layer maps the one-hot encoded input to a low-dimensional vector, which is used to predict a word in the output layer. The learned word vectors of the projection layer are shown to be an effective word representation. At the sentence/document level, similar to word2vec, doc2vec (Le and Mikolov, 2014) treats sentences and context words in the same way in the input layer, and uses a three-layer neural network to learn the vector representation of sentences. Similarly, the skip-thought vector method (Kiros et al., 2015) uses sentences directly for prediction tasks, and ignores the context words used in doc2vec. Recent works show that using the attention mechanism, the transformer in particular (Vaswani et al., 2017), is an effective way to learn universal sentence representations from unlabeled text, such as BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). There are other ways to improve the discriminative ability of the learned features through supervised learning, such as task transfer methods like InferSent (Conneau et al., 2017).

Learning from A Small Sample
The success of machine learning models, especially deep learning models, heavily depends on a large amount of labeled data. But acquiring training data is usually difficult and expensive (Johnson et al., 2018). Therefore, considerable research effort on small-sample learning has emerged recently (Lake et al., 2015; Shu et al., 2018). There are generally two approaches to learning from a small sample (Shu et al., 2018): concept learning and experience learning. Concept learning means analyzing the structure of example samples by imitating human abilities such as imagination, synthesis, and analysis (Lake et al., 2015; Shu et al., 2018). Experience learning means augmenting or transferring learning experience from other data or models, including data augmentation, model fine-tuning, model compression, and meta-learning (Shu et al., 2018). Meta-learning in particular is an effective way of augmenting the experience learning ability (Thrun and Pratt, 1998; Vilalta and Drissi, 2002; Finn et al., 2017).

Meta-Learning
Meta-learning aims to increase learning performance through experience sharing among tasks (Thrun and Pratt, 1998; Vilalta and Drissi, 2002; Finn et al., 2017). One kind of meta-learning is to train a meta-learner to update the parameters or rules of the learning models (Bengio et al., 1991). However, the complexity of this approach is usually high (Bengio et al., 1991; Finn et al., 2017). Another approach is to train a memory neural network, such as a recurrent neural network (Santoro et al., 2016), to keep a record of the different experiences among different tasks. In particular, Finn et al. (2017) find that it is possible to optimize a learner to adapt to new tasks quickly by gradient descent alone; in this approach, no additional parameters are needed. Recently, researchers have also extended this method to unsupervised learning (Hsu et al., 2019) and online learning scenarios. Meta-learning has been used for few-shot image recognition, agent navigation (Finn et al., 2017), low-resource machine translation (Gu et al., 2018), query generation (Huang et al., 2018), and so on.
Existing EDL methods are designed with traditional machine learning paradigms (Geng and Ji, 2013; Zhou et al., 2016; Zhang et al., 2018), which may not work well on a small sample. Due to the difficulty of annotating emotion distribution data, developing techniques for small-sample EDL is thus critical for text emotion analysis. Transferring experience with pre-trained models (Shu et al., 2018) can be an effective way to boost natural language tasks, but such methods may need additional large training corpora, and fine-tuning models on specific domains is non-trivial (Le and Mikolov, 2014). In this paper, we are interested in how to use only the labeled data for small-sample EDL (Shu et al., 2018).

Emotion Distribution Learning (EDL)
We denote one sample as (s, y), where s is a sentence or document, y = (y_1, y_2, . . . , y_C) is its emotion distribution label, y_j denotes the intensity value of the j-th emotion, C is the number of discrete emotions, and ∑_{j=1}^{C} y_j = 1. For EDL, we expect to train a model f : S ↦ Y that can accurately map any sentence s in the semantic space S to its entailed emotion distribution y in the emotion distribution space Y.
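As a concrete illustration, the sample format above can be sketched as follows. The `EDLSample` class and the `EMOTIONS` tuple are illustrative names (the emotion inventory matches the SemEval 2007 categories used later in the paper), not part of the paper's code:

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative emotion inventory (the six SemEval 2007 categories).
EMOTIONS = ("anger", "disgust", "fear", "joy", "sadness", "surprise")

@dataclass
class EDLSample:
    sentence: str
    distribution: Tuple[float, ...]  # y = (y_1, ..., y_C), non-negative, sums to 1

    def __post_init__(self):
        # Enforce the constraints from the problem formulation.
        assert all(v >= 0 for v in self.distribution), "intensities must be non-negative"
        assert abs(sum(self.distribution) - 1.0) < 1e-6, "y must sum to 1"

sample = EDLSample("Storm floods city streets", (0.1, 0.05, 0.5, 0.0, 0.3, 0.05))
```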

Given a set of training samples D_train = {(s_i, y_i)}_{i=1}^{M}, we are interested in how to train a model f that can predict emotion distributions effectively on a set of testing samples D_test = {(s_i, y_i)}_{i=1}^{N}, where M is the number of training samples and is a small number (e.g., 50), and N is the number of testing samples.
For EDL from a small sample, we adopt the experience learning approach (Shu et al., 2018) by considering: 1) how to augment learning ability on a small number of training samples, and 2) how to learn semantic similarity on D train ∪D test so that we can use K-nearest neighbors in D train to predict the emotion distribution of a testing sample in D test .

Method
As shown in Figure 1, to augment the learning ability on a small sample, we first partition the training data by finding the K-nearest neighbors (KNNs) of each sample based on semantic similarity, and treat each set of K samples as a cluster. We then train on the clusters a meta-learner that is optimized to adapt to new data with only a few training samples. During testing, we fit the meta-learner with the KNNs of a testing sample, where the KNNs are found in the training data.

Meta-Learning Preliminary
A meta-learning algorithm consists of a set of training tasks T_train = {T_i}_{i=1}^{M}, a set of testing tasks T_test = {T_j}_{j=1}^{N}, and a model f_θ¹, where M and N are the numbers of training and testing tasks, respectively, and θ denotes the parameters of f. Each task T consists of a set of training samples called the support set S and a set of testing samples called the query set Q, namely, T = {S, Q}. The goal of meta-learning is to train a learner f_θ using T_train so that, given a new testing task T in T_test, it performs well on the query set Q of T after being fine-tuned with only a few samples from the support set S.
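The task structure T = {S, Q} can be sketched as follows; `split_task` is a hypothetical helper (not from the paper) that splits one cluster of samples evenly into a support set and a query set:

```python
import random

def split_task(cluster, seed=None):
    """Form one task T = {S, Q} from a cluster of K samples:
    half go to the support set S, the rest to the query set Q."""
    rng = random.Random(seed)
    shuffled = list(cluster)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"S": shuffled[:half], "Q": shuffled[half:]}
```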

EDL via Meta-Learning
We propose a meta-learning framework for EDL, including task generation from a small sample, meta-learner training, and meta-learner adaptation for predicting testing samples.

Task Generation
For each sample in D_train, we first find its KNNs by semantic similarity, and treat these K samples as a training task. Samples can therefore overlap across tasks, which increases the number of tasks. For each task, we randomly select K/2 samples as the support set, and use the remaining ones as the query set, i.e., the sizes of S and Q are both K/2. Therefore, for M samples in D_train, we can generate M training tasks T_train. The intuition is that semantically similar sentences are more likely to have similar emotion distribution patterns.

¹ To be consistent with the existing literature, we refer to the model as the learner in the rest of the paper.

[Algorithm 1: the overall EDL-Meta procedure — low-rank embedding and task generation, meta-learner training, and meta-learner adaptation.]

Low-rank embedding by tensor decomposition. Contextual patterns of words can be used to measure semantic similarity for emotion analysis (Staiano and Guerini, 2014; Mikolov et al., 2013). Traditional embedding approaches, e.g., doc2vec, usually require a large amount of data (Le and Mikolov, 2014). We propose to use low-rank embeddings of sentences mined by tensor decomposition, which can obtain text embeddings regardless of the corpus size (Hosseinimotlagh and Papalexakis, 2018). As shown in Figure 2, considering D = D_train ∪ D_test, we first build a vocabulary for it, namely, w_1, w_2, . . . , w_V, where V is the number of words. For each sentence s in D, we count the word-word co-occurrences within a small window H, and build a binary matrix V_s ∈ {0, 1}^{V×V}. In particular, V_s(i, j) = 1 indicates that words w_i and w_j co-occur in s within the window H at least once. In this way, we can capture the semantic patterns of a sentence in V_s.
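The binary co-occurrence matrix V_s can be built as in the following sketch. The function name is ours, and the handling of a word's co-occurrence with itself (recorded on the diagonal) is an implementation choice the paper does not specify:

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab_index, H=5):
    """Binary word-word co-occurrence matrix V_s for one tokenized sentence:
    V_s(i, j) = 1 iff words w_i and w_j co-occur within a window of size H.
    A word's co-occurrence with itself lands on the diagonal (an assumption)."""
    V = len(vocab_index)
    Vs = np.zeros((V, V), dtype=np.uint8)
    for pos, w in enumerate(tokens):
        # All tokens within H positions of `pos`, including `pos` itself.
        window = tokens[max(0, pos - H): pos + H + 1]
        for ctx in window:
            Vs[vocab_index[w], vocab_index[ctx]] = 1
    return Vs
```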
In addition, this approach can also deal with the negation issue to some extent, which is important for sentiment analysis (Reitan et al., 2015). Because negation sentences usually contain negation words, they will have a different word-word co-occurrence pattern from that of normal sentences, which results in different embeddings. Therefore, the tensor decomposition method may put sentences with and without negation words into different clusters. Afterwards, we stack all V_s as a three-dimensional tensor V ∈ {0, 1}^{V×V×D}, where D = M + N. We adopt the CANDECOMP/PARAFAC tensor decomposition method (Sidiropoulos et al., 2017) to find an approximation V̂ of V:

V̂ = ∑_{r=1}^{R} w_r ⊗ w_r ⊗ s_r,   (1)

such that the Frobenius norm ‖V − V̂‖_F is minimal, where w_r ∈ R^V, s_r ∈ R^D, R is the rank, and ⊗ is the outer product, namely, w_r ⊗ w_r ⊗ s_r is a three-dimensional tensor with w_r ⊗ w_r ⊗ s_r(i, j, k) = w_r(i) ⋅ w_r(j) ⋅ s_r(k). We use both the training and testing datasets for embedding, which follows the general practice of previous literature (Zhou et al., 2016; Zhang et al., 2018). In such a case, it is possible to infer a complete vocabulary of the corpus by making use of both the training and testing datasets. If some testing data are not available when the tensor is built, there exist tensor decomposition methods for dealing with streaming data (Gujral et al., 2018), namely, the testing data can be incorporated continuously over time. Our framework can easily utilize such advanced methods to generalize to sentences with out-of-vocabulary words.
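A minimal CP/PARAFAC decomposition via alternating least squares (ALS) can be sketched as below. Note this is a generic three-factor solver for illustration only: the paper's decomposition shares the word factor w_r across the first two modes, which would require a symmetric variant, and the paper itself uses the Matlab tensor toolbox:

```python
import numpy as np

def cp_als(T, R, iters=100, seed=0):
    """Generic rank-R CP/PARAFAC of a 3-way tensor T by alternating least
    squares. Returns factor matrices F[0], F[1], F[2] (one per mode)."""
    rng = np.random.default_rng(seed)
    d = T.shape
    F = [rng.standard_normal((n, R)) for n in d]
    for _ in range(iters):
        for mode in range(3):
            others = [F[m] for m in range(3) if m != mode]
            # Khatri-Rao product of the two remaining factor matrices,
            # matching the C-order mode-`mode` unfolding below.
            kr = np.einsum('ir,jr->ijr', others[0], others[1]).reshape(-1, R)
            unfold = np.moveaxis(T, mode, 0).reshape(d[mode], -1)
            # Least-squares update of the current factor matrix.
            F[mode] = np.linalg.lstsq(kr, unfold.T, rcond=None)[0].T
    return F

def reconstruct(F):
    """Rebuild the tensor from its CP factors: sum_r F0_r (x) F1_r (x) F2_r."""
    return np.einsum('ir,jr,kr->ijk', F[0], F[1], F[2])
```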
With the tensor decomposition, we can find low-rank embeddings of sentences that capture the similarity of contextual patterns (Hosseinimotlagh and Papalexakis, 2018). In particular, stacking the sentence factors gives C = [s_1, s_2, . . . , s_R] ∈ R^{D×R}; writing C = [c_1^T, c_2^T, . . . , c_D^T]^T, the s-th row c_s^T is the embedding vector of sentence s. We measure the similarity of two sentences i, j by Euclidean distance:

d(i, j) = √(∑_{r=1}^{R} (c_i(r) − c_j(r))²),   (2)

where c(r) denotes the r-th element of c.
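Retrieving the KNNs of a sentence in the embedding space then reduces to a Euclidean nearest-neighbor search over the rows of C. The paper uses a k-d tree for efficiency; the brute-force sketch below (with an illustrative function name) is only to make the operation concrete:

```python
import numpy as np

def knn_indices(C, query_idx, K):
    """Indices of the K sentences nearest to sentence `query_idx`,
    by Euclidean distance between rows of the embedding matrix C (D x R)."""
    dists = np.sqrt(((C - C[query_idx]) ** 2).sum(axis=1))
    order = np.argsort(dists)
    # Exclude the query sentence itself, keep the K closest others.
    return [int(i) for i in order if i != query_idx][:K]
```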

Meta-Learner Training
We can train a meta-learner on the generated T train .
Learner. We use a Convolutional Neural Network (CNN) model as the basic learner, which has shown good performance on text classification (Kim, 2014) and text emotion distribution learning (Zhang et al., 2018). The CNN learner has the same architecture as (Zhang et al., 2018). In particular, given the input sentence s, we first stack a matrix X with the word vector of each word (Mikolov et al., 2013). Three convolution layers with kernel sizes 3, 4, and 5 are then applied to X separately, and their outputs are concatenated. A fully connected layer follows, together with a soft-max operation, to obtain the final emotion probability prediction. We use the Kullback-Leibler (K-L) divergence L_{K-L} as the objective to measure the distance between the predicted distribution ŷ and the true distribution y:

L_{K-L} = ∑_{k=1}^{K} ∑_{j=1}^{C} y_j^{(k)} ln (y_j^{(k)} / ŷ_j^{(k)}),

where K is the number of training samples. Similarly to (Zhang et al., 2018), we also optimize the classification accuracy of the dominant emotion with a cross-entropy (CE) objective, which is shown to improve EDL performance:

L_{CE} = −∑_{k=1}^{K} ∑_{j=1}^{C} 1(y_j^{(k)}) ln ŷ_j^{(k)},

where 1(y_j) = 1 if y_j is the maximal value of y, and 1(y_j) = 0 otherwise. The final objective is a weighted sum of L_{K-L} and L_{CE}:

L = γ L_{K-L} + (1 − γ) L_{CE},

where γ is the weight factor, 0 ≤ γ ≤ 1.
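The combined objective can be sketched in a few lines. This is a numpy illustration of the loss only (the paper's learner is a PyTorch CNN), and the default γ = 0.7 follows the experimental setting reported later:

```python
import numpy as np

def edl_loss(y_true, y_pred, gamma=0.7, eps=1e-12):
    """Weighted sum L = gamma * L_KL + (1 - gamma) * L_CE over a batch of
    distributions. `eps` guards the logarithms (an implementation detail)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # K-L divergence between true and predicted distributions.
    kl = np.sum(y_true * np.log((y_true + eps) / (y_pred + eps)), axis=1).mean()
    # Cross-entropy on the dominant emotion: 1(y_j) selects the argmax of y.
    dominant = y_true.argmax(axis=1)
    ce = -np.log(y_pred[np.arange(len(y_pred)), dominant] + eps).mean()
    return gamma * kl + (1 - gamma) * ce
```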
Training the meta-learner. To train the learner f_θ, we optimize the parameters θ of the learner f such that a small number of gradient steps on a new task will produce maximally effective behavior on that task, namely,

min_θ ∑_{T_i} L_{T_i(Q)}(f_{θ'_i}),   (3)

where T_i is a randomly sampled task in T_train, θ'_i are the trained optimal parameters on task T_i, and T_i(Q) denotes that the loss is computed on the query set Q. Equation (3) is optimized via stochastic gradient descent (SGD):

θ ← θ − β ∇_θ ∑_{T_i} L_{T_i(Q)}(f_{θ'_i}),

where β is the meta-learning rate. Similarly, SGD is used to compute θ'_i:

θ'_i = θ − α ∇_θ L_{T_i(S)}(f_θ),

where T_i(S) denotes that the loss is computed on S, and α is the learning rate.
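The meta-update can be illustrated on a toy one-parameter regression learner. The sketch below uses the first-order approximation of the outer gradient (dropping second-order terms), which is a common simplification and not necessarily what the paper implements; each task is a hypothetical tuple (support x, support y, query x, query y):

```python
import numpy as np

def mse_grad(theta, x, y):
    # Gradient of the mean squared error for the toy model f(x) = theta * x.
    return np.mean(2 * (theta * x - y) * x)

def meta_train(tasks, theta=0.0, alpha=0.01, beta=0.1, iters=200):
    """First-order sketch of the meta-update: one inner SGD step on each
    task's support set S, then an outer update of theta using the
    query-set gradient evaluated at the adapted parameters."""
    for _ in range(iters):
        meta_grad = 0.0
        for (xs, ys, xq, yq) in tasks:
            theta_i = theta - alpha * mse_grad(theta, xs, ys)  # inner step on S
            meta_grad += mse_grad(theta_i, xq, yq)             # outer grad on Q
        theta -= beta * meta_grad / len(tasks)
    return theta
```

On noise-free tasks whose true parameter is shared, this loop recovers that parameter, which is the intended behavior of the meta-objective.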

Adapting Meta-Learner
The meta-learner f_θ is trained to adapt to new data with only a few semantically similar training samples. Therefore, given a testing sample s, we first find its K-nearest neighbors K_s in the training data D_train, and then adapt f_θ on K_s by SGD:

θ ← θ − α ∇_θ L_{K_s}(f_θ).   (4)

Here we use the same learning rate α to be consistent with the meta-training procedure.
The overall algorithm is shown in Algorithm 1, where α is the learning rate, β is the meta-learning rate, γ is the weight of the distribution and classification losses, H is the window size for word-word co-occurrence counting, R is the rank of the tensor decomposition, K is the number of nearest neighbors, niter is the number of meta-training iterations, and L is the number of tasks sampled in each round of meta-training.

Experiment
To evaluate the proposed approach for learning emotion distribution from a small sample, we conduct intensive experiments on SemEval 2007 Task 14 (Strapparava and Mihalcea, 2007). To the best of our knowledge, this is the only publicly available English dataset with emotion distributions labeled by humans.

Dataset
SemEval 2007 Task 14 contains 1250 sentences of news headlines with 6 emotion intensities (anger, disgust, fear, joy, sadness, and surprise) labeled by humans. Each intensity value is non-negative. We normalize the annotated scores to get emotion distribution labels using the same procedure as Zhang et al. (2018). In particular, given an original intensity tuple (l_1, l_2, . . . , l_6), we first calculate the sum of all values, l = ∑_{k=1}^{6} l_k, and then normalize the tuple to get a distribution (l_1/l, l_2/l, . . . , l_6/l). If l = 0, the distribution is set to (1/6, 1/6, . . . , 1/6). It is possible to add another intensity value l_7 denoting the overall level of emotion to the original label; for example, if l_7 is 1, there is no emotion in the text, and if l_7 is 0, there are strong emotions. Similar to l_1–l_6, l_7 can be annotated manually. We can then normalize (l_1, l_2, . . . , l_7) to get a new distribution. In this way, we can better model the cases where a sentence expresses very little emotion overall. For simplicity and fair comparison, we only use l_1–l_6 in this paper.
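The normalization procedure above can be sketched directly (the function name is illustrative):

```python
def normalize_intensities(ls):
    """Normalize raw intensity scores (l_1, ..., l_6) into a distribution;
    fall back to the uniform distribution when all scores are zero."""
    total = sum(ls)
    if total == 0:
        return [1.0 / len(ls)] * len(ls)
    return [v / total for v in ls]
```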

Experiment Protocol
Because there is no publicly available small dataset for text emotion distribution evaluation, we follow existing practices (Hosseinimotlagh and Papalexakis, 2018) to simulate a small training set by randomly selecting 10% of the samples and using the remaining ones for testing. We run 10-fold cross validation, and report the averaged results.
Baselines. We compare the proposed method (denoted as EDL-Meta) with several baseline methods, including the SOTA method EDL-CNN (Zhang et al., 2018), LDL methods, and document vectorization methods. EDL-CNN uses the basic learner of EDL-Meta, and can be seen as a special case of EDL-Meta when K = 0. Similar to Zhang et al. (2018), we also extract the penultimate layer of EDL-CNN as features, and fit several typical LDL methods: PT-Bayes, PT-SVM, AA-KNN, AA-BP, SA-LDSVR, SA-IIS, SA-BFGS, SA-CPNN for comparison. The basic principles of each method are briefly summarized in subsection 2.1. In addition, we apply three SOTA document vectorization methods -doc2vec (Le and Mikolov, 2014), InferSent (Conneau et al., 2017), and BERT (Devlin et al., 2019) -and use the extracted vectors to train a linear regressor to predict emotion distributions.

Evaluation Metrics
Evaluating the performance of distribution learning is challenging, because we need to measure the prediction results of fine-grained intensity values.
In (Geng and Ji, 2013), the authors propose six metrics, and suggest that each metric may reflect certain aspects of an algorithm; a good algorithm should perform well on most of them. The six metrics are Euclidean↓, Sørensen↓, Squared χ²↓, K-L↓, Fidelity↑, and Intersection↑, and can be classified into two categories: distance metrics and similarity metrics (Geng and Ji, 2013). Distance metrics measure the distance between the predicted distribution and the true one, the smaller the better. Similarity metrics measure the similarity between the predicted distribution and the true one, the bigger the better. Here, ↑ means the bigger the better, and ↓ means the smaller the better. The formulas for all metrics are summarized in Table 2, where (q_1, q_2, . . . , q_C) is the predicted distribution, (p_1, p_2, . . . , p_C) is the true distribution, and C is the number of emotions.
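Under the standard definitions of these metrics from (Geng and Ji, 2013), they can be computed as follows. This is a sketch; we assume the exact forms in the paper's Table 2 match these common definitions, and `eps` is our guard against division by zero and log of zero:

```python
import math

def ldl_metrics(p, q, eps=1e-12):
    """The six LDL metrics for a true distribution p and predicted
    distribution q (standard forms from Geng and Ji, 2013)."""
    return {
        "Euclidean":    math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q))),
        "Sorensen":     sum(abs(pi - qi) for pi, qi in zip(p, q))
                        / sum(pi + qi for pi, qi in zip(p, q)),
        "SquaredChi2":  sum((pi - qi) ** 2 / (pi + qi + eps) for pi, qi in zip(p, q)),
        "KL":           sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)),
        "Fidelity":     sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)),
        "Intersection": sum(min(pi, qi) for pi, qi in zip(p, q)),
    }
```

For a perfect prediction (q = p), the distance metrics are 0 and the similarity metrics are 1, which is a quick sanity check for any implementation.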

Implementation
We use the 300-dimension word vectors trained on Google News from (Mikolov et al., 2013) as initial input; because SemEval 2007 Task 14 consists mostly of news headlines, the pre-trained word vectors are close to its domain. We implement our algorithm in PyTorch². Since EDL-CNN is not open-sourced, we also implement it in PyTorch. The implementation of doc2vec is adopted from gensim³. The BERT embedding (uncased base model, without fine-tuning) is adopted from an open-source implementation (Xiao, 2018). The implementations of the other methods are taken from the original papers (Geng and Ji, 2013; Conneau et al., 2017). For EDL-CNN, we use the network architecture in the original paper, and set the learning rate to 0.1 and the number of epochs to 25 (Kim, 2014).
In the same experiment setting as (Zhang et al., 2018), namely, 90% training data and 10% testing data of SemEval 2007, we obtain 0.344 Euclidean↓, 0.330 Sørensen↓, 0.344 Squared χ²↓, 0.437 K-L↓, 0.853 Fidelity↑, and 0.670 Intersection↑, which are slightly better than the results reported in (Zhang et al., 2018). For fair comparison, we use the default parameters of all other methods, which are reported to give the best performance. For EDL-Meta, we empirically set α = 0.01, β = 0.1, L = 5, and H = 5. γ is set to 0.7 according to (Zhang et al., 2018), and we do not tune it further. 1000 epochs are used to train the meta-learner. To find the optimal K and R, we run a 5-fold cross-validation grid search, and set K = 20 and R = 5. The alternating least squares method is used to compute the tensor decomposition, and we use the implementation in the Matlab tensor toolbox⁴. The k-d tree method is used to find the KNNs of a sample efficiently. We run all experiments on a desktop with an Intel(R) Core(TM) i9-7900X CPU at 3.30GHz, 64GB RAM, and Nvidia GeForce GTX 1080 Ti (×2).

Results
The results show that EDL-Meta outperforms the other methods by about 5% on most metrics. Although EDL-CNN can perform well with more training samples (Zhang et al., 2018), on the small sample its performance is not as stable as EDL-Meta's. Even simple KNN obtains slightly better results than EDL-CNN on most metrics. Likewise, the other methods cannot get stable results on all metrics. This indicates that EDL-Meta is more suitable for small-sample EDL.

Low-rank Embedding
To investigate the effectiveness of the tensor decomposition method, we use doc2vec (Le and Mikolov, 2014) and InferSent (Conneau et al., 2017) to train sentence vectors on D, and find the KNNs of a sample by cosine similarity, denoted as doc2vec and InferSent, respectively. In addition, we also use randomly selected K neighbors to train and test EDL-Meta, denoted as random. We denote our tensor decomposition method as tensor. The results are shown in Table 3. Compared to doc2vec, InferSent, and random, our method gets better results on all metrics. The doc2vec and InferSent embedding methods show slightly lower performance, even worse than the random version: doc2vec usually assumes a large pool of training data (Le and Mikolov, 2014), while InferSent also does not transfer well to a task with a small training sample. The tensor decomposition method is more suitable for small-sample learning.

Meta-Training and Adapting
To investigate the effect of the proposed meta-training and adapting procedures, we conduct another experiment. In particular, with the same experiment setting, we replace the meta-training procedure (denoted as Meta.) with normal batch training (batch size 50) and the adaptation procedure (denoted as Adap.) with a normal evaluation procedure, respectively. We can then get four versions of EDL-Meta, namely, without Meta. + without Adap., without Meta. + with Adap., with Meta. + without Adap., and with Meta. + with Adap. The first one is exactly EDL-CNN, and the last one is the normal EDL-Meta. In this way, we can further examine the importance of the proposed meta-training and adapting procedures. The results are shown in Table 4. We can see that with both the meta-training and adapting procedures, EDL-Meta performs more stably than the other versions. In addition, the meta-training procedure plays a crucial role in boosting the performance, as the results of the third version are similar to those of the fourth. Finally, without meta-training, adaptation alone is of little use, as the performance of the second version decreases considerably.

Label Percentage
In the same experiment setting, we further investigate the influence of different training/testing partitions by comparing the performance of EDL-CNN and EDL-Meta under each partition. The results are shown in Figure 3. We find that with a small training sample, there is a clear gap between EDL-CNN and EDL-Meta, but with more training samples (>50%), the two converge to similar results. Therefore, it is likely that the learning ability of EDL-Meta is limited by its basic learner. Further experiments with different learners deserve investigation.

Conclusion
In this paper, we propose an efficient meta-learning approach to learn text emotion distributions from a small sample. In addition, to find the K-nearest semantically similar neighbors, we propose to learn sentence embeddings by tensor decomposition. Experiments on SemEval 2007 Task 14 show that the proposed approach outperforms existing EDL methods on small-sample emotion distribution learning. We further investigate the low-rank embedding method, the meta-training procedure, and the adaptation procedure. We find that leveraging the nearest semantically similar neighbors is an effective way to enable small-sample EDL. We also find that the meta-training procedure can be a good alternative to normal batch training, especially for small-sample EDL.