Hierarchical Attention Prototypical Networks for Few-Shot Text Classification

Most effective current methods for text classification rely on large-scale labeled data and a large number of parameters, but these models are not applicable when supervised training data are scarce and difficult to collect. In this work, we propose hierarchical attention prototypical networks (HAPN) for few-shot text classification. We design feature level, word level, and instance level multi cross attention for our model to enhance the expressive ability of the semantic space, so that it can highlight or weaken the importance of features, words, and instances separately. We verify the effectiveness of our model on two standard benchmark few-shot text classification datasets, FewRel and CSID, and achieve state-of-the-art performance. The visualization of the hierarchical attention layers illustrates that our model can capture more important features, words, and instances. In addition, our attention mechanism increases support set augmentability and accelerates convergence in the training stage.


Introduction
The dominant text classification models in deep learning (Kim, 2014; Zhang et al., 2015a; Yang et al., 2016) require a considerable amount of labeled data to learn a large number of parameters. However, such methods may have difficulty learning the semantic space when only a few data are available. Few-shot learning has become an effective approach to this challenge: it can train a neural network with few parameters on few data and still achieve good performance. A typical example of this approach is prototypical networks (Snell et al., 2017), which average the vectors of the few support instances as the class prototype, compute the distance between the target query and each prototype, and then classify the query into the class of the nearest prototype. However, prototypical networks are coarse and do not consider the adverse effects of various kinds of noise in the data, which weakens the discrimination and expressiveness of the prototype.
In this paper, we propose hierarchical attention prototypical networks for few-shot text classification, which use attention mechanisms at three levels. For feature level attention, we use convolutional neural networks to obtain feature scores, which differ across classes. For word level attention, we adopt an attention mechanism to learn the importance of each word hidden state in an instance. For instance level multi cross attention, the multi cross attention between the support set and the target query lets us determine the importance of different instances in the same class and enables the model to obtain a more discriminative prototype for each class.
In a practical scenario, we apply HAPN to intention detection for our open domain chatbots with different characters. If we create a chatbot for elderly people, for example, the user intentions will focus on children, health, or expectation, so we can define specific intentions and supply related responses. Because only a few data are needed, we can expand the number of classes quickly. The model helps the chatbot identify user intentions precisely and makes the dialogue process smoother, more knowledgeable, and more controllable.
Our contribution has three main parts: first, we propose hierarchical attention prototypical networks for few-shot text classification; second, we achieve state-of-the-art performance on the FewRel and CSID datasets; third, our experiments show that our model converges faster and is more extensible.

Text Classification
Text classification is an important task in Natural Language Processing, and many models have been proposed to solve it. Traditional methods mainly focus on feature engineering, such as bag-of-words or n-grams (Wang and Manning, 2012), or on SVMs (Tang et al., 2015). Among neural network based methods, Kim (2014) applies convolutional neural networks to sentence classification. Johnson and Zhang (2015) then use a one-hot word order CNN, and Zhang et al. (2015b) apply a character level CNN. C-LSTM (Zhou et al., 2015) combines CNN and RNN for sentence representation and text classification. Yang et al. (2016) explore the hierarchical structure of documents for classification: they use a GRU-based attention to build representations of sentences and another GRU-based attention to aggregate them into a document representation. However, the above supervised learning methods require large-scale labeled data and cannot classify unseen classes.

Few-Shot Learning
Few-Shot Learning (FSL) aims to solve classification problems by training a classifier with few instances in each class, and it can be applied to unseen classes. Early works use transfer learning approaches: Caruana (1994) and Bengio (2011) adapt pre-trained models to the target task. Koch et al. (2015) then explore a method for learning siamese neural networks, which employ a unique structure to rank similarity between inputs. Vinyals et al. (2016) use matching networks to map a small labeled support set and an unlabeled example to its label, obviating the need for fine-tuning to adapt to new class types. Prototypical networks (Snell et al., 2017) learn a metric space in which the model can perform well by computing the distance between the query and the prototype representation of each class and classifying the query into the class of the nearest prototype. Sung et al. (2018) propose two-branch relation networks, which learn to compare a query against few-shot labeled support samples. The Dual TriNet structure can efficiently and directly augment multi-layer visual features to boost few-shot classification. However, all of the above works mainly concentrate on the computer vision field, and research and applications in the NLP field are extremely limited. Recently, an adaptive metric learning approach has been proposed that automatically determines the best weighted combination from a set of metrics obtained from meta-training tasks for a newly seen few-shot task such as intention classification. Han et al. (2018) present a relation classification dataset, FewRel, and adapt the most recent state-of-the-art few-shot learning methods to it. Gao et al. (2019) propose hybrid attention-based prototypical networks for noisy few-shot relation classification. However, these methods do not consider mining semantic information or reducing the impact of noise more precisely.
In addition, in most realistic settings the number of instances may increase gradually, so model capacity needs more attention.

Task Definition
In the few-shot text classification task, our goal is to learn a function $G(D, S, x) \rightarrow y$. $D$ is the labeled data, which we divide into three parts: $D_{train}$, $D_{validation}$, and $D_{test}$, each with its own label space. We use $D_{train}$ to optimize parameters, $D_{validation}$ to select the best hyperparameters, and $D_{test}$ to evaluate the model.
The "episode" training strategy proposed by Vinyals et al. (2016) has proved to be effective. For each training episode, we first sample a label set $L$ from $D_{train}$, then use $L$ to sample the support set $S$ and the query set $Q$, and finally feed $S$ and $Q$ to the model and minimize the loss. If $L$ includes $N$ different classes and each class of $S$ contains $K$ instances, we call the target problem $N$-way $K$-shot learning. In this paper, we consider $N = 5$ or $10$, and $K = 5$ or $10$.
More precisely, in an episode we are given a support set
$$S = \{(x_1^1, l_1), (x_1^2, l_1), \ldots, (x_1^{n_1}, l_1), \ldots, (x_N^1, l_N), (x_N^2, l_N), \ldots, (x_N^{n_N}, l_N)\},$$
which consists of $n_i$ text instances for each class $l_i \in L$. Here $x_i^j$ is the $j$-th support instance belonging to class $l_i$, and instance $x_i^j$ includes $T_{i,j}$ words $\{w_1, w_2, \ldots, w_{T_{i,j}}\}$.
Then $x$ is an unlabeled instance of the query set $Q$ to classify, and $y \in L$ is the output label predicted by $G$.
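The episode construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_episode`, the toy data, and the number of query instances per class (`n_query`) are assumptions, since the text does not state how many queries are drawn per class.

```python
import random

def sample_episode(data, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode.

    data: dict mapping a label to the list of its instances (hypothetical layout).
    Returns the sampled label set L, the support set S and the query set Q.
    """
    labels = rng.sample(sorted(data), n_way)           # label set L
    support, query = [], []
    for label in labels:
        picked = rng.sample(data[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]   # K support instances
        query += [(x, label) for x in picked[k_shot:]]     # disjoint query instances
    return labels, support, query

rng = random.Random(0)
# Toy stand-in for D_train: 8 classes with 20 instances each.
toy = {f"class_{c}": [f"text_{c}_{i}" for i in range(20)] for c in range(8)}
labels, support, query = sample_episode(toy, n_way=5, k_shot=5, n_query=2, rng=rng)
```

With `n_way=5` and `k_shot=5` the support set holds 25 labeled instances, and the query instances are drawn from the same five classes without overlapping the support set.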

Model Overview
The overall architecture of the Hierarchical Attention Prototypical Networks is shown in Figure 1. We introduce the different components in the following subsections:
Instance Encoder. Each instance in the support set or query set is first represented as an input vector by transforming each word into an embedding. For the sake of a lightweight and fast model, we implement this part with a one-layer convolutional neural network (CNN). For ease of comparison, its details are the same as those proposed by Han et al. (2018).
Hierarchical Attention. In order to get more important information from rare data, we adopt a hierarchical attention mechanism. Feature level attention enhances or reduces the importance of different features in each class, word level attention highlights the words that are important for the meaning of the instance, and instance level multi cross attention extracts the important support instances for different query instances. These three attention mechanisms work together to improve the classification performance of our model.
Prototypical Networks. Prototypical networks compute a prototype vector as the representation of each class; this vector is the mean of the embedded support instances belonging to that class. We compare the distances between all prototype vectors and a target query vector, and then classify the query into the class of the nearest one.

Instance Encoder
The instance encoder part consists of two layers: embedding layer and instance encoding layer.

Embedding Layer
Given an instance $x = \{w_1, w_2, \ldots, w_T\}$ with $T$ words, we use an embedding matrix $W_E$ to embed each word into a vector
$$\mathbf{w}_t = W_E w_t, \quad \mathbf{w}_t \in \mathbb{R}^d,$$
where $d$ is the word embedding dimension.

Encoding Layer
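Following Zeng et al. (2014), we apply a convolutional neural network as the encoding layer to get the hidden annotation of each word with a convolution kernel of window size $m$:
$$h_t = \mathrm{CNN}\left(\mathbf{w}_{t-\frac{m-1}{2}}, \ldots, \mathbf{w}_{t+\frac{m-1}{2}}\right).$$
In particular, if the word $w_t$ has a position embedding $\mathbf{p}_t$, we concatenate $\mathbf{w}_t$ and $\mathbf{p}_t$:
$$\mathbf{wp}_t = [\mathbf{w}_t \oplus \mathbf{p}_t],$$
where $\oplus$ denotes concatenation, and $h_t$ becomes
$$h_t = \mathrm{CNN}\left(\mathbf{wp}_{t-\frac{m-1}{2}}, \ldots, \mathbf{wp}_{t+\frac{m-1}{2}}\right).$$
Then we aggregate all $h_t$ to get the overall representation of instance $x$:
$$x = \{h_1, h_2, \ldots, h_T\}.$$
Finally, we define these two layers as a comprehensive function
$$x = g_\theta(x),$$
where $\theta$ denotes the network parameters to be learned.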
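To make the windowed encoding concrete, here is a minimal pure-Python sketch of a one-layer convolution over concatenated word and position embeddings. It assumes an odd window size, zero padding at both ends, and no nonlinearity. The text fixes none of these choices, and the function name `conv_encode`, the toy dimensions, and the random weights are purely illustrative.

```python
import random

def conv_encode(vectors, kernel, bias, m):
    """Slide a window of m word vectors over the instance (zero-padded at both ends).

    vectors: T input vectors of size d
    kernel:  hidden_size rows, each holding m * d weights
    bias:    hidden_size biases
    Returns T hidden vectors h_t of size hidden_size.
    """
    assert m % 2 == 1, "this sketch assumes an odd window size"
    d = len(vectors[0])
    half = (m - 1) // 2
    padded = [[0.0] * d] * half + vectors + [[0.0] * d] * half
    hidden = []
    for t in range(len(vectors)):
        # Flatten the m vectors centred on position t into one window.
        window = [v for vec in padded[t:t + m] for v in vec]
        hidden.append([sum(w * x for w, x in zip(row, window)) + b
                       for row, b in zip(kernel, bias)])
    return hidden

rng = random.Random(0)
T, d_word, d_pos, hidden_size, m = 6, 4, 2, 5, 3   # toy sizes, not the paper's
words = [[rng.uniform(-1, 1) for _ in range(d_word)] for _ in range(T)]
positions = [[rng.uniform(-1, 1) for _ in range(d_pos)] for _ in range(T)]
# Concatenate word and position embeddings (the ⊕ operation).
inputs = [w + p for w, p in zip(words, positions)]
kernel = [[rng.uniform(-1, 1) for _ in range(m * (d_word + d_pos))]
          for _ in range(hidden_size)]
bias = [0.0] * hidden_size
hidden = conv_encode(inputs, kernel, bias, m)
```

Each word yields one hidden vector, so `hidden` has the same length as the instance.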

Prototypical Networks
Prototypical networks (Snell et al., 2017) have achieved excellent performance in both few-shot image classification and few-shot text classification (Han et al., 2018; Gao et al., 2019), so our model is based on prototypical networks and aims to improve on them. The fundamental idea of prototypical networks is simple but efficient: we use a prototype vector $c_i$ as the representative feature of class $l_i$, and each prototype vector is calculated by averaging all the embedded instances in its support set:
$$c_i = \frac{1}{n_i} \sum_{j=1}^{n_i} g_\theta(x_i^j), \tag{8}$$
where $g_\theta(\cdot)$ is the instance encoder. Then the probability distribution over the classes in $L$ is produced by a softmax function over the distances between all prototype vectors and the target query $q$:
$$p_\theta(y = l_i \mid q) = \frac{\exp(-d(g_\theta(q), c_i))}{\sum_{l=1}^{|L|} \exp(-d(g_\theta(q), c_l))}.$$
As Snell et al. (2017) mention, squared Euclidean distance is a reasonable choice for $d$; however, in section 4.4.1 we introduce a more effective method, which combines squared Euclidean distance with class feature scores and achieves a definite improvement.
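The prototype averaging and nearest-prototype classification can be illustrated with a small sketch. It assumes the instances are already encoded as vectors and uses plain squared Euclidean distance. The helper names and the toy two-dimensional vectors are hypothetical.

```python
import math

def prototype(vectors):
    """Mean of the embedded support instances of one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_probabilities(query, prototypes):
    """Softmax over negative squared distances to each prototype."""
    logits = [-sq_euclidean(query, c) for c in prototypes]
    top = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy, already-encoded support vectors for two classes.
support = {"A": [[0.0, 0.0], [0.0, 2.0]], "B": [[4.0, 4.0], [6.0, 4.0]]}
labels = sorted(support)
protos = [prototype(support[label]) for label in labels]
probs = class_probabilities([1.0, 1.0], protos)
predicted = labels[probs.index(max(probs))]
```

Here the query lies at squared distance 1 from the prototype of class A and 25 from that of class B, so the sketch assigns it to class A.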

Hierarchical Attention
We focus on sentence-level text classification in this work. The proposed model gets a feature scores vector and transfers the support set of each class into a vector representation, on which we build a classifier to perform few-shot text classification.

Feature Level Attention
The same dimension can have different importance for different classes when we calculate the Euclidean distance. In other words, some feature dimensions are more discriminative for distinguishing a specific class in the feature space, while other features are confusing and useless.
We therefore apply a CNN-based feature attention mechanism, similar to the one proposed by Gao et al. (2019), as a class feature extractor. It depends on all the instances in the support set of each class and changes dynamically with different classes. Given a support set $S_i \in \mathbb{R}^{n_i \times T \times d}$ of class $l_i$ as the output of the instance encoder above, we apply a max pooling layer over each instance in $S_i$ to get a new feature map $S_{ci} \in \mathbb{R}^{n_i \times d}$. Then we use three convolution layers to obtain $\lambda_i \in \mathbb{R}^d$, the scores vector of class $l_i$. The specific structure of the class feature extractor is shown in Table 1.
Table 1: Structure of the class feature extractor (columns: layer name, kernel size, stride, output size).
Based on these scores, we obtain a new distance calculation method:
$$d(c_i, q') = (c_i - q')^2 \cdot \lambda_i,$$
where $q'$ is the query vector passed through the word level attention mechanism, which will be introduced in the next subsection.
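The effect of class feature scores on the distance can be sketched as follows, assuming the scores act as per-dimension weights on the squared differences between prototype and query. The score values are hand-picked for illustration. In the model they would come from the convolutional class feature extractor, which is not reproduced here.

```python
def weighted_sq_distance(prototype, query, scores):
    """Squared Euclidean distance with each dimension scaled by a class feature score."""
    return sum(s * (c - q) ** 2 for c, q, s in zip(prototype, query, scores))

query = [0.0, 0.0]
prototypes = {"A": [0.0, 3.0], "B": [2.0, 0.0]}
uniform = {"A": [1.0, 1.0], "B": [1.0, 1.0]}   # plain squared Euclidean distance
scores = {"A": [1.0, 0.1], "B": [1.0, 1.0]}    # hypothetical: dimension 2 is noisy for class A

def nearest(score_table):
    """Return the class whose prototype is closest under the given scores."""
    return min(prototypes,
               key=lambda c: weighted_sq_distance(prototypes[c], query, score_table[c]))

plain_choice = nearest(uniform)
weighted_choice = nearest(scores)
```

With uniform scores the query is closer to class B, whereas down-weighting the noisy dimension of class A makes class A the nearest prototype. This shows how the scores can change the decision.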

Word Level Attention
The importance of different words to the meaning of an instance is unequal, so it is worth identifying which words are useful and which are not. We therefore apply an attention mechanism (Yang et al., 2016) to find the important words and assemble them into a more informative instance vector $s^j$, defined as follows:
$$u_t^j = \tanh(W_w h_t^j + b_w),$$
$$v_t^j = {u_t^j}^\top u_w,$$
$$\alpha_t^j = \frac{\exp(v_t^j)}{\sum_{t'} \exp(v_{t'}^j)},$$
$$s^j = \sum_{t} \alpha_t^j h_t^j.$$
First, $W_w$ and $b_w$ followed by the activation function tanh form an MLP layer that transforms $h_t^j$ into the new hidden representation $u_t^j$. Next, we apply a dot product between $u_t^j$ and a word level weight vector $u_w$ to compute the similarity $v_t^j$ as the importance weight of $u_t^j$. Then we use a softmax function to normalize $v_t^j$ to $\alpha_t^j$. Finally, we calculate the instance level vector $s^j$ as the sum of $h_t^j$ weighted by $\alpha_t^j$. As in memory networks (Sukhbaatar et al., 2015), $u_w$ helps us select the important words in each instance; it is randomly initialized at the beginning of the training stage and optimized together with the network parameters $\theta$.
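The four steps of word level attention (MLP projection, dot product with the word level weight vector, softmax, weighted sum) can be sketched in a few lines. The parameter values below are fixed by hand for illustration. In the model they are learned, and the function name `word_attention` is an assumption.

```python
import math

def word_attention(hidden, W_w, b_w, u_w):
    """Return the attention weights and the weighted instance vector.

    hidden: T hidden word vectors h_t
    W_w, b_w: weights and bias of the tanh MLP layer
    u_w: word level weight vector
    """
    scores = []
    for h in hidden:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(W_w, b_w)]                   # u_t = tanh(W_w h_t + b_w)
        scores.append(sum(a * b for a, b in zip(u, u_w)))   # v_t = u_t . u_w
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                        # softmax over words
    s = [sum(a * h[k] for a, h in zip(alpha, hidden))
         for k in range(len(hidden[0]))]                     # weighted sum of h_t
    return alpha, s

hidden = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]   # toy hidden states for three words
W_w = [[1.0, 0.0], [0.0, 1.0]]                  # identity projection, for illustration
b_w = [0.0, 0.0]
u_w = [1.0, 1.0]
alpha, s = word_attention(hidden, W_w, b_w, u_w)
```

In this toy setting the third word receives the largest weight, and the first two words receive equal weights because their projections score identically against `u_w`.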

Instance Level Multi Cross Attention
Previous prototypical networks use the mean vector of the support instances as the class prototype. Because support instances are diverse and scarce, the gap between each support vector and the prototype may be wide; meanwhile, different query instances can be expressed in several ways, so not every instance in a support set contributes equally to the class prototype with respect to a given target query instance. To highlight the support instances that are useful clues for classifying a query instance correctly, we propose a multi cross attention mechanism.
Given a support set $S'_i \in \mathbb{R}^{n_i \times d}$ for class $l_i$ and a query vector $q' \in \mathbb{R}^d$, both encoded through the instance encoder and word level attention, we consider that each support vector $s_i^j$ in $S'_i$ has its own weight $\beta_i^j$ with respect to the query $q'$. Formula (8) is therefore rewritten as
$$c_i = \sum_{j=1}^{n_i} r_i^j,$$
where $r_i^j = \beta_i^j s_i^j$ is the weighted prototype vector, and $\beta_i^j$ is defined as follows:
$$s_{i\varphi}^j = f_\varphi(s_i^j), \quad q'_\varphi = f_\varphi(q'),$$
$$\tau_1 = |s_{i\varphi}^j - q'_\varphi|, \quad \tau_2 = s_{i\varphi}^j \odot q'_\varphi,$$
$$mca = [s_{i\varphi}^j \oplus q'_\varphi \oplus \tau_1 \oplus \tau_2],$$
$$\gamma_i^j = \mathrm{sum}\{\sigma(f_\phi(mca))\},$$
$$\beta_i^j = \frac{\exp(\gamma_i^j)}{\sum_{j'=1}^{n_i} \exp(\gamma_i^{j'})},$$
where $f_\varphi$ is a linear layer, $|\cdot|$ is the element-wise absolute value, and $\odot$ is the element-wise product. We use these two operations to get the difference information $\tau_1$ and $\tau_2$ between $s_i^j$ and $q'$, and then concatenate them all as the multi cross attention information $mca$. Furthermore, $f_\phi(\cdot)$ is a linear layer, $\sigma(\cdot)$ is a tanh activation function, and $\mathrm{sum}\{\cdot\}$ sums all elements of the vector. Finally, $\gamma_i^j$ is the weight of the $j$-th instance in the support set $S'_i$, and we use a softmax function to normalize it to $\beta_i^j$. Through the multi cross attention mechanism, the prototype can pay more attention to the query-related support instances and improve the capacity of the support set.
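A minimal sketch of instance level multi cross attention follows. It is one plausible reading of the description above, not the authors' code. It assumes that the first linear layer is applied to both the support vector and the query, and that the concatenation contains the two projected vectors together with their absolute difference and element-wise product. Weights are random and all names are hypothetical.

```python
import math
import random

def linear(weights, bias, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def multi_cross_attention(support, query, W1, b1, W2, b2):
    """Weight each support vector by its relation to the query, then sum."""
    q_proj = linear(W1, b1, query)
    gammas = []
    for s in support:
        s_proj = linear(W1, b1, s)
        tau1 = [abs(a - b) for a, b in zip(s_proj, q_proj)]   # element-wise |difference|
        tau2 = [a * b for a, b in zip(s_proj, q_proj)]        # element-wise product
        mca = s_proj + q_proj + tau1 + tau2                   # list concatenation
        gammas.append(sum(math.tanh(z) for z in linear(W2, b2, mca)))
    betas = softmax(gammas)
    prototype = [sum(b * s[k] for b, s in zip(betas, support))
                 for k in range(len(query))]
    return betas, prototype

rng = random.Random(0)
d, hidden = 3, 4                                   # toy sizes
W1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
b1 = [0.0] * d
W2 = [[rng.uniform(-1, 1) for _ in range(4 * d)] for _ in range(hidden)]
b2 = [0.0] * hidden
support = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.0]
betas, prototype = multi_cross_attention(support, query, W1, b1, W2, b2)
```

The weights `betas` sum to one, so the resulting prototype is a convex combination of the support vectors. With untrained random weights the ranking of instances is not meaningful, and only the data flow is illustrated.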

Experiments
In this section, we present the experimental results of our model. First, we evaluate our model on the FewRel and CSID datasets and achieve state-of-the-art results: our model outperforms the best baseline models by 1.11% and 1.64% respectively in the 10 way 5 shot setting. Then we show how our model works through a case study and visualization of the attention layers. We further demonstrate that the hierarchical attention increases the augmentability of the support set and the convergence speed of the model.

Datasets
FewRel. Few-Shot Relation Classification (Han et al., 2018) is a new large-scale supervised dataset. It consists of 70000 instances of 100 relations derived from Wikipedia, and each relation includes 700 instances. It also marks the head and tail entities in each instance, and the average number of tokens is 24.99. FewRel has 64 relations for training, 16 relations for validation, and 20 relations for test.
CSID. Character Studio Intention Detection is a dataset extracted from a real-world open domain chatbot. On the character studio platform, this chatbot sometimes has to change its character style so that it can adapt to different user groups and environments, which makes dialog query intention detection an important task. CSID consists of 24596 instances for 128 intentions; each intention includes 30 to 260 instances, and the average number of tokens in each instance is 11.52. We use 80, 18 and 30 intentions for training, validation, and test respectively.

Baselines
First, we compare our model with several traditional models such as Finetune and kNN. Then we compare our model with five state-of-the-art few-shot learning models based on neural networks: MetaN (Munkhdalai and Yu, 2017), GNN (Garcia and Bruna, 2018), SNAIL (Mishra et al., 2018), Proto (Snell et al., 2017) and PHATT (Gao et al., 2019).

Implementation details
We compare our models with seven baselines, and the implementation details are as follows.
For the FewRel dataset, we cite the results reported by Han et al. (2018), which include Finetune, kNN, MetaN, GNN, and SNAIL, and the results reported by Gao et al. (2019), which include Proto and PHATT. For a fair comparison, our model uses the same word embeddings and instance encoder hyperparameters as PHATT. In detail, we use GloVe (Pennington et al., 2014), consisting of 6B tokens and a 400K vocabulary, as our initialized word representation, and each word has a 50-dimensional vector. In addition, the position embedding dimension of a word is 10, and the max length of each instance is 40. Finally, we evaluate all models on the 5 way 5 shot and 10 way 5 shot settings.
For the CSID dataset, we implement all seven baseline models above as well as our models. We use the Baidu Encyclopedia word vectors as our initialized word representation; they include 745M tokens and a 5422K vocabulary, and each word has a 300-dimensional vector. The max length of each instance is 20. Finally, we evaluate all models on the 5 way 5 shot, 5 way 10 shot, 10 way 5 shot and 10 way 10 shot settings.
For the Finetune and kNN baselines, they learn the parameters on the support set with the CNN encoder. For the neural networks based baselines, we use the same hyper parameters as Han et al. (2018) proposed.
For our hierarchical attention prototypical networks, the window size of the CNN instance encoder is 3, the dimension of the hidden layer is 230, the learning rate is 0.1, the learning rate decay step is 3000 and the decay rate is 0.1. In addition, we train our model for 12000 episodes, and each episode consists of 20 classes.
In order to study the effects of different components, we refer to our models as HAPN-{FA, WA, IMCA}, where FA indicates feature level attention, WA indicates word level attention and IMCA indicates instance level multi cross attention.
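The learning rate settings above can be read as a step decay schedule. The sketch below assumes a staircase schedule in which the rate is multiplied by the decay rate every 3000 episodes. The text gives the three numbers but not the exact form of the schedule, so this is an assumption.

```python
def learning_rate(step, base_lr=0.1, decay_step=3000, decay_rate=0.1):
    """Staircase decay (assumed form): multiply by decay_rate every decay_step episodes."""
    return base_lr * decay_rate ** (step // decay_step)

# Learning rate at a few points of a 12000-episode run.
schedule = {step: learning_rate(step) for step in (0, 2999, 3000, 6000, 11999)}
```

Under this reading, the rate stays at 0.1 for the first 3000 episodes and is reduced by a factor of ten at each of the three later boundaries.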

Results and analysis
The experimental accuracies on CSID and FewRel are shown in Table 2 and Table 4 respectively. In this subsection, we show the effects of hierarchical attention, the support set augmentability of three Proto-based models, and a comparison of convergence speed.

Effects of hierarchical attention
Benefiting from hierarchical attention, our model achieves excellent performance. A case study of word level attention and instance level multi cross attention is shown in Table 3 for a 2 way 3 shot task on the FewRel dataset. The query instance actually belongs to the "mother" class, and our model should classify it into either the "mother" class or the "child" class. It is a difficult task because there are many similarities between the expressions of the two classes. With the help of word level attention, we highlight the importance of the word "daughter", which appears in both the query instance and the first support instance of class "mother"; this support instance then gets the highest attention score and contributes more to the prototype vector of the "mother" class, so our model can classify the query instance into the correct class in this confusing task.
Table 3: Support set of the case study (columns: Class, Word Attention, IMCAS, Support Set).
(1) mother
Cherie Gil is the daughter of Filipino actors Eddie Mesa and Rosemarie Gil, and sister of fellow actors, Michael de Mesa and the late Mark Gil.
When they reached adulthood, Pelias and Neleus found their mother Tyro and then killed her stepmother, Sidero, for having mistreated her.
It was here that the Queen Consort Jetsun Pema gave birth to a son on 5 February 2016, Jigme Namgyel Wangchuck.
(2) child
In 1421 Mehmed died and his son Murad II refused to honour his father's obligations to the Byzantines.
Henry Norreys was a lifelong friend of Queen Elizabeth and was the father of six sons, who included Sir John Norreys, a famous English soldier.
Jim Henson and his son Brian were impressed enough with Barron's style to offer him a job directing the pilot episode of "The Storyteller".
As shown in Figure 2, by using feature level attention, we also get the feature attention scores of the "mother" class and the "child" class respectively. Features with high scores are shown in dark colors, and features with low scores in light colors. Different classes may have different feature score vectors; in other words, the same feature has different importance for different classes. Our feature level attention can therefore highlight the useful features and weaken the noisy ones, so that the distance between the prototype vector and the query vector measures the difference between them more effectively.
Figure 2: (a) Feature attention scores of "mother" class; (b) feature attention scores of "child" class.
We treat the final prototype embedding vector as the features of each instance, and then obtain the distribution of features by principal component analysis in feature space, as shown in Figure 3. As we can see, the instances without hierarchical attention are more scattered and may overlap with each other, whereas the instances with hierarchical attention are more concentrated and discriminative, which indicates that our model learns a better semantic space that helps to distinguish confusing data.
Figure 3: The left blue points marked × are instances of "mother" class and the right orange points marked • are instances of "child" class.

Augmentability of support set
More support instances can contribute more useful information to the prototype vector; meanwhile, more noise is also added.
In this section, we define the support set augmentability (SSA) as the gain in accuracy when the support set is enlarged by the same number of instances for different models. We compare our model's SSA with that of other models, namely Proto and PHATT, on the 10 way FewRel task, with the shot number ranging from 5 to 25.
By using hierarchical attention, our model obtains strong robustness: it pays more attention to the important information in the support set and reduces the negative effects of noisy data. Thus, as shown in Figure 4, the support set augmentability of our model is larger than that of the other models. Thanks to these advantages, we can deploy our model in the cold start stage and gradually accumulate labeled support data in practical applications, improving the performance of the model day by day and thus the utilization of scarce data in realistic settings.

Convergence speed comparison
At the training stage, we also compare the convergence speed of Proto, PHATT, and HAPN on the 10 way 5 shot and 10 way 15 shot FewRel tasks. As shown in Figure 5, our model can be optimized more quickly than the other models. From the 10 way 5 shot setting to the 10 way 15 shot setting, the Proto model takes almost twice as long to reach 70% accuracy on the validation set; in other words, its convergence speed decreases sharply when we increase the number of support instances, but

Conclusion
Previous few-shot learning models for text classification apply text representations coarsely or neglect noisy information. We propose hierarchical attention prototypical networks consisting of feature level, word level and instance level multi cross attention, which highlight the important information in few data and learn a more discriminative prototype representation. In the experiments, our model achieves state-of-the-art performance on the FewRel and CSID datasets. HAPN not only increases support set augmentability but also accelerates convergence in the training stage.
In the future, we will contribute new text dataset to few-shot learning, explore better feature extrac-