Dynamic Memory Induction Networks for Few-Shot Text Classification

This paper proposes Dynamic Memory Induction Networks (DMIN) for few-shot text classification. The model develops a dynamic routing mechanism over static memory, enabling it to better adapt to unseen classes, a critical capability for few-shot classification. The model also expands the induction process with supervised learning weights and query information to enhance the generalization ability of meta-learning. The proposed model improves on the previous state of the art significantly, by 2∼4% accuracy on the miniRCV1 and ODIC datasets. Detailed analysis is further performed to show how the proposed network achieves this improvement.


Introduction
Few-shot text classification, which requires models to perform classification with a limited number of training instances, is important for many applications but remains a challenging task. Early studies on few-shot learning (Salamon and Bello, 2017) employ data augmentation and regularization techniques to alleviate overfitting caused by data sparseness. More recent research leverages meta-learning (Finn et al., 2017; Zhang et al., 2018; Sun et al., 2019) to extract transferable knowledge among meta-tasks in meta episodes.
A key challenge for few-shot text classification is inducing class-level representations from support sets (Gao et al., 2019), in which key information is often lost when switching between meta-tasks. Recent solutions (Gidaris and Komodakis, 2018) leverage a memory component to maintain the model's learning experience, e.g., by finding content from a supervised stage that is similar to the unseen classes, leading to state-of-the-art performance. However, the memory weights are static during inference, which limits the model's capability to adapt to new classes. Another prominent challenge is instance-level diversity arising from various sources (Gao et al., 2019; Geng et al., 2019), which makes it difficult to find a fixed prototype for a class (Allen et al., 2019). Recent research has shown that models can benefit from query-aware methods (Gao et al., 2019).
In this paper we propose Dynamic Memory Induction Networks (DMIN) to further tackle the above challenges. DMIN utilizes dynamic routing (Sabour et al., 2017; Geng et al., 2019) to render more flexibility to memory-based few-shot learning (Gidaris and Komodakis, 2018) in order to better adapt to support sets, leveraging the routing component's capacity to automatically adjust the coupling coefficients during and after training. On top of that, we further develop induction models with query information to identify, among diverse instances in support sets, the sample vectors that are more relevant to the query. These two modules are jointly learned in DMIN.
The proposed model achieves new state-of-the-art results on the miniRCV1 and ODIC datasets, improving the best previous performance by 2∼4% accuracy. We perform detailed analysis to further show how the proposed network achieves the improvement.
Memory mechanisms have proven very effective in many NLP tasks (Tang et al., 2016; Das et al., 2017; Madotto et al., 2018). In the few-shot learning scenario, researchers have applied memory networks to store the encoded contextual information in each meta episode (Santoro et al., 2016; Cai et al., 2018; Kaiser et al., 2017). Specifically, Qi et al. (2018) and Gidaris and Komodakis (2018) build a two-stage training procedure and regard the class representations learned in the supervised stage as a memory component.

Overall Architecture
An overview of our Dynamic Memory Induction Networks (DMIN) is shown in Figure 1. The model is built on the two-stage few-shot framework of Gidaris and Komodakis (2018). In the supervised learning stage (upper, green subfigure), a subset of classes in the training data is selected as the base set, consisting of $C_{base}$ base classes, which is used to finetune a pretrained sentence encoder and to train a classifier.
In the meta-learning stage (bottom, orange subfigure), we construct an "episode" to compute gradients and update our model in each training iteration. For a C-way K-shot problem, a training episode is formed by randomly selecting C classes from the training set and choosing K examples within each selected class to act as the support set $S = \bigcup_{c=1}^{C}\{(x_{c,s}, y_{c,s})\}_{s=1}^{K}$. A subset of the remaining examples serves as the query set Q. Training on such episodes is conducted by feeding the support set S to the model and updating its parameters to minimize the loss on the query set Q.
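The episode construction above can be sketched as follows. This is an illustrative sketch only; the `sample_episode` function and the `data` format (a mapping from class label to a list of texts) are hypothetical, not part of the paper.

```python
import random

def sample_episode(data, C=5, K=1, n_query=10):
    """Build one C-way K-shot episode: pick C classes, then K support
    and n_query query examples per class (without overlap)."""
    classes = random.sample(sorted(data), C)
    support, query = [], []
    for c in classes:
        examples = random.sample(data[c], K + n_query)
        support += [(x, c) for x in examples[:K]]   # support set S
        query += [(x, c) for x in examples[K:]]     # query set Q
    return support, query
```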

Pre-trained Encoder
We expect that developing few-shot text classifiers should benefit from recent advances in pretrained models (Peters et al., 2018; Devlin et al., 2019; Radford et al.). Unlike recent work (Geng et al., 2019), we employ BERT-base (Devlin et al., 2019) for sentence encoding, which has been used in recent few-shot learning models (Bao et al., 2019; Soares et al., 2019). The model architecture of BERT (Devlin et al., 2019) is a multi-layer bidirectional Transformer encoder based on the original Transformer model (Vaswani et al., 2017). A special classification embedding ([CLS]) is inserted as the first token and a special token ([SEP]) is added as the final token. We use the d-dimensional hidden vector output at the [CLS] position as the representation e of a given text x: $e = E(x|\theta)$. The pretrained BERT model provides a powerful context-dependent sentence representation that can be used for various target tasks, making it well suited to few-shot text classification (Bao et al., 2019; Soares et al., 2019).
We finetune the pre-trained BERT encoder in the supervised learning stage. For each input document x, the encoder $E(x|\theta)$ (with parameters $\theta$) outputs a d-dimensional vector e. $W_{base}$ is a matrix that maintains a class-level vector for each base class, serving as a base memory for meta-learning. Both $E(x|\theta)$ and $W_{base}$ are further tuned in the meta-training procedure. We will show in our experiments that replacing the encoders of previous models with a pre-trained encoder already outperforms the corresponding state-of-the-art models, and that the proposed DMIN further improves over that.

Dynamic Memory Module
At the meta-learning stage, to induce class-level representations from given support sets, we develop a dynamic memory module (DMM) based on knowledge learned in the supervised learning stage through the memory matrix $W_{base}$. Unlike static memory (Gidaris and Komodakis, 2018), DMM utilizes dynamic routing (Sabour et al., 2017) to render more flexibility to the memory learned from base classes, in order to better adapt to support sets. The routing component can automatically adjust the coupling coefficients during and after training, which inherently suits the needs of few-shot learning.
Specifically, the instances in the support sets are first encoded by BERT into sample vectors $\{e_{c,s}\}_{s=1}^{K}$ and then fed to the following dynamic memory routing process.

Dynamic Memory Routing Process
The dynamic memory routing process, denoted as DMR, is presented in Algorithm 1.
Given a memory matrix M (here $W_{base}$) and a sample vector $q \in \mathbb{R}^d$, the algorithm adapts the sample vector based on the memory M learned in the supervised learning stage:

$\hat{q} = DMR(M, q)$. (1)

First, for each entry $m_i \in M$, the standard matrix-transformation and squash operations of dynamic routing (Sabour et al., 2017) are applied to the inputs:

$\hat{m}_{ij} = squash(W_j m_i + b_j)$, (2)

$\hat{q}_j = squash(W_j q + b_j)$, (3)

where $squash(x) = \frac{\|x\|^2}{1+\|x\|^2} \frac{x}{\|x\|}$, and the transformation weights $W_j$ and bias $b_j$ are shared across the inputs to fit the few-shot learning scenario.
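For concreteness, the squash non-linearity can be implemented as a small sketch (a standard formulation from Sabour et al., 2017; the epsilon guard is our addition for numerical safety):

```python
import numpy as np

def squash(x):
    """Scale a vector's norm into [0, 1) while preserving its direction:
    squash(x) = (||x||^2 / (1 + ||x||^2)) * x / ||x||."""
    norm_sq = float(np.dot(x, x))
    return (norm_sq / (1.0 + norm_sq)) * x / (np.sqrt(norm_sq) + 1e-9)
```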
We then calculate the Pearson Correlation Coefficients (PCCs) (Hunt, 1986; Yang et al., 2019) between $\hat{m}_{ij}$ and $\hat{q}_j$:

$p_{ij} = \tanh(PCCs(\hat{m}_{ij}, \hat{q}_j))$, (4)

where, for two vectors $x_1$ and $x_2$, $PCCs(x_1, x_2) = \frac{cov(x_1, x_2)}{\sigma_{x_1}\sigma_{x_2}}$. Since PCCs values lie in the range [-1, 1], they can be used to encourage or penalize the routing parameters. The routing iteration process can now adjust the coupling coefficients, denoted $d_i$, with regard to the input capsules $\hat{m}_i$, $\hat{q}$ and the higher-level capsules $v_j$.
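The PCCs computation can be sketched directly from its definition (function name ours):

```python
import numpy as np

def pccs(x1, x2):
    """Pearson correlation coefficient between two vectors; lies in [-1, 1]."""
    a, b = x1 - x1.mean(), x2 - x2.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```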
Since our goal is to develop a dynamic routing mechanism over memory for few-shot learning, we add the PCCs term to the routing agreement in every routing iteration, as shown in Eq. 8.
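Putting the pieces together, a simplified, self-contained sketch of the DMR process might look like the following. It is an assumption-laden illustration, not the paper's exact algorithm: we use a single output capsule, a single shared transform W (rather than per-capsule $W_j$), and we assume Eq. 8 adds the PCCs term p to the standard routing agreement.

```python
import numpy as np

def squash(x):
    n2 = float(np.dot(x, x))
    return (n2 / (1.0 + n2)) * x / (np.sqrt(n2) + 1e-9)

def pccs(x1, x2):
    a, b = x1 - x1.mean(), x2 - x2.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dmr(M, q, W, b, iters=3):
    """Dynamic memory routing sketch: adapt sample vector q using memory M.
    M: (n, d) memory entries; q: (d,); W: (d, d) shared transform; b: (d,)."""
    m_hat = np.array([squash(W @ m + b) for m in M])        # transformed memory
    q_hat = squash(W @ q + b)                               # transformed sample
    p = np.array([np.tanh(pccs(m, q_hat)) for m in m_hat])  # PCCs adjustments
    d_logits = np.zeros(len(M))                             # routing logits
    for _ in range(iters):
        d = np.exp(d_logits - d_logits.max())
        d /= d.sum()                                        # coupling coefficients
        v = squash((d[:, None] * m_hat).sum(axis=0))        # candidate output
        d_logits = d_logits + m_hat @ v + p                 # agreement + PCCs
    return v
```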
The Dynamic Memory Module (DMM) uses DMR to adapt the sample vectors $e_{c,s}$, guided by the memory $W_{base}$. That is, the adapted sample vector is computed as $\hat{e}_{c,s} = DMR(W_{base}, e_{c,s})$.

Query-enhanced Induction Module
After the sample vectors $\{\hat{e}_{c,s}\}_{s=1}^{K}$ are adapted and the query vectors $\{e_q\}_{q=1}^{L}$ are encoded by the pretrained encoder, we incorporate the queries to build a Query-guided Induction Module (QIM). The aim is to identify, among the (adapted) sample vectors of the support sets, the vectors that are most relevant to the query, in order to construct class-level vectors that better classify the query. Since dynamic routing automatically adjusts the coupling coefficients to enhance related (e.g., similar) queries and sample vectors and to penalize unrelated ones, QIM reuses the DMR process, treating the adapted sample vectors as memory of background knowledge about the novel classes, and induces a class-level representation from the adapted sample vectors that are more relevant/similar to the query under concern:

$\hat{e}_c = DMR(\{\hat{e}_{c,s}\}_{s=1}^{K}, e_q)$. (10)

Similarity Classifier
In the final classification stage, we feed the novel class vector $\hat{e}_c$ and query vector $e_q$ to the classifier discussed above in the supervised training stage and obtain the classification score. The standard setting for neural network classifiers is, after extracting the feature vector $e \in \mathbb{R}^d$, to estimate the classification probability vector p by first computing the raw classification score $s_k$ of each category $k \in [1, K^*]$ using the dot-product operator $s_k = e^{\top} w_k^*$, and then applying the softmax operator across all $K^*$ classification scores. However, this type of classifier does not fit few-shot learning, where completely novel categories appear. In this work, we compute the raw classification scores using a cosine similarity operator:

$s_k = \tau \cdot \bar{e}^{\top} \bar{w}_k^*$,

where $\bar{e} = \frac{e}{\|e\|}$ and $\bar{w}_k^* = \frac{w_k^*}{\|w_k^*\|}$ are $l_2$-normalized vectors, and $\tau$ is a learnable scalar. After the base classifier is trained, all feature vectors that belong to the same class must closely match the single classification weight vector of that class, so the base classification weights $W_{base} = \{w_b\}_{b=1}^{C_{base}}$ trained in the first stage can be seen as feature vectors of the base classes.
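A minimal sketch of the cosine-similarity classifier described above (the fixed tau value and function names are ours for illustration; in the model tau is learned):

```python
import numpy as np

def cosine_scores(e, W, tau=10.0):
    """Raw classification scores s_k = tau * cos(e, w_k).
    e: (d,) feature vector; W: (K, d) per-class weight vectors."""
    e_bar = e / (np.linalg.norm(e) + 1e-9)
    W_bar = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-9)
    return tau * (W_bar @ e_bar)

def classify(e, W, tau=10.0):
    """Class probabilities via softmax over the cosine scores."""
    s = cosine_scores(e, W, tau)
    z = np.exp(s - s.max())
    return z / z.sum()
```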
In the few-shot classification scenario, we feed the query vector e q and novel class vector e c to the classifier and get the classification scores in a unified manner.

Objective Function
In the supervised learning stage, the training objective is to minimize the cross-entropy loss over the $C_{base}$ base classes given an input text x and its label y:

$L(x, y) = -\sum_{k=1}^{C_{base}} y_k \log \hat{y}_k$,

where y is the one-hot representation of the ground-truth label, and $\hat{y}$ contains the predicted probabilities of the base classes, with $\hat{y}_k = softmax(s_k)$.
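The objective can be sketched as follows (a standard cross-entropy over softmax-normalized scores; function names are ours):

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def cross_entropy_loss(y_onehot, scores):
    """L = -sum_k y_k * log(softmax(s)_k), matching the objective above."""
    y_hat = softmax(scores)
    return float(-(y_onehot * np.log(y_hat + 1e-12)).sum())
```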
In the meta-training stage, for each meta episode, given the support set S and query set $Q = \{(x_q, y_q)\}_{q=1}^{L}$, the training objective is to minimize the cross-entropy loss over the C novel classes.
where $\hat{y}_q = softmax(s_q)$ contains the predicted probabilities of the C novel classes in this meta episode, with $s_q = \{s_{q,c}\}_{c=1}^{C}$ from Equation 12. We feed the support set S to the model and update its parameters to minimize the loss on the query set Q in each meta episode.

Dataset and Evaluation Metrics
We evaluate our model on the miniRCV1 (Jiang et al., 2018) and ODIC (Geng et al., 2019) datasets. Following previous work (Snell et al., 2017; Geng et al., 2019), we use few-shot classification accuracy as the evaluation metric. We average over 100 and 300 randomly generated meta-episodes from the testing sets of miniRCV1 and ODIC, respectively. We sample 10 test texts per class in each episode for evaluation in both the 1-shot and 5-shot scenarios.

Implementation Details
We use the Google pre-trained BERT-Base model as our text encoder and fine-tune it during training. The number of base classes $C_{base}$ is set to 100 on ODIC and 20 on miniRCV1. The number of DMR iterations is 3. We build episode-based meta-training models with C = [5, 10] and K = [1, 5] for comparison. In addition to the K sample texts used as the support set, the query set has 10 query texts for each of the C sampled classes in every training episode. For example, there are 10 × 5 + 5 × 5 = 75 texts in one training episode for a 5-way 5-shot experiment.
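The episode-size arithmetic above can be checked directly (helper name ours):

```python
def episode_size(C, K, n_query=10):
    """Texts per training episode: C*K support texts plus C*n_query queries."""
    return C * K + C * n_query

# 5-way 5-shot: 25 support + 50 query texts per episode
```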
Overall Performance

The accuracy and standard deviations of the models are shown in Tables 1 and 2. DMIN consistently outperforms all existing models and achieves new state-of-the-art results on both datasets. The differences between DMIN and all the other models are statistically significant under a one-tailed paired t-test at the 95% significance level. Note that LwoF builds a two-stage training procedure with a memory module learned in the supervised stage and used in the meta-learning stage, but its memory mechanism is static after training, while DMIN uses dynamic memory routing to automatically adjust the coupling coefficients after training in order to generalize to novel classes, and outperforms LwoF significantly. Note also that the performance of some of the baseline models (Rel. Net and Ind. Net) reported in Tables 1 and 2 is higher than that in Geng et al. (2019) because we used BERT to replace the BiLSTM-based encoders. The BERT encoder strengthens the baselines with its powerful contextual representation ability; even against these stronger baselines, the proposed DMIN achieves consistently better results on both datasets with its dynamic memory routing method.

Ablation Study
We analyze the effect of different components of DMIN on ODIC in Table 3. Specifically, we remove DMM and QIM, and vary the number of DMR iterations. The best performance is achieved with 3 iterations. The results show the effectiveness of both the dynamic memory module and the query-enhanced induction module.

Figure 2 shows the t-SNE visualization (Maaten and Hinton, 2008) of support sample vectors before and after DMM under a 10-way 5-shot setup on ODIC. We randomly select a support set with 50 texts (10 texts per class) from the ODIC testing set and obtain the sample vectors before and after DMM, i.e., $\{e_{c,s}\}$ and $\{\hat{e}_{c,s}\}$ for $c = 1, \ldots, 5$ and $s = 1, \ldots, 10$. The support vectors produced by DMM are better separated, demonstrating the effectiveness of DMM in leveraging the supervised learning experience to encode semantic relationships between lower-level instance features and higher-level class features for few-shot text classification.

Conclusion
We propose Dynamic Memory Induction Networks (DMIN) for few-shot text classification, which combine external working memory with dynamic routing, leveraging the former to track previous learning experience and the latter to adapt and generalize better to support sets, and hence to unseen classes. The model achieves new state-of-the-art results on the miniRCV1 and ODIC datasets. Since dynamic memory can serve as a learning mechanism more general than its use here for few-shot learning, we will investigate this type of model in other learning problems.

Figure 1: An overview of the Dynamic Memory Induction Network with a 3-way 2-shot example.


Figure 2: Effect of the Dynamic Memory Module in a 10-way 5-shot setup.


Table 1: Comparison of accuracy (%) on miniRCV1 with standard deviations.

Table 2: Comparison of accuracy (%) on ODIC with standard deviations.

Table 3: Ablation study of accuracy (%) on ODIC in a 5-way setup.