Multi-Level Matching and Aggregation Network for Few-Shot Relation Classification

This paper presents a multi-level matching and aggregation network (MLMAN) for few-shot relation classification. Previous studies on this topic adopt prototypical networks, which calculate the embedding vector of a query instance and the prototype vector of the support set for each relation candidate independently. In contrast, our proposed MLMAN model encodes the query instance and each support set in an interactive way by considering their matching information at both local and instance levels. The final class prototype for each support set is obtained by attentive aggregation over the representations of support instances, where the weights are calculated using the query instance. Experimental results demonstrate the effectiveness of our proposed method, which achieves a new state-of-the-art performance on the FewRel dataset.


Introduction
Relation classification (RC) is a fundamental task in natural language processing (NLP), which aims to identify the semantic relation between two entities in text. For example, the instance "[London]$_{e1}$ is the capital of [the UK]$_{e2}$" expresses the relation "capital of" between the two entities London and the UK.
Some conventional relation classification methods (Bethard and Martin, 2007; Zelenko et al., 2002) adopted supervised training and suffered from the lack of large-scale manually labeled data. To address this issue, the distant supervision method (Mintz et al., 2009) was proposed, which annotates training data by heuristically aligning knowledge bases (KBs) with texts. However, the long-tail problem in KBs (Xiong et al., 2018; [...] et al., 2018) still exists and makes it hard to classify relations with very few training samples. This paper focuses on the few-shot relation classification task, which was designed to address the long-tail problem. In this task, only a few (e.g., 1 or 5) support instances are given for each relation, as shown by the example in Table 1.

Table 1: A data example of 5-way-5-shot relation classification in the FewRel development set. The correct relation class for the query instance is class A: mother. The instances for the other relation classes are omitted to save space.
The few-shot learning problem has been studied extensively in the computer vision (CV) field. Some methods adopt meta-learning architectures (Santoro et al., 2016; Ravi and Larochelle, 2016; Finn et al., 2017; Munkhdalai and Yu, 2017), which learn fast-learning abilities from previous experiences (e.g., the training set) and then rapidly generalize to new concepts (e.g., the test set). Some other methods use metric learning based networks (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017), which learn the distance distributions among classes. A simple and effective metric-based few-shot learning method is the prototypical network (Snell et al., 2017). In a prototypical network, query and support instances are encoded into an embedding space independently. Then, a prototype vector for each class candidate is derived as the mean of its support instances in the embedding space. Finally, classification is performed by calculating the distances between the embedding vector of the query and all class prototypes. The prototypical network has also been applied to few-shot relation classification recently (Han et al., 2018).

This paper proposes a multi-level matching and aggregation network (MLMAN) for few-shot relation classification. Different from prototypical networks, which represent support sets without dependency on query instances, our proposed MLMAN model encodes each query instance and each support set in an interactive way by considering their matching information at both local and instance levels. At the local level, the local context representations of a query instance and a support set are softly matched toward each other following the sentence matching framework (Chen et al., 2017). Then, the matched local representations are aggregated into an embedding vector for each query and each support instance using max and average pooling.
At the instance level, the matching degree between the query instance and each of the support instances is calculated via a multi-layer perceptron (MLP). Taking the matching degrees as weights, the instances in a support set are aggregated to form the class prototype for final classification. All these matching and aggregation layers in the MLMAN model are estimated jointly using training data. Since the representations of the support instances in each class are expected to be close to each other, an auxiliary loss function is further designed to measure the inconsistency among all support representations in each class.
In summary, our contributions in this paper are three-fold. First, a multi-level matching and aggregation network is proposed to encode query instances and class prototypes in an interactive fashion. Second, an auxiliary loss function measuring the consistency among support instances is designed. Third, our method achieves a new state-of-the-art performance on FewRel, a public few-shot relation classification dataset.

Relation Classification
Relation classification aims to identify the semantic relation between two entities in one sentence. In recent years, neural networks have been widely applied to this task. Zeng et al. (2014) employed position features and convolutional neural networks (CNNs) to capture the structure and contextual information respectively. Then, a max pooling operation was adopted to determine the most useful features. Wang et al. (2016) proposed multi-level attention CNNs, which captured both entity-specific attention and relation-specific pooling attention in order to better discern patterns in heterogeneous contexts. Zhou et al. (2016) proposed attention-based bidirectional long short-term memory networks (AttBLSTMs) to capture the most important semantic information in a sentence. All of these methods require a large amount of training data and cannot quickly adapt to a new class that has never been seen.

Metric Based Few-Shot Learning
In the few-shot learning paradigm, a classifier is required to generalize to new classes with only a small number of training samples. The metric based approach aims to learn a set of projection functions that take support and query samples from the target problem and classify them in a feed-forward manner. This approach has lower complexity and is easier to implement than the meta-learner based approach (Ravi and Larochelle, 2016; Finn et al., 2017; Santoro et al., 2016; Munkhdalai and Yu, 2017).
Some metric based few-shot learning methods have been developed for computer vision (CV) tasks, and all these methods encode each support or query image into a vector independently for classification. Koch et al. (2015) proposed a method for learning siamese neural networks, which employed a single shared structure to encode both support and query samples, plus one more layer computing the induced distance metric between the pair. Vinyals et al. (2016) proposed to learn a matching network augmented with attention and external memories. They also proposed an episode-based training procedure, which is based on the principle that test and training conditions must match and has been adopted by many subsequent studies. Snell et al. (2017) proposed prototypical networks that learn a metric space in which classification can be performed by computing distances to prototype representations of all classes, where the prototype representation of each class is the mean of all its support samples. Garcia and Bruna (2017) defined a graph neural network architecture to assimilate generic message-passing inference algorithms, which generalized the above three models.
Regarding few-shot relation classification, Han et al. (2018) adopted prototypical networks to build baseline models on the FewRel dataset. Gao et al. (2019) proposed hybrid attention-based prototypical networks to handle noisy training samples in few-shot learning. In this paper, we improve the conventional prototypical networks for few-shot relation classification by encoding the query instance and class prototype interactively through multi-level matching and aggregation.
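To make the baseline concrete, the prototypical network described above can be sketched in a few lines. This is only an illustrative sketch: it assumes the instance embeddings are already available as plain lists of floats, and the function names `prototype` and `classify` are hypothetical. Snell et al. (2017) use the squared Euclidean distance; the nearest prototype is the same under either the squared or the plain distance.

```python
import math


def prototype(support_vectors):
    """Class prototype: the element-wise mean of the support embeddings."""
    count = len(support_vectors)
    return [sum(column) / count for column in zip(*support_vectors)]


def classify(query_vector, support_sets):
    """Return the label whose prototype is nearest to the query embedding.

    support_sets maps each candidate label to a list of support embeddings.
    The prototypes are computed without looking at the query.
    """
    prototypes = {label: prototype(vecs) for label, vecs in support_sets.items()}
    return min(prototypes, key=lambda label: math.dist(query_vector, prototypes[label]))
```

Note that the prototypes here do not depend on the query, which is the property that MLMAN changes.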

Sentence Matching
Sentence matching is essential for many NLP tasks, such as natural language inference (NLI) (Bowman et al., 2015) and response selection (Lowe et al., 2015). Some sentence matching methods mainly rely on sentence encoding (Mueller and Thyagarajan, 2016; Conneau et al., 2017), which encode a pair of sentences independently and then pass their embeddings to a classifier, such as a neural network, to decide the relationship between them. Some other methods are based on joint models (Chen et al., 2017; Gong et al., 2017; Kim et al., 2018), which use cross-features to represent the local (i.e., word-level and phrase-level) alignments for better performance. In this paper, we follow the joint models to achieve the local matching between a query instance and the support set for a class. The difference between our task and the other sentence matching tasks mentioned above is that our goal is to match a sentence to a set of sentences, instead of to another sentence (Bowman et al., 2015) or to a sequence of sentences (Lowe et al., 2015).

Task Definition
In few-shot relation classification, we are given two datasets, $D_{\text{meta-train}}$ and $D_{\text{meta-test}}$. Each dataset consists of a set of samples $(x, p, r)$, where $x$ is a sentence composed of $T$ words and the $t$-th word is $w_t$, $p = (p_1, p_2)$ indicates the positions of the two entities, and $r$ is the relation label of the instance $(x, p)$. These two datasets have their own relation label spaces that are disjoint from each other. Under the few-shot configuration, $D_{\text{meta-test}}$ is split into two parts, $D_{\text{test-support}}$ and $D_{\text{test-query}}$. If $D_{\text{test-support}}$ contains $K$ labeled samples for each of $N$ relation classes, this target few-shot problem is named $N$-way-$K$-shot. $D_{\text{test-query}}$ contains test samples, each labeled with one of the $N$ classes. Assuming that we only have $D_{\text{test-support}}$ and $D_{\text{test-query}}$, we can train a model using $D_{\text{test-support}}$ and evaluate its performance on $D_{\text{test-query}}$. However, limited by the number of support samples (i.e., $N \times K$), it is hard to train a good model from scratch.
Although $D_{\text{meta-train}}$ and $D_{\text{meta-test}}$ have disjoint relation label spaces, $D_{\text{meta-train}}$ can also be utilized to help the few-shot relation classification on $D_{\text{meta-test}}$. One approach is the paradigm proposed by Vinyals et al. (2016), which obeys an important machine learning principle that test and training conditions must match. That is to say, we also split $D_{\text{meta-train}}$ into two parts, $D_{\text{train-support}}$ and $D_{\text{train-query}}$, and mimic the few-shot learning settings at the training stage. In each training iteration, $N$ classes are randomly selected from $D_{\text{train-support}}$, and $K$ support instances are randomly selected from each class. In this way, we construct the train-support set $S = \{s_k^i;\ i = 1, ..., N,\ k = 1, ..., K\}$, where $s_k^i$ is the $k$-th instance in class $i$. We also randomly select $R$ samples from the remaining samples of those $N$ classes and construct the train-query set of $R$ pairs $(q, l)$, where $l \in \{1, ..., N\}$ is the label of the query instance $q$.

Just like conventional prototypical networks, we expect to minimize the following objective function at training time

$$J_{match} = -\frac{1}{R} \sum_{(q, l)} \log P(l \mid S, q), \tag{1}$$

where the sum runs over the $R$ train-query pairs, and $P(l \mid S, q)$ is defined as

$$P(l \mid S, q) = \frac{\exp\big(f(\{s_k^l\}_{k=1}^{K}, q)\big)}{\sum_{i=1}^{N} \exp\big(f(\{s_k^i\}_{k=1}^{K}, q)\big)}. \tag{2}$$

The function $f(\{s_k^i\}_{k=1}^{K}, q)$ calculates the matching degree between the query instance $q$ and the set of support instances $\{s_k^i\}_{k=1}^{K}$. How to design this function is the focus of this paper.
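The episode construction and the matching objective can be illustrated with a short sketch. It rests on several assumptions that are not taken from the paper: the data layout (a dict from relation label to a list of instances), the helper names `sample_episode`, `class_posterior` and `match_loss`, and the use of the standard negative log-likelihood over a softmax of the class matching scores, as is usual for prototypical networks.

```python
import math
import random


def sample_episode(data, n_way, k_shot, n_query, rng):
    """Sample one N-way-K-shot training episode.

    data: dict mapping a relation label to its list of instances
    (a hypothetical layout chosen for this sketch).
    Returns (support, query): support maps label -> K instances, and
    query is a list of R (instance, label) pairs drawn from the
    remaining instances of the same N classes.
    """
    classes = rng.sample(sorted(data), n_way)
    support, pool = {}, []
    for label in classes:
        items = list(data[label])
        rng.shuffle(items)
        support[label] = items[:k_shot]
        pool.extend((item, label) for item in items[k_shot:])
    query = rng.sample(pool, n_query)
    return support, query


def class_posterior(scores):
    """Softmax over the matching scores f(support set of class i, q)."""
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def match_loss(posteriors, labels):
    """Mean negative log-probability of the correct class (assumed form)."""
    losses = [-math.log(p[label]) for p, label in zip(posteriors, labels)]
    return sum(losses) / len(losses)
```

In a real training loop the scores passed to `class_posterior` would come from the matching function $f$, which is what the rest of the model defines.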

Methodology
In this section, we introduce our proposed multi-level matching and aggregation network (MLMAN) for modeling $f(\{s_k^i\}_{k=1}^{K}, q)$. For simplicity, we omit the superscript $i$ of $s_k^i$ from Section 4.1 to Section 4.4. The framework of our proposed MLMAN model is shown in Fig. 1, which has four main modules.
• Context Encoder. Given a sentence and the positions of two entities within this sentence, CNNs (Zeng et al., 2014) are adopted to derive the local context representations of each word in the sentence.
• Local Matching and Aggregation. Similar to (Chen et al., 2017), given the local representation of a query instance and the local representations of K support instances, the attention method is employed to collect local matching information between them. Then, the matched local representations are aggregated to represent each instance as an embedding vector.
• Instance Matching and Aggregation. The matching information between a query instance and each of the K support instances is calculated using an MLP. Then, we take the matching degrees as weights to sum the representations of support instances in order to get the class prototype.
• Class Matching. An MLP is built to calculate the matching score between the representations of the query instance and the class prototype.
More details of these four modules will be introduced in the following subsections.

Context Encoder
For a query or support instance, each word $w_t$ in the sentence $x$ is first mapped into a $d_w$-dimensional word embedding $e_t$ (Pennington et al., 2014). In order to describe the position information of the two entities in this instance, the position features (PFs) proposed by Zeng et al. (2014) are also adopted in our work. Here, PFs describe the relative distances between the current word and the two entities, and are further mapped into two vectors $p_{1t}$ and $p_{2t}$ of $d_p$ dimensions. Finally, these three vectors are concatenated to get the word representation $w_t = [e_t; p_{1t}; p_{2t}]$ of $d_w + 2d_p$ dimensions, and the instance can be written as $W \in \mathbb{R}^{T \times (d_w + 2d_p)}$.
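As an illustration of the input representation, the following sketch computes relative position features and concatenates them with a word embedding. It makes simplifying assumptions that are not stated in the paper: each entity is represented by a single token index, offsets may optionally be clipped to a maximum distance, and the names `relative_positions` and `word_representation` are hypothetical.

```python
def relative_positions(sentence_length, entity_index, max_distance=None):
    """Relative offset of every token to one entity position.

    The optional clipping to [-max_distance, max_distance] is an
    assumption of this sketch, used to keep the lookup table finite.
    """
    offsets = [t - entity_index for t in range(sentence_length)]
    if max_distance is not None:
        offsets = [max(-max_distance, min(max_distance, o)) for o in offsets]
    return offsets


def word_representation(word_vec, pos1_vec, pos2_vec):
    """Concatenate [e_t; p_1t; p_2t] into one vector of d_w + 2*d_p values."""
    return list(word_vec) + list(pos1_vec) + list(pos2_vec)
```

Each offset would index a learnable $d_p$-dimensional embedding table, one table per entity.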
The most popular models for local context encoding are recurrent neural networks (RNNs) with long short-term memories (LSTMs) (Hochreiter and Schmidhuber, 1997) and convolutional neural networks (CNNs) (Kim, 2014). In this paper, we employ CNNs to build the context encoder. An input instance $W \in \mathbb{R}^{T \times (d_w + 2d_p)}$ is fed into a CNN with $d_c$ filters. The output from the CNN is a matrix with $T \times d_c$ dimensions. In this way, the context representations of the query instance $Q \in \mathbb{R}^{T_q \times d_c}$ and the context representations of the support instances $\{S_k \in \mathbb{R}^{T_k \times d_c};\ k = 1, ..., K\}$ are obtained, where $T_q$ and $T_k$ are the lengths of the query sentence and the $k$-th support sentence respectively.
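The shape behaviour of the context encoder, a $T \times d_{in}$ input mapped to a $T \times d_c$ output, can be sketched with a plain one-dimensional convolution. The window size of 3, the zero padding, and the absence of a non-linearity are assumptions of this sketch, and `conv1d_same` is a hypothetical helper rather than the authors' implementation.

```python
def conv1d_same(inputs, filters, bias, window=3):
    """1-D convolution over the token axis with zero padding.

    inputs:  T rows of d_in values (one row per token)
    filters: d_c rows of window * d_in weights
    bias:    d_c values
    Returns T rows of d_c values, so the sentence length is preserved.
    Assumes an odd window size.
    """
    d_in = len(inputs[0])
    half = window // 2
    pad = [[0.0] * d_in for _ in range(half)]
    padded = pad + [list(row) for row in inputs] + pad
    outputs = []
    for t in range(len(inputs)):
        # Flatten the window of token vectors centred on position t.
        patch = [value for row in padded[t:t + window] for value in row]
        outputs.append([
            sum(w * x for w, x in zip(weights, patch)) + b
            for weights, b in zip(filters, bias)
        ])
    return outputs
```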

Local Matching and Aggregation
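In order to get the matching information between $Q$ and $\{S_k;\ k = 1, ..., K\}$, we first concatenate the $K$ support instance representations into one matrix as follows

$$C = \mathrm{concat}(\{S_k\}_{k=1}^{K}), \tag{3}$$

where $C \in \mathbb{R}^{T_s \times d_c}$ with $T_s = \sum_{k=1}^{K} T_k$. Then, we collect the matching information between $Q$ and $C$ and calculate their matched representations $\tilde{Q}$ and $\tilde{C}$ as follows

$$\alpha_{mn} = q_m^{\top} c_n, \tag{4}$$

$$\tilde{q}_m = \sum_{n=1}^{T_s} \frac{\exp(\alpha_{mn})}{\sum_{n'=1}^{T_s} \exp(\alpha_{mn'})}\, c_n, \tag{5}$$

$$\tilde{c}_n = \sum_{m=1}^{T_q} \frac{\exp(\alpha_{mn})}{\sum_{m'=1}^{T_q} \exp(\alpha_{m'n})}\, q_m, \tag{6}$$

where $m \in \{1, ..., T_q\}$ in Eq. (5), $n \in \{1, ..., T_s\}$ in Eq. (6), $q_m$ and $\tilde{q}_m$ are the $m$-th rows of $Q$ and $\tilde{Q}$ respectively, and $c_n$ and $\tilde{c}_n$ are the $n$-th rows of $C$ and $\tilde{C}$ respectively. Next, the original representations and the matched representations are fused utilizing a ReLU layer as follows,

$$\bar{Q} = \mathrm{ReLU}\big([Q;\ \tilde{Q};\ |Q - \tilde{Q}|;\ Q \odot \tilde{Q}]\, W_1\big), \tag{7}$$

$$\bar{C} = \mathrm{ReLU}\big([C;\ \tilde{C};\ |C - \tilde{C}|;\ C \odot \tilde{C}]\, W_1\big), \tag{8}$$

where $\odot$ is the element-wise product and $W_1 \in \mathbb{R}^{4d_c \times d_h}$ is the weight matrix at this layer for reducing dimensionality. $\bar{C}$ is further split into $K$ representations $\{\bar{S}_k\}_{k=1}^{K}$ corresponding to the $K$ support instances, where $\bar{S}_k \in \mathbb{R}^{T_k \times d_h}$. All $\bar{S}_k$ and $\bar{Q}$ are fed into a single-layer bi-directional LSTM (BLSTM) with $d_h$ hidden units along each direction to obtain the final local matching results $\hat{S}_k \in \mathbb{R}^{T_k \times 2d_h}$ and $\hat{Q} \in \mathbb{R}^{T_q \times 2d_h}$.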
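The soft alignment between the query and the concatenated support matrix can be sketched as follows. The sketch assumes dot-product alignment scores in the style of Chen et al. (2017) and a four-way fusion (original, matched, absolute difference, element-wise product), which is inferred from the $4d_c$ input size of $W_1$; the projection by $W_1$, the ReLU and the BLSTM are left out. The names `soft_align` and `fuse_features` are hypothetical.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_align(Q, C):
    """Softly match query rows against support rows and vice versa.

    Q: T_q rows, C: T_s rows, all of the same width d_c.
    Returns (Q_tilde, C_tilde) with the same shapes as Q and C.
    """
    width = len(Q[0])
    # alpha[m][n] is the dot product of query row m and support row n.
    alpha = [[sum(a * b for a, b in zip(q, c)) for c in C] for q in Q]
    Q_tilde = []
    for m in range(len(Q)):
        weights = softmax(alpha[m])  # normalise over support positions
        Q_tilde.append([sum(w * c[d] for w, c in zip(weights, C)) for d in range(width)])
    C_tilde = []
    for n in range(len(C)):
        weights = softmax([alpha[m][n] for m in range(len(Q))])  # over query positions
        C_tilde.append([sum(w * q[d] for w, q in zip(weights, Q)) for d in range(width)])
    return Q_tilde, C_tilde


def fuse_features(X, X_tilde):
    """Build [x; x~; |x - x~|; x * x~] per row (input to the ReLU layer)."""
    fused = []
    for x, xt in zip(X, X_tilde):
        diff = [abs(a - b) for a, b in zip(x, xt)]
        prod = [a * b for a, b in zip(x, xt)]
        fused.append(list(x) + list(xt) + diff + prod)
    return fused
```

Because all $K$ support instances share one softmax in `soft_align`, support positions that match the query poorly receive little weight.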
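Local aggregation aims to convert the results of local matching into a single vector for each query and each support instance. In this paper, we employ max pooling together with average pooling, and concatenate their results into one vector $\hat{s}_k$ or $\hat{q}$. The calculations are as follows,

$$\hat{s}_k = [\max(\hat{S}_k);\ \mathrm{ave}(\hat{S}_k)], \tag{9}$$

$$\hat{q} = [\max(\hat{Q});\ \mathrm{ave}(\hat{Q})], \tag{10}$$

where $\{\hat{s}_k, \hat{q}\} \in \mathbb{R}^{4d_h}$.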
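The pooling step is simple enough to show directly. In this minimal sketch, `max_avg_pool` is a hypothetical name, and the input is a list of per-token vectors such as the BLSTM outputs.

```python
def max_avg_pool(H):
    """Concatenate max pooling and average pooling over the token axis.

    H: T rows of width D. Returns a single vector of width 2 * D.
    """
    width = len(H[0])
    max_part = [max(row[d] for row in H) for d in range(width)]
    avg_part = [sum(row[d] for row in H) / len(H) for d in range(width)]
    return max_part + avg_part
```

With BLSTM outputs of width $2d_h$, the pooled vector has $4d_h$ dimensions, which matches the size stated above.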

Instance Matching and Aggregation
Similar to conventional prototypical networks (Snell et al., 2017), our proposed method calculates the class prototype $\hat{s}$ via the representations of all support instances in this class, i.e., $\{\hat{s}_k\}_{k=1}^{K}$. However, instead of using a naive mean operation, we aggregate instance-level representations via attention over $\{\hat{s}_k\}_{k=1}^{K}$, where each weight is derived from the instance matching score between $\hat{s}_k$ and $\hat{q}$. The matching function is as follows,

$$\beta_k = v^{\top} \mathrm{ReLU}\big(W_2 [\hat{s}_k; \hat{q}]\big), \tag{11}$$

where $W_2 \in \mathbb{R}^{d_h \times 8d_h}$ and $v \in \mathbb{R}^{d_h}$. $\beta_k$ describes the instance-level matching degree between the query instance $\hat{q}$ and the support instance $\hat{s}_k$. Then, all $\{\hat{s}_k\}_{k=1}^{K}$ are aggregated into one vector $\hat{s}$ as

$$\hat{s} = \sum_{k=1}^{K} \frac{\exp(\beta_k)}{\sum_{k'=1}^{K} \exp(\beta_{k'})}\, \hat{s}_k, \tag{12}$$

and $\hat{s}$ is the class prototype.
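The query-dependent aggregation can be sketched as below. The scoring form, a one-hidden-layer MLP applied to the concatenation of a support vector and the query vector, is inferred from the stated shapes of $W_2$ and $v$, and the softmax normalization of the scores is an assumption; `mlp_score` and `attentive_prototype` are hypothetical names.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_score(a, b, W2, v):
    """Score v . ReLU(W2 [a; b]) for two vectors a and b."""
    x = list(a) + list(b)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W2]
    return sum(vi * h for vi, h in zip(v, hidden))


def attentive_prototype(support_vecs, query_vec, W2, v):
    """Weight each support vector by its match with the query, then sum."""
    betas = [mlp_score(s, query_vec, W2, v) for s in support_vecs]
    weights = softmax(betas)
    width = len(support_vecs[0])
    proto = [sum(w * s[d] for w, s in zip(weights, support_vecs)) for d in range(width)]
    return proto, weights
```

When all scores are equal, the weights are uniform and the prototype reduces to the plain mean used by prototypical networks. The same `mlp_score` function could also score the prototype against the query at the class level.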

Class Matching
After the class prototype $\hat{s}$ and the embedding vector of the query instance $\hat{q}$ have been determined, the class-level matching function $f(\{s_k\}_{k=1}^{K}, q)$ in Eq. (2) is defined as

$$f(\{s_k\}_{k=1}^{K}, q) = v^{\top} \mathrm{ReLU}\big(W_2 [\hat{s}; \hat{q}]\big). \tag{13}$$

Eqs. (11) and (13) have the same form. In our experiments, sharing the weights $W_2$ and $v$ in these two equations, i.e., employing exactly the same function for both instance-level and class-level matching in each training iteration, leads to better performance.

Joint Training with Inconsistency Measurement
If the representations of all support instances in a class are far away from each other, it could become difficult for the derived class prototype to capture the common characteristics of all support instances. Therefore, a function which measures the inconsistency among the set of support instances is designed. In order to avoid the high complexity of directly comparing every two support instances in a class, we calculate the inconsistency measurement as the average Euclidean distance between the support instances and the class prototype as

$$J_{incon} = \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \|\hat{s}_k^i - \hat{s}^i\|_2, \tag{14}$$

where $i$ is the class index and $\|\cdot\|_2$ calculates the 2-norm of a vector. By combining Eqs. (1) and (14), the final objective function for training the whole model is defined as

$$J = J_{match} + \lambda J_{incon}, \tag{15}$$

where $\lambda$ is a hyper-parameter and was set as 1 in our experiments without any tuning.
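A minimal sketch of the auxiliary term and the combined objective follows. It takes the description literally and averages the plain (unsquared) Euclidean distance between each support vector and its class prototype; whether the original implementation squares the distance is not recoverable from the text, so this is an assumption, as are the names `inconsistency` and `joint_loss`.

```python
import math


def inconsistency(support_by_class, prototypes):
    """Average distance between support vectors and their class prototype.

    support_by_class: list over classes, each a list of support vectors
    prototypes:       list over classes of prototype vectors (same order)
    """
    total, count = 0.0, 0
    for vectors, proto in zip(support_by_class, prototypes):
        for vec in vectors:
            total += math.dist(vec, proto)
            count += 1
    return total / count


def joint_loss(j_match, j_incon, lam=1.0):
    """Final objective: matching loss plus weighted inconsistency term."""
    return j_match + lam * j_incon
```

The cost is linear in the number of support instances, in contrast to comparing every pair within a class.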

Dataset and Evaluation Metrics
The few-shot relation classification dataset FewRel was adopted in our experiments. This dataset was first generated by distant supervision and then filtered by crowdsourcing to remove noisy annotations. The final FewRel dataset consists of 100 relations, each with 700 instances. The average number of tokens in each sentence is 24.99, and there are 124,577 unique tokens in total. The 100 relations are split into 64, 16 and 20 for training, validation and test respectively. Our experiments investigated four few-shot learning configurations, 5-way 1-shot, 5-way 5-shot, 10-way 1-shot, and 10-way 5-shot, which were the same as in Han et al. (2018). Following the official evaluation scripts, all results given by our experiments were the mean and standard deviation values of 10 training repetitions, and were tested using 20,000 independent samples.

Training Details and Hyperparameters
All of the hyperparameters used in our experiments are listed in Table 3. The 50-dimensional GloVe word embeddings released by Pennington et al. (2014) were adopted in the context encoder and were fixed during training. Unknown words were replaced with a special token <UNK>, whose embedding was fixed as a zero vector. A previous study (Munkhdalai and Yu, 2017) found that models trained on harder tasks may achieve better performance than models trained with the same configuration at both training and test stages. Therefore, we set N = 20 to construct the train-support sets for 5-way and 10-way tasks. [...] and $R \in \{5, 10, 15\}$ were conducted to determine their optimal values. For optimization, we employed mini-batch stochastic gradient descent (SGD) with an initial learning rate of 0.1. The learning rate was decayed to one tenth every 20,000 steps. In addition, dropout layers (Hinton et al., 2012) were inserted before the CNN and LSTM layers and the drop rate was set as 0.2.

Table 2 shows the results of different models tested on the FewRel test set. The results of the first four models, Meta Network (Munkhdalai and Yu, 2017), GNN (Garcia and Bruna, 2017), SNAIL (Mishra et al., 2018), and Prototypical Network (Snell et al., 2017), were reported by Han et al. (2018). These models were initially proposed for image classification. Han et al. (2018) replaced their image encoding module with an instance encoding module and kept the other modules unchanged. Proto-HATT (Gao et al., 2019) added a hybrid attention mechanism to prototypical networks, mainly focusing on improving the performance on few-shot relation classification with K > 1. From Table 2, we can see that our proposed MLMAN model outperforms all other models by a large margin, which shows the effectiveness of considering the interactions between the query instance and the support set at multiple levels.
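The learning-rate schedule stated in the training details (initial rate 0.1, decayed to one tenth every 20,000 steps) can be written as a one-line function. The staircase form, where the rate stays constant between decay points, is an assumption of this sketch, and `learning_rate` is a hypothetical helper.

```python
def learning_rate(step, initial=0.1, decay=0.1, every=20000):
    """Staircase decay: multiply the rate by `decay` after each `every` steps."""
    return initial * decay ** (step // every)
```

For example, under these assumptions the rate is 0.1 for steps 0 to 19,999 and drops to about 0.01 at step 20,000.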

Ablation Study
In order to evaluate the contributions of individual model components, ablation studies were conducted. Table 4 shows the performance of our model and its ablations on the development set of FewRel. Since the first 6 ablations only affected the few-shot learning tasks with K > 1, model 2 to model 7 achieved exactly the same performance as the complete model (i.e., model 1) under the 5-way 1-shot and 10-way 1-shot configurations.

Instance Matching and Aggregation
First, the attention-based instance aggregation introduced in Section 4.3 was replaced with max pooling (model 4) or average pooling (model 5). We can see that the model with instance-level attentive aggregation (model 1) outperformed the ones using max pooling (model 4) or average pooling (model 5) on 5-shot tasks. The differences were significant at the 1% level in a t-test. The advantage of attentive pooling is that the weights for integrating all support instances can be determined dynamically according to the query. For example, when conducting instance matching and aggregation between the query instance and the support set in Table 1, the weights of the 5 instances in class A were 0.03, 0.46, 0.25, 0.08 and 0.18 respectively. Instance #2 achieved the highest weight because it was the most similar to the query instance and was considered the most helpful one when matching the query instance with class A.

Then, the effectiveness of sharing the weight parameters in Eqs. (11) and (13) was evaluated by untying them (model 3). The performance of model 3 was much worse than that of the complete model (model 1) as shown in Table 4, which demonstrates the need to share the weights for calculating matching scores at both instance and class levels.

Inconsistency Measurement
As introduced in Section 4.5, $J_{incon}$ is designed to measure the inconsistency among the representations of all support instances in a class. After removing $J_{incon}$, model 2 was optimized using only the objective function $J_{match}$. We can see that it performed much worse than the complete model. Furthermore, we calculated the mean of the Euclidean distances between every support instance pair $(\hat{s}_k^i, \hat{s}_{k'}^i)$ in the same class using model 1 and model 2 respectively. For each support set, this calculation (Eq. (16)) averages $\|\hat{s}_k^i - \hat{s}_{k'}^i\|_2$ over all instance pairs with $k \neq k'$. We sampled 20,000 support sets under the 5-way 5-shot configuration and calculated the mean over them. The results were 0.0199 and 0.0346 for model 1 and model 2 respectively, which means that $J_{incon}$ was effective at forcing the representations of the support instances in the same class to be close to each other.
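The diagnostic used in this analysis, the mean Euclidean distance over all pairs of support vectors in one class, can be sketched as follows. Averaging over unordered pairs is an assumption about the normalization, and `mean_pairwise_distance` is a hypothetical name.

```python
import math
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Mean Euclidean distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

For a 5-shot support set this averages over 10 pairs; a smaller value indicates that the support representations of a class lie closer together.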
$J_{incon}$ was further removed from model 5 to obtain model 6. It can be found that the accuracy degradation from model 5 to model 6 was larger than the one from model 1 to model 2. This implies that the $J_{incon}$ objective function also benefited from the attentive aggregation over support instances.

Local Matching
First, the concatenation operation in local matching was removed from model 6 in this ablation study. That is to say, instead of concatenating the representations of all support instances $\{S_k\}_{k=1}^{K}$ into one single matrix as in Eq. (3), local matching was conducted between the query instance and each support instance separately to get their vector representations $\{(\hat{s}_k, \hat{q}_k);\ k = 1, ..., K\}$ (model 7). Note that this led to K different representations of a query instance for each support class. Then, the means over $k$ of $\hat{s}_k$ and $\hat{q}_k$ were calculated to get the representations of the support set $\hat{s}$ and the query instance $\hat{q}$. Comparing model 6 and model 7, we can see that the concatenation operation plays an important role in our model. One possible reason is that the concatenation operation can help local matching to suppress the support instances with low similarity to the query.
Second, the whole local matching module together with the concatenation and attentive aggregation operations was removed from model 6, which led to model 9. Model 9 is similar to the one proposed by Snell et al. (2017), which encodes the support and query instances independently. The difference is that model 9 was equipped with more components, including an LSTM layer, two pooling operations, and a learnable class matching function. Comparing the performance of model 6 and model 9 in Table 4, we can see that the local matching operation significantly improves the performance in few-shot relation classification. Fig. 2 shows the attention weight matrix calculated between the query instance and the support instance #2 of class A in Table 1. From this figure, we can see that the attention-based local matching is able to capture some matching relations of local contexts, such as the head entities Eva Funck and Cindy Robbins, the tail entities Gustav and Kimberly Beck, the key phrases son and daughter, the same keyword "married", and so on.

Class Matching
Figure 2: The attention weight matrix calculated between the query instance and the support instance #2 of class A in Table 1. Darker units have larger values. Each column of the matrix sums to one.

In this experiment, we compared two class matching functions, (1) Euclidean distance (ED) (Snell et al., 2017) and (2) a learnable MLP function as shown by Eq. (13). In order to exclude the influence of the instance-level attentive aggregation, these two matching functions were compared based on model 6 and model 9. After converting the MLP function in model 6 and model 9 to Euclidean distance, model 8 and model 10 were obtained. Comparing the performance of these models in Table 4, we have two findings.
(1) When local matching was adopted, the learnable MLP for class matching (model 6) outperformed the ED metric (model 8) by a large margin.
(2) After removing local matching, the learnable MLP for class matching (model 9) performed worse than the ED metric (model 10). One possible reason is that the local matching process enhances the interaction between a query instance and a support set when calculating $\hat{s}$ and $\hat{q}$. Thus, a simple Euclidean distance between them may not be able to describe their complex correlation and dependency. On the other hand, an MLP mapping is more powerful than the Euclidean distance, and can be more appropriate for class matching when local matching is also adopted.

Conclusions
In this paper, a neural network with multi-level matching and aggregation has been proposed for few-shot relation classification. First, the query and support instances are encoded interactively via local matching and aggregation. Then, the support instances in a class are further aggregated to form the class prototype, with the weights calculated by attention-based instance matching. Finally, a learnable MLP matching function is employed to calculate the class matching score between the query instance and each candidate class. Furthermore, an additional objective function is designed to improve the consistency among the vector representations of all support instances in a class. Experiments have demonstrated the effectiveness of our proposed model, which achieves state-of-the-art performance on the FewRel dataset. Studying few-shot relation classification with data generated by distant supervision and extending our MLMAN model to zero-shot learning will be the tasks of our future work.