Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition

Interpretable rationales for model predictions play a critical role in practical applications. In this study, we develop models with an interpretable inference process for structured prediction. Specifically, we present a method of instance-based learning that learns similarities between spans. At inference time, each span is assigned a class label based on its similar spans in the training set, making it easy to understand how much each training instance contributes to the predictions. Through empirical analysis on named entity recognition, we demonstrate that our method enables us to build models that have high interpretability without sacrificing performance.


Introduction
Neural networks have contributed to performance improvements in structured prediction. However, the rationales underlying the model predictions are difficult for humans to understand (Lei et al., 2016). In practical applications, interpretable rationales play a critical role in driving humans' decisions and promoting human-machine cooperation (Ribeiro et al., 2016). With this motivation, we aim to build models that have high interpretability without sacrificing performance. As an approach to this challenge, we focus on instance-based learning.
Instance-based learning (Aha et al., 1991) is a machine learning method that learns similarities between instances. At inference time, the class labels of the most similar training instances are assigned to new instances. This transparent inference process provides an answer to the following question: Which points in the training set most closely resemble a test point or influenced the prediction? This is categorized as an example-based explanation (Plumb et al., 2018; Baehrens et al., 2010). Despite this preferable property, instance-based learning has recently received little attention and remains underexplored.
This study presents and investigates an instance-based learning method for span representations. A span is a unit that consists of one or more linguistically linked words. Why do we focus on spans instead of tokens? One reason is performance. Recent neural networks can induce good span feature representations and achieve high performance in structured prediction tasks, such as named entity recognition (NER) (Sohrab and Miwa, 2018; Xia et al., 2019), constituency parsing (Stern et al., 2017; Kitaev et al., 2019), semantic role labeling (SRL) (He et al., 2018; Ouchi et al., 2018) and coreference resolution (Lee et al., 2017). Another reason is interpretability. The tasks above require recognizing linguistic structure that consists of spans. Thus, directly classifying each span based on its representation is more interpretable than token-wise classification such as BIO tagging, which reconstructs each span label from the predicted token-wise BIO tags.
Our method builds a feature space where spans with the same class label are close to each other. At inference time, each span is assigned a class label based on its neighbor spans in the feature space. We can easily understand why the model assigned a label to a span by looking at its neighbors. Through quantitative and qualitative analysis on NER, we demonstrate that our instance-based method enables us to build models with high interpretability and performance. To sum up, our main contributions are as follows.
• This is the first work to investigate instance-based learning of span representations.
• Through empirical analysis on NER, we demonstrate that our instance-based method enables us to build models that have high interpretability without sacrificing performance.


Related Work

A popular approach to interpretability is to analyze and explain the predictions of black-box models after the fact (Ribeiro et al., 2016; Lundberg and Lee, 2017; Koh and Liang, 2017). In this paper, instead of looking into the black box, we build interpretable models based on instance-based learning.
Before the current neural era, instance-based learning, sometimes called memory-based learning (Daelemans and Van den Bosch, 2005), was widely used for various NLP tasks, such as part-of-speech tagging (Daelemans et al., 1996), dependency parsing (Nivre et al., 2004) and machine translation (Nagao, 1984). For NER, some instance-based models have been proposed (Tjong Kim Sang, 2002; De Meulder and Daelemans, 2003; Hendrickx and van den Bosch, 2003). Recently, despite its high interpretability, this direction has not been explored.
One exception is Wiseman and Stratos (2019), who used instance-based learning of token representations. Due to BIO tagging, their method faces one technical challenge: inconsistent label prediction. For example, an entity candidate "World Health Organization" can be assigned inconsistent labels such as "B-LOC I-ORG I-ORG," whereas the ground-truth labels are "B-ORG I-ORG I-ORG." To remedy this issue, they presented a heuristic technique for encouraging contiguous token alignment. In contrast to such token-wise prediction, we adopt span-wise prediction, which naturally avoids this issue because each span is assigned one label.
NER is generally solved as (i) sequence labeling or (ii) span classification. In the first approach, token features are induced by using neural networks and fed into a classifier, such as conditional random fields (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). One drawback of this approach is the difficulty of dealing with nested entities. By contrast, the span classification approach, adopted in this study, can straightforwardly solve nested NER (Finkel and Manning, 2009; Sohrab and Miwa, 2018; Xia et al., 2019).


Instance-Based Span Classification

NER as span classification

NER can be solved as multi-class classification, where each of the possible spans in a sentence is assigned a class label. As mentioned above, this approach can naturally avoid inconsistent label prediction and straightforwardly deal with nested entities. Because of these advantages over token-wise classification, span classification has been gaining considerable attention (Sohrab and Miwa, 2018; Xia et al., 2019).
"Franz Kafka," s = (1, 2), is assigned the person type entity label (y = PER).Note that the other non-entity spans are assigned the null label (y = NULL).For example, "a novelist," s = (4, 5), is assigned NULL.In this way, the NULL label is assigned to non-entity spans, which is the same as the O tag in the BIO tag set.
The probability that each span s is assigned a class label y is modeled by a softmax function:

\[
P(y \mid s) = \frac{\exp(\mathrm{score}(s, y))}{\sum_{y' \in \mathcal{Y}} \exp(\mathrm{score}(s, y'))} .
\]
Typically, the scoring function is the inner product between each label weight vector w_y and the span feature vector h_s:

\[
\mathrm{score}(s, y) = \mathbf{w}_y \cdot \mathbf{h}_s .
\]

The score for the NULL label is set to a constant, score(s, y = NULL) = 0, similar to logistic regression (He et al., 2018). For training, we minimize the negative log-likelihood:

\[
\mathcal{L} = - \sum_{(s, y) \in S(X, Y)} \log P(y \mid s) ,
\]

where S(X, Y) is the set of pairs of a span s and its ground-truth label y. We call models of this kind, which use label weight vectors for classification, classifier-based span models.
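As a reference point, the following is a minimal PyTorch sketch of this classifier-based scoring and loss, assuming precomputed span vectors; all names are illustrative. The NULL label is placed at index 0 so that its score can be fixed to zero.

```python
# A minimal sketch of the classifier-based span model's scoring and loss.
import torch
import torch.nn.functional as F

def classifier_loss(span_reps, gold_labels, label_weights):
    """span_reps: (N, d) span vectors h_s; label_weights: (|Y|-1, d) vectors
    w_y for the non-NULL labels; gold_labels: (N,) label ids with NULL = 0."""
    scores = span_reps @ label_weights.T                  # score(s, y) = w_y . h_s
    null_scores = span_reps.new_zeros(len(span_reps), 1)  # score(s, NULL) = 0
    logits = torch.cat([null_scores, scores], dim=1)      # NULL at index 0
    return F.cross_entropy(logits, gold_labels)           # softmax NLL
```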

Figure 1: Illustration of our instance-based span model. An entity candidate "Franz Kafka" is used as a query and vectorized by an encoder. In the vector space, similarities between all pairs of the candidate (s) and the training instances (s_1, s_2, ..., s_9) are computed. Based on the similarities, the label probability distribution is computed, and the label with the highest probability, PER, is assigned to "Franz Kafka."

Instance-based span model
Our instance-based span model classifies each span based on similarities between spans. In Figure 1, an entity candidate "Franz Kafka" and the spans in the training set are mapped onto the feature vector space, and the label distribution is computed from the similarities between them. In this inference process, it is easy to understand how much each training instance contributes to the predictions. This property allows us to explain the predictions by specific training instances, which is categorized as an example-based explanation (Plumb et al., 2018).
Formally, within the neighbourhood component analysis framework (Goldberger et al., 2005), we define the neighbor span probability that each span s_i ∈ S(X) will select another span s_j as its neighbor from the candidate spans in the training set:

\[
P(s_j \mid s_i, \mathcal{D}') = \frac{\exp(\mathrm{score}(s_i, s_j))}{\sum_{s_k \in S(\mathcal{D}')} \exp(\mathrm{score}(s_i, s_k))} . \quad (1)
\]

Here, we exclude the input sentence X and its ground-truth labels Y from the training set D, i.e., D' = D \ {(X, Y)}, and regard all spans in D' as candidates S(D'). The scoring function returns a similarity between the spans s_i and s_j. We then compute the probability that a span s_i will be assigned a label y_i:

\[
P(y_i \mid s_i) = \sum_{s_j \in S(\mathcal{D}', y_i)} P(s_j \mid s_i, \mathcal{D}') .
\]

Here, S(D', y_i) = {s_j ∈ S(D') | y_i = y_j}, so the equation sums the probabilities of the neighbor spans that have the same label as the span s_i. The loss function we minimize is the negative log-likelihood:

\[
\mathcal{L} = - \sum_{(s_i, y_i) \in S(X, Y)} \log P(y_i \mid s_i) ,
\]

where S(X, Y) is the set of pairs of a span s_i and its ground-truth label y_i. At inference time, we predict ŷ_i as the class label with maximal marginal probability:

\[
\hat{y}_i = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid s_i) ,
\]

where the probability P(y | s_i) is computed for each label y ∈ Y.
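The following is a hedged PyTorch sketch of Equation 1 and the marginal label probability under the definitions above, assuming precomputed span vectors for the input sentence (queries) and for the candidate spans in D' (memory); excluding the input sentence from D' is assumed to happen before these functions are called, and all names are illustrative.

```python
# A sketch of the instance-based span model's loss and inference.
import torch
import torch.nn.functional as F

def instance_based_loss(query_reps, query_labels, memory_reps, memory_labels):
    """query_reps: (N, d) spans s_i from the input sentence; memory_reps:
    (M, d) candidate spans s_j in D'; *_labels: (N,)/(M,) integer label ids."""
    sims = query_reps @ memory_reps.T             # score(s_i, s_j)
    neighbor_probs = torch.softmax(sims, dim=1)   # P(s_j | s_i, D'), Eq. (1)
    same_label = (query_labels[:, None] == memory_labels[None, :]).float()
    label_probs = (neighbor_probs * same_label).sum(dim=1)  # P(y_i | s_i)
    return -torch.log(label_probs + 1e-12).mean()            # NLL

def predict(query_reps, memory_reps, memory_labels, num_labels):
    sims = query_reps @ memory_reps.T
    neighbor_probs = torch.softmax(sims, dim=1)
    one_hot = F.one_hot(memory_labels, num_labels).float()
    label_probs = neighbor_probs @ one_hot        # P(y | s_i) for every y
    return label_probs.argmax(dim=1)              # label with max marginal
```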

Efficient neighbor probability computation
The neighbor span probability P(s_j | s_i, D') in Equation 1 depends on the entire training set D', which leads to heavy computational cost. As a remedy, we retrieve K sentences from the training set D' by random sampling: at training time, we randomly sample K sentences for each mini-batch at each epoch. This simple technique realizes time- and memory-efficient training; in our experiments, it takes less than one day to train a model on a single GPU.


Experiments

Data We evaluate the span models on two types of NER: (i) flat NER on the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003) and (ii) nested NER on the GENIA dataset (Kim et al., 2003). We follow the standard training-development-test splits.
Baseline We use the classifier-based span model described above as a baseline. The only difference between the instance-based and classifier-based span models is whether the softmax classifier with label weight vectors is used.
Encoder and span representation We adopt the encoder architecture proposed by Ma and Hovy (2016), which encodes each token of the input sentence w_t ∈ X with a word embedding and a character-level CNN. The encoded token representations w_{1:T} = (w_1, w_2, ..., w_T) are fed to a bidirectional LSTM to compute contextual representations →h_{1:T} and ←h_{1:T}. From them, we create h^lstm_s for each span s = (a, b) based on LSTM-minus (Wang and Chang, 2016). For flat NER, we use the subtraction features:

\[
\mathbf{h}^{\mathrm{lstm}}_s = [\overrightarrow{\mathbf{h}}_b - \overrightarrow{\mathbf{h}}_{a-1}; \overleftarrow{\mathbf{h}}_a - \overleftarrow{\mathbf{h}}_{b+1}] .
\]

For nested NER, we additionally concatenate the addition features, →h_a + →h_b and ←h_a + ←h_b, which improved performance in our preliminary experiments. We then multiply h^lstm_s by a weight matrix W to obtain the span representation: h_s = W h^lstm_s. For the scoring function in Equation 1 of the instance-based span model, we use the inner product between a pair of span representations: score(s_i, s_j) = h_{s_i} · h_{s_j}.
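As an illustration, here is a small sketch of the LSTM-minus span feature for flat NER. It assumes the BiLSTM states are zero-padded at positions 0 and T+1 so that the boundary subtractions are well defined; this indexing convention is our assumption, not a detail from the paper.

```python
# A sketch of the LSTM-minus span representation for flat NER.
import torch

def span_rep_flat(h_fwd, h_bwd, a, b, W):
    """h_fwd/h_bwd: (T+2, d) forward/backward states at positions 0..T+1,
    where positions 0 and T+1 are zero padding; span s = (a, b) is 1-based;
    W: (d_out, 2*d) projection matrix."""
    subtraction = torch.cat([h_fwd[b] - h_fwd[a - 1],   # ->h_b - ->h_{a-1}
                             h_bwd[a] - h_bwd[b + 1]])  # <-h_a - <-h_{b+1}
    return W @ subtraction                              # h_s = W h^lstm_s
```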
Model configuration We train the instance-based models by using K = 50 training sentences randomly retrieved for each mini-batch. At test time, we use the K = 50 nearest training sentences for each sentence, based on the cosine similarities between their sentence vectors, where each sentence vector is defined as the average of the word embeddings (GloVe) within the sentence. For the word embeddings, we use the GloVe 100-dimensional embeddings (Pennington et al., 2014) and the BERT embeddings (Devlin et al., 2019). For GENIA, we use the version pre-processed by Zheng et al. (2019), available at https://github.com/thecharm/boundary-aware-nested-ner
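A minimal NumPy sketch of this test-time retrieval, under the assumption stated above that a sentence vector is the average of the GloVe embeddings of its words; the function name is illustrative.

```python
# Retrieve the K nearest training sentences by cosine similarity
# between averaged word-embedding sentence vectors.
import numpy as np

def nearest_sentences(query_emb, train_embs, k=50):
    """query_emb: (T, d) word embeddings of one sentence;
    train_embs: list of (T_i, d) arrays, one per training sentence."""
    q = query_emb.mean(axis=0)                            # sentence vector
    vecs = np.stack([e.mean(axis=0) for e in train_embs])
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k]  # indices of the K most similar sentences
```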

Quantitative analysis
We report F1 scores averaged across five runs of model training with different random seeds.
Overall F1 scores We investigate whether our instance-based span model can achieve performance competitive with the classifier-based span model.

Qualitative analysis
To better understand model behavior, we analyze the instance-based model using GloVe in detail.

Examples of retrieved spans
The span feature space learned by our method can be applied to various downstream tasks. In particular, it can be used as a span retrieval system. Table 2 shows the five nearest neighbor spans of the entity candidate "Tom Moody." With the classifier-based span model, person-related but non-entity spans were retrieved. By contrast, with the instance-based span model, person (PER) entities were consistently retrieved. This tendency was observed in many other cases, and we confirmed that our method can build preferable feature spaces for applications.

Error analysis
The instance-based span model tends to wrongly label spans that include location or organization names. For example, in Table 3, the wrong label LOC (Location) is assigned to "Air France," whose gold label is ORG (Organization). Note that by looking at the neighbors, we can see that country and district entities confused the model. This suggests that prediction errors are easier to analyze because the neighbors serve as rationales for the predictions.

Discussion
Generalizability Are our findings on NER generalizable to other tasks? To investigate this, we perform an additional experiment on the CoNLL-2000 dataset (Tjong Kim Sang and Buchholz, 2000) for syntactic chunking. While this task is similar to NER in terms of short-span classification, the class labels are based on syntax, not (entity) semantics.
In Table 4, the instance-based span model achieves F1 scores competitive with the classifier-based one, which is consistent with the NER results. This suggests that our findings on NER are likely to generalize to other short-span classification tasks.
Future work One interesting line of future work is extending our method to span-to-span relation classification, such as SRL and coreference resolution. Another potential direction is to apply and evaluate learned span features on downstream tasks requiring entity knowledge, such as entity linking and question answering.

Conclusion
We presented and investigated an instance-based learning method that learns similarities between spans. Through NER experiments, we demonstrated that the models built by our method have (i) competitive performance with a classifier-based span model and (ii) an interpretable inference process in which it is easy to understand how much each training instance contributes to the predictions.

Appendices
Network setup Basically, we follow the encoder architecture proposed by Ma and Hovy (2016). First, the token-encoding layer encodes each token of the input sentence w_t ∈ (w_1, w_2, ..., w_T) into a sequence of vector representations w_{1:T} = (w_1, w_2, ..., w_T). For the models using GloVe, we use the GloVe 100-dimensional embeddings (Pennington et al., 2014) and a character-level CNN. For the models using BERT, we use BERT-Base, Cased (Devlin et al., 2019), taking the first subword embedding within each token from the last layer of BERT. During training, we fix the word embeddings (except the CNN). The encoded token representations w_{1:T} = (w_1, w_2, ..., w_T) are then fed to a bidirectional LSTM (BiLSTM) (Graves et al., 2013) to compute contextual representations →h_{1:T} and ←h_{1:T}. We use 2 layers of stacked BiLSTMs (2 forward and 2 backward LSTMs) with 100-dimensional hidden units. From →h_{1:T} and ←h_{1:T}, we create h^lstm_s for each span s = (a, b) based on LSTM-minus (Wang and Chang, 2016), as described in the encoder setup above. We then multiply h^lstm_s by a weight matrix W to obtain the span representation h_s = W h^lstm_s, which each model uses to compute the label distribution. For efficient computation, following Sohrab and Miwa (2018), we enumerate all possible spans in a sentence with sizes less than or equal to the maximum span size L, i.e., each span s = (a, b) satisfies b − a < L. We set L to 6.
Optimization To optimize the parameters, we use Adam (Kingma and Ba, 2014) with β_1 = 0.9 and β_2 = 0.999. The initial learning rate is set to η_0 = 0.001. The learning rate is updated at each epoch as η_t = η_0 / (1 + ρt), where the decay rate is ρ = 0.05 and t is the number of epochs completed.
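For concreteness, a sketch of this schedule in PyTorch, with a stand-in model so it runs as written; η_t = η_0 / (1 + ρt) is applied once per epoch via LambdaLR.

```python
# Adam with a per-epoch inverse-time learning rate decay.
import torch

model = torch.nn.Linear(100, 5)  # stand-in for the span model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# eta_t = eta_0 / (1 + rho * t), with rho = 0.05 and t = epochs completed
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: 1.0 / (1.0 + 0.05 * t))

for epoch in range(100):
    # ... one training epoch over mini-batches ...
    scheduler.step()  # update the learning rate once per completed epoch
```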
The gradient clipping value is set to 5.0 (Pascanu et al., 2013). Parameter updates are performed in mini-batches of 8. The number of training epochs is set to 100. We save the parameters that achieve the best F1 score on each development set and evaluate them on each test set. Training the models takes less than one day on a single GPU (NVIDIA DGX-1 with Tesla V100).

Feature space visualization To better understand the span representations learned by our method, we observe the feature space. Specifically, we visualize the span representations h_s on the CoNLL-2003 development set. Figure 3 shows two-dimensional entity span representations computed by t-distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008). Both models successfully learned feature spaces where instances with the same label come close to each other.
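A sketch of such a visualization with scikit-learn's t-SNE, using random stand-ins for the span features h_s and their labels:

```python
# Project span features to 2-D with t-SNE and plot them colored by label.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

h = np.random.randn(500, 256)          # stand-in for span features h_s
y = np.random.randint(0, 4, size=500)  # stand-in entity labels

coords = TSNE(n_components=2).fit_transform(h)
plt.scatter(coords[:, 0], coords[:, 1], c=y, s=5)
plt.savefig("span_features_tsne.png")
```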


Figure 2: Performance on the CoNLL-2003 development set for different amounts of the training set.

Figure 3: Visualization of entity span features computed by the classifier-based and instance-based models.

Table 1: Comparison between classifier-based and instance-based span models. Cells show F1 scores and standard deviations on each test set.

Table 2: Example of span retrieval. The entity candidate "Tom Moody" in the CoNLL-2003 development set is used as a query for retrieving the five nearest neighbors from the training set.

Table 3: Example of an error by the instance-based span model. Although the gold label is ORG (Organization), the wrong label LOC (Location) is assigned.

Table 4: Comparison on syntactic chunking. Cells show F1 scores and standard deviations on the CoNLL-2000 test set.

Table 5: Hyperparameters used in the experiments.