Label-Agnostic Sequence Labeling by Copying Nearest Neighbors

Retrieve-and-edit based approaches to structured prediction, where structures associated with retrieved neighbors are edited to form new structures, have recently attracted increased interest. However, much recent work merely conditions on retrieved structures (e.g., in a sequence-to-sequence framework), rather than explicitly manipulating them. We show we can perform accurate sequence labeling by explicitly (and only) copying labels from retrieved neighbors. Moreover, because this copying is label-agnostic, we can achieve impressive performance in zero-shot sequence-labeling tasks. We additionally consider a dynamic programming approach to sequence labeling in the presence of retrieved neighbors, which allows for controlling the number of distinct (copied) segments used to form a prediction, and leads to both more interpretable and accurate predictions.


Introduction
Retrieve-and-edit style structured prediction, where a model retrieves a set of labeled nearest neighbors from the training data and conditions on them to generate the target structure, is a promising approach that has recently received renewed interest (Gu et al., 2018; Weston et al., 2018). This approach captures the intuition that while generating a highly complex structure from scratch may be difficult, editing a sufficiently similar structure or set of structures may be easier.
Recent work in this area primarily uses the nearest neighbors and their labels simply as additional context for a sequence-to-sequence style model to condition on. While effective, these models may not explicitly capture the discrete operations (like copying) that allow the neighbors to be edited into the target structure, making the behavior of the model difficult to interpret. Moreover, since many retrieve-and-edit style models condition on dataset-specific labels directly, they may not easily allow for transfer learning, and in particular may not allow a trained model to be ported to a new task with different labels.
We address these limitations in the context of sequence labeling by developing a simple label-agnostic model that explicitly models copying token-level labels from retrieved neighbors. Since the model is not a function of the labels themselves but only of a learned notion of similarity between an input and retrieved neighbor inputs, it can be effortlessly ported (zero shot) to a task with different labels, without any retraining. Such a model can also take advantage of recent advances in representation learning, such as BERT (Devlin et al., 2018), in defining this similarity.
We evaluate the proposed approach on standard sequence labeling tasks, and show it is competitive with label-dependent approaches when trained on the same data, but substantially outperforms strong baselines when it comes to zero-shot transfer applications, such as when training with coarse labels and testing with fine-grained labels.
Finally, we propose a dynamic programming based approach to sequence labeling in the presence of retrieved neighbors, which allows for trading off token-level prediction confidence with trying to minimize the number of distinct segments in the overall prediction that are taken from neighbors. We find that such an approach allows us to both increase the interpretability of our predictions as well as their accuracy.

Related Work
Nearest neighbor based structured prediction (also referred to as instance- or memory-based learning) has a long history in machine learning and NLP, with early successes dating back at least to the taggers of Daelemans (Daelemans, 1993; Daelemans et al., 1996) and the syntactic disambiguation system of Cardie (1994). Similarly motivated approaches remain popular for computer vision tasks, especially when it is impractical to learn a parametric labeling function (Shakhnarovich et al., 2006; Schroff et al., 2015). More recently, there has been renewed interest in explicitly conditioning structured predictions on retrieved neighbors, especially in the context of language generation (Gu et al., 2018; Weston et al., 2018), although much of this work uses neighbors as extra conditioning information within a sequence-to-sequence framework (Sutskever et al., 2014), rather than making discrete edits to neighbors in forming new predictions.
Retrieval-based approaches to structured prediction appear particularly compelling now given recent successes in contextualized word embeddings (McCann et al., 2017; Peters et al., 2018; Radford et al.; Devlin et al., 2018), which should allow for expressive representations of sentences and phrases, and in turn for better retrieval of neighbors for structured prediction.
Finally, we note that there is a long history of transfer-learning based approaches to sequence labeling (Ando and Zhang, 2005; Daume III, 2007; Schnabel and Schütze, 2014; Zirikly and Hagiwara, 2015; Peng and Dredze, 2016; Yang et al., 2017; Rodriguez et al., 2018, inter alia), though it is generally not zero-shot. There has, however, been recent work in zero-shot transfer for sequence labeling problems with binary token labels (Rei and Søgaard, 2018).

Nearest Neighbor Based Labeling
While nearest-neighbor style approaches are compelling for many structured prediction problems, we will limit ourselves here to sequence-labeling problems, such as part-of-speech (POS) tagging or named-entity recognition (NER), where we are given a $T$-length sequence $x = x_{1:T}$ (which we will assume to be a sentence), and we must predict a corresponding $T$-length sequence of labels $\hat{y} = \hat{y}_{1:T}$ for $x$. We will assume that for any given task there are $Z$ distinct labels, and denote $x$'s true but unknown labeling as $y = y_{1:T} \in \{1, \ldots, Z\}^T$.
Sequence labeling is particularly convenient for nearest-neighbor based approaches, since a prediction $\hat{y}$ can be formed by simply concatenating labels extracted from the label-sequences associated with neighbors. In particular, we will assume we have access to a database $D$ of $M$ sentences $x^{(m)}$ and their corresponding true label-sequences $y^{(m)}$. We will predict a labeling $\hat{y}$ for $x$ by considering each token $x_t$, selecting a labeled token $x^{(m)}_k$ from $D$, and then setting $\hat{y}_t = y^{(m)}_k$. 1
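As an illustrative sketch of this copy scheme (our own code, not the paper's implementation; the function name and inputs are ours), each input token simply takes the label of its single most similar database token:

```python
import numpy as np

def copy_labels(sim, db_labels):
    """Copy, for each of the T input tokens, the label of its most
    similar database token.

    sim: (T, N) similarity matrix between the T input tokens and the
         N label tokens pooled from all retrieved neighbors.
    db_labels: length-N list giving the label type of each database token.
    """
    nearest = sim.argmax(axis=1)          # index of most similar token
    return [db_labels[j] for j in nearest]
```

The model described below refines this hard selection into a probabilistic one that marginalizes over all matching database tokens.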

A Token-Level Model
We consider a very simple token-level model for this label-agnostic copying, where the probability that $x$'s $t$'th label $y_t$ is equal to $y^{(m)}_k$ (the $k$'th label token of sequence $x^{(m)}$) depends only on the similarity between $x_t$ and $x^{(m)}_k$, and is independent of the surrounding labels, conditioned on $x$ and $D$. 2 In particular, we define

$$p(y_t = y^{(m)}_k \mid x, D) \propto \exp\big(\tilde{x}_t^{\top} \tilde{x}^{(m)}_k\big), \tag{1}$$

where the above probability is normalized over all label tokens of all label-sequences in $D$. Above, $\tilde{x}_t$ and $\tilde{x}^{(m)}_k$ represent the contextual word embeddings of the $t$'th token in $x$ and the $k$'th token in $x^{(m)}$, respectively, as obtained by running a deep sequence model over $x$ and over $x^{(m)}$. In all experiments we use BERT (Devlin et al., 2018), a model based on the Transformer architecture (Vaswani et al., 2017), to compute these contextual word embeddings.

1 More precisely, we will set $\hat{y}_t$ to be an instance of the label type of which $y^{(m)}_k$ is a label token; this distinction between label types and tokens can make the exposition unnecessarily obscure, and so we avoid it when possible.
2 While recent sequence labeling models (Ma and Hovy, 2016; Lample et al., 2016) often model inter-label dependence with a first-order CRF (Lafferty et al., 2001), Devlin et al. (2018) have recently shown that excellent performance can be obtained by modeling labels as conditionally independent given a sufficiently expressive representation of $x$.
We fine-tune these contextual word embeddings by maximizing a latent-variable style probabilistic objective,

$$\sum_{t=1}^{T} \log \sum_{m=1}^{M} \sum_{k:\, y^{(m)}_k = y_t} p(y_t = y^{(m)}_k \mid x, D), \tag{2}$$

where we sum over all individual label tokens in $D$ that match $y_t$.
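For concreteness, a minimal sketch of this marginal objective (our own illustrative code, not the released implementation), computing the per-sentence negative log marginal likelihood from raw similarity scores:

```python
import numpy as np

def marginal_nll(scores, db_labels, gold):
    """Negative log marginal likelihood of the gold labels.

    scores: (T, N) array of similarities between the T input tokens
            and the N database label tokens; the copy distribution at
            each input position is a softmax over its row.
    db_labels: length-N list of label types for the database tokens.
    gold: length-T list of gold label types y_1, ..., y_T.
    """
    log_norm = np.logaddexp.reduce(scores, axis=1)  # per-position log Z
    nll = 0.0
    for t, y in enumerate(gold):
        match = [i for i, lab in enumerate(db_labels) if lab == y]
        # marginalize over every database token whose label matches y_t
        nll -= np.logaddexp.reduce(scores[t, match]) - log_norm[t]
    return nll
```

In practice the scores would be dot products of fine-tuned BERT embeddings, and the loss would be minimized with a gradient-based optimizer.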
At test time, we predict $\hat{y}_t$ to be the label type with maximal marginal probability. That is, we set

$$\hat{y}_t = \operatorname*{arg\,max}_{z} \sum_{m=1}^{M} \sum_{k:\, y^{(m)}_k = z} p(y_t = y^{(m)}_k \mid x, D),$$

where $z$ ranges over the label types (e.g., POS or named entity tags) present in $D$. As noted in the introduction, predicting labels in this way allows for the prediction of any label type present in the database $D$ used at test time, and so we can easily predict label types unseen at training time without any additional retraining.
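A minimal sketch of this marginal prediction rule (again our own illustrative code, with hypothetical names):

```python
import numpy as np

def predict_marginal(scores, db_labels):
    """Predict each token's label as the type with maximal marginal
    copy probability. scores is (T, N); db_labels gives the label
    type of each of the N database tokens."""
    # softmax over all database tokens, per input position
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    types = sorted(set(db_labels))
    # marginalize: sum probability mass over tokens sharing a label type
    mass = np.stack(
        [probs[:, [i for i, lab in enumerate(db_labels) if lab == z]].sum(axis=1)
         for z in types],
        axis=1)                               # (T, num_types)
    return [types[int(j)] for j in mass.argmax(axis=1)]
```

Note that a label type with many moderately similar database tokens can outscore a single highly similar token, which is exactly the marginalization the prediction rule calls for.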

Data and Methods
Our main experiments seek to determine both whether the label-agnostic copy-based approach introduced above results in competitive sequence-labeling performance on standard metrics, and whether this approach gives rise to better zero-shot transfer. Accordingly, our first set of experiments considers several standard sequence-labeling tasks and datasets, namely, POS tagging of the Penn Treebank (Marcus et al., 1993) with both the standard Penn Treebank POS tags and Universal POS tags (Petrov et al., 2012; Nivre et al., 2016), and the CoNLL 2003 NER task (Sang and De Meulder, 2003). We compare with the sequence-labeling performance of BERT (Devlin et al., 2018), which we take to be near state of the art. We use the standard dataset splits and evaluations for all tasks, and BIO encoding for all segment-level tagging tasks.

We evaluate zero-shot transfer performance by training on one dataset and evaluating on another, without any retraining. In particular, we consider three zero-shot transfer scenarios: training with Universal POS tags on the Penn Treebank and then predicting the standard, fine-grained POS tags; training on the CoNLL 2003 NER data and predicting on the fine-grained OntoNotes NER data (Hovy et al., 2006) using the setup of Strubell et al. (2017); and finally training on the CoNLL 2000 chunking data (Sang and Buchholz, 2000) and predicting on the CoNLL 2003 NER data. We again compare with a BERT baseline, where labels from the original task are deterministically mapped to the most frequent label on the new task with which they coincide. 3

Our nearest-neighbor based models were fine-tuned by retrieving the 50 nearest neighbors of each sentence in a mini-batch of size 16 or 20, and maximizing the objective (2) above. For training, nearest neighbors were determined based on cosine similarity between the averaged top-level (non-fine-tuned) BERT token embeddings of each sentence.
In order to make training more efficient, gradients were calculated only with respect to the input sentence embeddings (i.e., the $\tilde{x}_t$ in (1)) and not the embeddings $\tilde{x}^{(m)}_k$ of the tokens in $D$. At test time, 100 nearest neighbors were retrieved for each sentence to be labeled, using the fine-tuned embeddings.
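As a sketch of the retrieval step (our own simplified code; the actual setup averages per-token BERT embeddings to get one vector per sentence):

```python
import numpy as np

def retrieve_neighbors(query_emb, db_embs, k=50):
    """Return indices of the k database sentences whose averaged
    embeddings are most cosine-similar to the query sentence's.

    query_emb: (d,) averaged token embedding of the query sentence.
    db_embs:   (M, d) averaged token embeddings of database sentences.
    """
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity to each sentence
    return np.argsort(-sims)[:k].tolist()
```

At the scale of standard tagging datasets this brute-force search is feasible; larger databases would call for an approximate nearest-neighbor index.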
The baseline BERT models were fine-tuned using the publicly available huggingface BERT implementation, 4 and the "base" weights made available by the BERT authors (Devlin et al., 2018). We made word-level predictions based on the embedding of the first tokenized word-piece associated with a word (as Devlin et al. (2018) do), and ADAM (Kingma and Ba, 2014) was used to fine-tune all models. Hyperparameters were chosen using a random search over learning rate, batch size, and number of epochs. Code for replicating all models and experiments is available at https://github.com/swiseman/neighbor-tagging.

Main Results
The results of our experiments on standard sequence labeling tasks are in Table 1. We first note that all results are quite good and competitive with the state of the art. The label-agnostic model tends to underperform the standard fine-tuned BERT model only very slightly, though consistently, and is typically within several tenths of a point in performance.
The results of our zero-shot transfer experiments are in Table 2. We see that in all cases the label-agnostic model outperforms standard fine-tuned BERT, often significantly. In particular, we note that when going from universal POS tags to standard POS tags, the fine-tuned label-agnostic model manages to outperform the standard most-frequent-tag-per-word baseline, which itself obtains slightly less than 92% accuracy. The most dramatic increase in performance, of course, occurs on the chunking to NER task, where the label-agnostic model is successfully able to use chunking-based training information in copying labels, whereas the parametric fine-tuned BERT model can at best attempt to map NP-chunks to PERSON labels (the most frequent named entity in the dataset).
In order to check that the increase in performance is not due only to the BERT representations themselves, Table 2 also shows the results of nearest neighbor based prediction without fine-tuning ("NN (no FT)" in the table) on any task. In all cases, this leads to a decrease in performance.

Encouraging Contiguous Copies
Although we model token-level label copying, at test time each $\hat{y}_t$ is predicted by selecting the label type with highest marginal probability, without any attempt to ensure that the resulting sequence $\hat{y}$ resembles one or a few of the labeled neighbors $y^{(m)}$. In this section we therefore consider a decoding approach that allows for controlling the trade-off between prediction confidence and minimizing the number of distinct segments in $\hat{y}$ that represent direct (segment-level) copies from some neighbor, in the hope that having fewer distinct copied segments in our predictions might make them more interpretable or accurate. We emphasize that the following decoding approach is in fact applicable even to standard sequence labeling models (i.e., non-nearest-neighbor based models), as long as neighbors can be retrieved at test time.
To begin with a simple case, suppose we already know the true labels $y$ for a sequence $x$, and are simply interested in reconstructing $y$ by concatenating as few segments $y_{i:j}$ that appear in some $y^{(m)} \in D$ as possible. More precisely, define the set $Z_D$ to contain all the unique label type sequences appearing as a subsequence of some sequence $y^{(m)} \in D$. Then, if we are willing to tolerate some errors in reconstructing $y$, we can use a dynamic program to minimize the number of mislabelings in our now "prediction" $\hat{y}$, plus the number of distinct segments used in forming $\hat{y}$ multiplied by a constant $c$:

$$J(t) = \min_{z \in Z_D:\, |z| \le t} \; J(t - |z|) + c + \sum_{j=1}^{|z|} \mathbb{1}\big[z_j \ne y_{t-|z|+j}\big],$$

where $J(0) = 0$ is the base case and $|z|$ is the length of sequence $z$. Note that greedily selecting segments that minimize mislabelings may result in using more segments, and thus a higher $J$.
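A direct implementation of this dynamic program might look as follows (illustrative code of ours; here Z_D is given as a collection of label-type tuples):

```python
def min_reconstruction_cost(y, Z_D, c):
    """J(t): minimal (mislabelings + c * #segments) cost of covering
    the first t positions of y with segments from Z_D; J(0) = 0."""
    T = len(y)
    INF = float("inf")
    J = [0.0] + [INF] * T
    for t in range(1, T + 1):
        for z in Z_D:
            if len(z) <= t and J[t - len(z)] < INF:
                # count positions where segment z disagrees with y
                mistakes = sum(zj != y[t - len(z) + j]
                               for j, zj in enumerate(z))
                J[t] = min(J[t], J[t - len(z)] + c + mistakes)
    return J[T]
```

The runtime is O(T |Z_D|); in practice Z_D can be restricted to subsequences of the retrieved neighbors only, keeping it small.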
In the case where we do not already know $y$ but wish to predict it, we might consider a modification of the above, which tries to minimize $c$ times the number of distinct segments used in forming $\hat{y}$ plus the expected number of mislabelings:

$$\hat{J}(t) = \min_{z \in Z_D:\, |z| \le t} \; \hat{J}(t - |z|) + c + \sum_{j=1}^{|z|} \Big(1 - p(y_{t-|z|+j} = z_j \mid x, D)\Big),$$

where we have used the linearity of expectation. Note that to use such a dynamic program to predict $\hat{y}$ we only need an estimate of $p(y_{t-|z|+j} = z_j \mid x, D)$, which we can obtain as in Section 3 (or from a more conventional model).
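The soft variant, including a backtrace to recover the predicted labeling, might be sketched as follows (our own code; probs holds per-token marginal distributions as dicts, a representation we choose for clarity):

```python
def predict_with_segments(probs, Z_D, c):
    """Minimize c * (#segments) + expected mislabelings, then backtrace.

    probs: length-T list of dicts mapping label type -> marginal
           probability p(y_t = z | x, D) from the token-level model.
    Z_D:   collection of candidate label-type tuples (segments).
    """
    T = len(probs)
    INF = float("inf")
    J = [0.0] + [INF] * T
    back = [None] * (T + 1)          # best final segment ending at t
    for t in range(1, T + 1):
        for z in Z_D:
            if len(z) <= t and J[t - len(z)] < INF:
                # expected mislabelings of segment z at this position
                exp_err = sum(1.0 - probs[t - len(z) + j].get(zj, 0.0)
                              for j, zj in enumerate(z))
                cand = J[t - len(z)] + c + exp_err
                if cand < J[t]:
                    J[t], back[t] = cand, z
    t, segments = T, []
    while t > 0:                     # recover chosen segments, right to left
        segments.append(back[t])
        t -= len(back[t])
    return [lab for z in reversed(segments) for lab in z]
```

With $c = 0$ this reduces to independent per-token argmax prediction (over labels available in some segment), while larger $c$ increasingly favors labelings formed from few contiguous neighbor segments.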
In Figure 3 we plot both the F1 score and the average number of distinct segments used in predicting each $\hat{y}$ against the $c$ parameter from the dynamic program above, for the CoNLL 2003 NER development data in both the standard and zero-shot settings. First we note that we are able to obtain excellent performance with only about 1.5 distinct segments per prediction, on average; see Figure 2 for examples. Interestingly, we also find that using a higher $c$ (leading to fewer distinct segments) can in fact improve performance. Indeed, taking the best values of $c$ from Figure 3 (0.4 in the standard setting and 0.5 in the zero-shot setting), we are able to improve our test-set performance from 89.94 to 90.20 F1 in the standard setting and from 71.74 to 73.61 F1 in the zero-shot setting; see Tables 1 and 2.

Conclusion
We have proposed a simple label-agnostic sequence-labeling model, which performs nearly as well as a standard sequence labeler, but improves on zero-shot transfer tasks. We have also proposed an approach to sequence-label prediction in the presence of retrieved neighbors, which allows for discouraging the use of many distinct segments in a labeling. Future work will consider problems where more challenging forms of neighbor manipulation are necessary for prediction.