CopyNext: Explicit Span Copying and Alignment in Sequence to Sequence Models

Copy mechanisms are employed in sequence to sequence (seq2seq) models to generate reproductions of words from the input to the output. These frameworks, operating at the lexical type level, fail to provide an explicit alignment that records where each token was copied from. Further, they require contiguous token sequences from the input (spans) to be copied token by token. We present a model with an explicit token-level copy operation and extend it to copying entire spans. Our model provides hard alignments between spans in the input and output, allowing for nontraditional applications of seq2seq, like information extraction. We demonstrate the approach on Nested Named Entity Recognition, achieving near state-of-the-art accuracy with an order of magnitude increase in decoding speed.


Introduction
Sequence transduction converts a sequence of input tokens to a sequence of output tokens. It is a dominant framework for generation tasks, such as machine translation, dialogue, and summarization. Seq2seq can also be used for Information Extraction (IE), where the target structure is decoded as a linear output based on an encoded (linear) representation of the input.
As IE is traditionally considered a structured prediction task, IE systems are still assumed to produce an annotation on the input text: that is, to predict which specific tokens of an input string led to, e.g., the label PERSON. This is in contrast to text generation, which rarely, if ever, needs hard alignments between the input and the desired output. Our work explores a novel extension to seq2seq that provides such alignments. Specifically, we extend pointer (or copy) networks. Unlike the algorithmic tasks originally targeted by Vinyals et al. (2015), tasks in NLP tend to copy spans from the input rather than discontiguous tokens. This is prevalent when copying named entities in dialogue (Gu et al., 2016; Eric and Manning, 2017), entire sentences in summarization (See et al., 2017; Song et al., 2018), or even single words (if subtokenized). The need to efficiently copy spans motivates our introduction of an inductive bias toward copying contiguous tokens. Like a pointer network, our model copies the first token of a span. For subsequent timesteps, however, our model generates a "CopyNext" symbol (CN) instead of copying another token from the source. CopyNext represents the operation of copying the word following the last predicted word from the input sequence.

We apply our model to the Nested Named Entity Recognition (NNER) task (Ringland et al., 2019). Unlike traditional named entity recognition, named entity mentions in NNER may be subsequences of other named entity mentions (such as [[last] year] in Figure 1). We find that both explicit copying and CopyNext lead to a system faster than prior work and better than a simple seq2seq baseline. It is, however, outperformed by a much slower model that performs an exhaustive search over the space of potential labels, a solution that does not scale to large, complex label sets.

Related Work
Pointer networks (Vinyals et al., 2015; Jia and Liang, 2016; Merity et al., 2016) are seq2seq models that employ a soft attention distribution (Bahdanau et al., 2014) to produce an output sequence consisting of values from the input sequence. Pointer-generator networks (Miao and Blunsom, 2016; Gulcehre et al., 2016, inter alia) extend the range of output types by combining the distribution from the pointer with a vocabulary distribution from a generator. These models thus operate at the type level. In contrast, our model operates at the token level: instead of a soft attention distribution over the encoder states, we use hard attention, so that a single encoder state, i.e., a single token, is fed to the decoder. This enables explicit copying of span offsets.
Closest to our work, Zhou et al. (2018) and Panthaplackel et al. (2020) have tackled span copying by extending pointer-generator networks to predict both start and end indices of entire spans to be copied. Using those offsets, they perform a forced decoding of the predicted tokens within the span. These works focus on text generation tasks, such as sentence summarization, question generation, and editing. In contrast, we are concerned with information extraction tasks as transduction, where hard alignments to the input sentence are crucial and output sequences must represent a valid linearized structure. Specifically, we study nested named entity recognition (NNER).

Model Description
We formulate the task as transforming the input sentence X into a linearized sequence Y that represents the gold structure: labeled spans. Specifically, Y contains input word indices, CopyNext symbols, and labels from a label set L.
As described earlier, the model (Figure 2) is reminiscent of pointer networks. We extend their capabilities by introducing the notion of a CopyNext operation, where the network predicts that the word immediately following its previous prediction should be copied. A toy rendering of the resulting output format is sketched below.
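As a toy illustration of this output format (names like "CN" and "EOS", the token indices, and the example labels are our own assumptions, not the authors' released code), the following emits the linearized target for labeled spans already placed in output order; the ordering itself is discussed under Experiments and Results:

# Sketch of the target linearization; symbol names are our own.
def linearize(spans):
    """spans: (start, end, label) triples with inclusive token indices,
    already in the desired output order."""
    target = []
    for start, end, label in spans:
        target.append(start)                    # point at the span's first token
        target.extend(["CN"] * (end - start))   # one CopyNext per following token
        target.append(label)                    # close and classify the span
    target.append("EOS")
    return target

# Nested spans [[last] year] as in Figure 1 (indices and labels assumed):
print(linearize([(3, 4, "DATE"), (3, 3, "DATE")]))
# -> [3, 'CN', 'DATE', 3, 'DATE', 'EOS']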

Encoder
Embedding Layer This layer embeds a sequence of tokens X = x_1, x_2, ..., x_N into a sequence of vectors using (possibly contextualized) word embeddings. The gold labels are adjusted to account for tokenization.
Architecture The input embeddings are further encoded by a stacked bidirectional LSTM (Hochreiter and Schmidhuber, 1997) into encoder states e = e_1, e_2, ..., e_N, where each state is a concatenation of the forward (→f) and backward (←f) outputs of the LSTM, with e_i ∈ R^D:

e_i^(j) = [→f_i^(j); ←f_i^(j)]

where e_i^(j) is the j-th layer encoder hidden state at timestep i and D is the hidden size of the LSTM. We write e_i for the last-layer state e_i^(-1).
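A minimal PyTorch sketch of such an encoder (an assumed instantiation, not the authors' released code; the layer count and dimensions are illustrative):

import torch.nn as nn

class Encoder(nn.Module):
    """Stacked BiLSTM; each e_i concatenates the last layer's forward and
    backward states, giving e_i in R^D for an even hidden size D."""
    def __init__(self, emb_dim, D, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, D // 2, num_layers=num_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, x):        # x: (batch, N, emb_dim) word embeddings
        e, _ = self.lstm(x)      # e: (batch, N, D) = [forward; backward]
        return e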

Decoder
The target for the transducer is the linearized representation of the nested named entity spans and labels. We generate a decision y_t that points to one of: (a) a timestep in the encoder sequence, marking the starting index of a span; (b) the CopyNext symbol, which advances the right boundary of the span to include the next (sub)word of the input sequence; or (c) a label l ∈ L, signifying both the end of the span and its classification.

Input Embeddings We learn D-dimensional embeddings for each label l ∈ L. For start-index and CopyNext decisions, the vector fed to the decoder is the encoder output e_i, where i is the start index or the index selected by CopyNext.

Architecture The decoder is a stacked LSTM taking as input either an encoder state e_i or a label embedding, and producing the decoder state d_t.

Decision Vector We predict scores for making a labeling decision, a CopyNext operation, or pointing to a token in the input. At each decoding step t, for labels, we train a linear layer W_L ∈ R^(D×|L|) with input d_t and output scores l_t. Likewise, for the CopyNext symbol we use a linear layer W_C ∈ R^(D×1) with input d_t and output score c_t. The score of pointing to index i in the input sequence is the dot product s_t^i = e_i · d_t. The decision distribution is then

y_t = softmax([s_t^1, ..., s_t^N; c_t; l_t]).
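A sketch of the decision scores in PyTorch, derived from the definitions above (the tensor shapes and the concatenation order are our assumptions):

import torch
import torch.nn.functional as F

def decision_logits(d_t, e, W_C, W_L):
    """d_t: (D,) decoder state; e: (N, D) encoder states;
    W_C: (D, 1) CopyNext layer; W_L: (D, |L|) label layer.
    Returns logits over the N + 1 + |L| possible decisions."""
    s_t = e @ d_t                 # (N,)   pointer scores: s_t^i = e_i . d_t
    c_t = d_t @ W_C               # (1,)   CopyNext score
    l_t = d_t @ W_L               # (|L|,) label scores
    return torch.cat([s_t, c_t, l_t])

# Training: cross-entropy of the softmaxed logits against the gold index k.
# loss = F.cross_entropy(decision_logits(d_t, e, W_C, W_L).unsqueeze(0),
#                        torch.tensor([k]))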

Training and Prediction
Our training objective is the cross-entropy loss

L = −∑_t log y_t^(k_t),

where k_t ∈ [0, N + |L| + 1) is the index of the gold decision at step t, ranging over all three kinds of possible decisions: the N input positions, the CopyNext symbol, and the |L| labels.
Figure 3: State machine of a well-formed predicted sequence, used for masking the decision vector at inference.

At prediction time we take the decision with the greatest probability, y_t = arg max_i (y_t^i), at decoder step t. The input to the decoder at timestep t + 1 is one of three things: (1) the encoder output e_i, when y_t points to index i of the input sequence; (2) the embedding of the label l, when y_t predicts the label l ∈ L; or (3) the encoder output e_{i+1}, where i indexes the input to the decoder at step t, when y_t predicts the CopyNext operation. The decoder halts when the EOS label is predicted or the maximum output sequence length is reached.
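The input-selection rule can be sketched as follows (the index convention, with decisions 0..N-1 pointing at tokens, N the CopyNext symbol, and larger indices labels, is our assumption):

def next_decoder_input(y_t, i_prev, e, label_emb, N):
    """Choose the decoder input for step t+1 from the step-t decision y_t.
    e: (N, D) encoder states; label_emb: (|L|, D) label embedding matrix;
    i_prev: input-token index last fed to the decoder."""
    if y_t < N:                                  # (1) pointed at token i
        return e[y_t], y_t
    if y_t == N:                                 # (3) CopyNext: advance one token
        return e[i_prev + 1], i_prev + 1
    return label_emb[y_t - (N + 1)], i_prev      # (2) label: feed its embedding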
To ensure well-formed target output sequences, we use a state machine (Figure 3) to mask entries of y_t that would lead to an illegal sequence at t + 1.
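One plausible reading of the Figure 3 transitions, together with the masking step (the exact transition table is our reconstruction of the figure, not a published specification):

def legal_next(prev_kind):
    """prev_kind: None at step 0, else 'ptr', 'cn', or 'label'.
    Returns the decision kinds allowed at the next step."""
    if prev_kind in (None, "label"):     # no open span:
        return {"ptr", "eos"}            #   open a new one, or stop
    return {"cn", "label"}               # open span: extend it, or close + label it

def mask_logits(logits, legal_positions):
    """logits: torch.Tensor of decision scores; legal_positions: BoolTensor.
    Sets scores of illegal decisions to -inf before the softmax."""
    masked = logits.clone()
    masked[~legal_positions] = float("-inf")
    return masked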

Experiments and Results
Our experiments analyze the effects of various choices in different components of the system. We use the NNE dataset and splits from Ringland et al. (2019), resulting in 43,457, 1,989, and 3,762 sentences in the training, development, and test splits. Experiments for model development and analysis use the development set.

Contextualized Embeddings We first select the embedding used for the rest of the experiments. The rationale is that the embedding will provide an orthogonal boost in accuracy with respect to the other changes in the network structure. We find in Table 1 that RoBERTa large (Liu et al., 2019) is best.

Linearization Strategy Previous work (Zhang et al., 2019) has shown that the linearization scheme affects model performance. We experiment with several variants of sorting spans in ascending order by start index. We also try sorting by end index and copying the previous token instead. We find that sorting by end index performs poorly, while all variants that sort by start index perform similarly. Our final linearization strategy sorts by start index, then by span length (longer spans first); remaining ties (in span label) are broken randomly.
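The final ordering can be written as a single sort key (our rendering of the strategy just described):

import random

def order_spans(spans):
    """spans: (start, end, label) triples. Sort by start index, then longer
    spans first; remaining label ties fall back to the random pre-shuffle."""
    spans = random.sample(spans, len(spans))  # random tie-break: sorted() is stable
    return sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))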
RoBERTa Embedding Layer Recent work suggests that NER information may be stored in the lower layers of an encoder (Hewitt and Manning, 2019; Tenney et al., 2019). We find that using the 15th layer of RoBERTa, rather than the final (24th) layer, is slightly helpful (see Appendix A.1).
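With the HuggingFace transformers library, extracting an intermediate layer looks like the following sketch (whether the paper counts the embedding output as layer 0 is our assumption):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large", output_hidden_states=True)

enc = tok("Bob Smith visited last year", return_tensors="pt")
with torch.no_grad():
    # hidden_states: tuple of 25 tensors (embedding output + 24 layers)
    hidden = model(**enc).hidden_states
x = hidden[15]   # (1, seq_len, 1024): the 15th transformer layer's states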

NNER Results
In Table 2, we evaluate our best-performing (dev.) model on the test set. We compare our approach against the previous best approaches reported in Ringland et al. (2019): the hypergraph-based (Hypergraph; Wang and Lu, 2018) and transition-based (Transition; Wang et al., 2018) models proposed for recognizing nested mentions. We also contrast the CopyNext model against a baseline seq2seq model and one with only a hard copy operation (see (a) and (c) in Figure 1). Prior work (Wang and Lu, 2018) has analyzed the run-time of their approach. Prompted by their concern about asymptotic speed, we also report the practical speed and accuracy of the systems. We find that Hypergraph outperforms the CopyNext model by 4.7 F1, with most of the difference in recall. This is likely due to the exhaustive search used by Hypergraph; our model is 16.7 times faster. An analysis of their code and algorithm reveals that their lower-bound time complexity of Ω(mn) is higher than our Ω(n), where n is the length of the input sequence and m is the number of mention types. Since the average decoder length is low, the best case often occurs in practice. The Transition system predicts 6.3 times faster than Hypergraph, but at the cost of a 17.8 point absolute drop in F1. Our model is substantially faster than both. Furthermore, we show that both the explicit Copy and the CopyNext operations are useful, yielding gains of 8.1 F1 and 11.3 F1, respectively, over a seq2seq baseline.

NNER Error Analysis
The errors made by the model on the development set cluster broadly into four types: (1) the span is detected correctly but mislabeled; (2) the label is correct but the span is not (a subset or superset of the correct span); (3) both span and label are incorrect; and (4) the span is missed entirely.

Conclusion and Future Work
We propose adapting pointer and copy networks with hard attention and extending these models with a CopyNext operation, enabling sequential copying of spans given just the start index of the span. On NNER, a task traditionally treated as structured prediction, our sequence transduction model with the CopyNext operation is competitive while providing a 16.7x speedup relative to the current state of the art (which performs an exhaustive search), at a cost of 4.7 F1, largely due to lower recall.
Our model is a step forward in structured prediction as sequence transduction. In initial experiments on event extraction, we have found relative improvements similar to those reported here; future work will investigate applications to richer transductive semantic parsing models (Zhang et al., 2019; Cai and Lam, 2019).

Figure 1 :
Figure 1: Sequence transduction outputs for nested named entities in an example sentence using: (a) seq2seq, (b) pointer network, (c) Copy-only, and (d) CopyNext models. The numbers are predicted indices of tokens in the input sequence. CN refers to the CopyNext symbol, our proposed method of denoting the operation that copies the next token from the input. In (d), the next token after token 4 would be token 5.
Figure 1 highlights the differences between the output sequences of several transductive models, including our CopyNext model.

Figure 2 :
Figure 2: For each decoder timestep, a decision vector chooses between labeling, a CopyNext operation, or pointing to an input token. The decoder input comes from either an encoder state or a label embedding.

Figure 4 :
Figure 4: Performance in terms of accuracy (%F1) and speed (relative to Hypergraph). The CopyNext model is nearly as accurate as Hypergraph while over 16 times faster.

Table 2 :
NNER accuracy and speed on the test set for external baselines and our models. *Seq2seq is based on a reference implementation chosen to ensure correctness, not efficiency: it has the same asymptotics as the Copy and CopyNext models and can be considered similar in speed.
Table 7 in Appendix A.1 provides examples.