Incremental Processing in the Age of Non-Incremental Encoders: An Empirical Assessment of Bidirectional Models for Incremental NLU

While humans process language incrementally, the best language encoders currently used in NLP do not. Both bidirectional LSTMs and Transformers assume that the sequence to be encoded is available in full, to be processed either forwards and backwards (BiLSTMs) or as a whole (Transformers). We investigate how they behave under incremental interfaces, when partial output must be provided based on the partial input seen up to a certain time step, which may happen in interactive systems. We test five models on various NLU datasets and compare their performance using three incremental evaluation metrics. The results support the possibility of using bidirectional encoders in incremental mode while retaining most of their non-incremental quality. The "omni-directional" BERT model, which achieves better non-incremental performance, is impacted more by incremental access. This can be alleviated by adapting the training regime (truncated training) or the testing procedure, either by delaying the output until some right context is available or by incorporating hypothetical right contexts generated by a language model such as GPT-2.


Introduction
In "The Story of Your Life", a science fiction short story by Ted Chiang (2002), Earth is visited by alien creatures whose writing system does not unfold in time but rather presents full thoughts instantaneously. In our world, however, language does unfold over time, both in speaking and in writing. There is ample evidence (Marslen-Wilson, 1975; Tanenhaus and Brown-Schmidt, 2008, inter alia) that it is also processed over time by humans, in an incremental fashion where the interpretation of a full utterance is continuously built up while the utterance is being perceived.
In Computational Linguistics and Natural Language Processing, this property is typically abstracted away by assuming that the unit to be processed (e.g., a sentence) is available as a whole.¹ The return and subsequent mainstreaming of Recurrent Neural Networks (RNNs), originally introduced by Elman (1990) and repopularized i.a. by Mikolov et al. (2010), may have made it seem that time had found a place as a first-class citizen in NLP. However, it was quickly discovered that certain technical issues of this type of model could be overcome, for example in the application of machine translation, by encoding input sequences in reverse temporal order (Sutskever et al., 2014).
This turns out to be a special case of the more general strategy of bidirectional processing, proposed earlier in the form of BiRNNs (Schuster and Paliwal, 1997; Baldi et al., 1999) and BiLSTMs (Hochreiter and Schmidhuber, 1997), which combine a forward and a backward pass over a sequence. More recently, Transformers (Vaswani et al., 2017) also function with representations that inherently have no notion of linear order. Atemporal processing has thus become the standard again.
In this paper, we explore whether we can adapt such bidirectional models to work in incremental processing mode and what the performance cost of doing so is. We first go back and reproduce the work of Huang et al. (2015), who compare the performance of LSTMs and BiLSTMs in sequence tagging, extending it with a BERT-based encoder and with a collection of different datasets for tagging and classification tasks. Then we address the following questions:

Q1. If we employ inherently non-incremental models in an incremental system, do we get functional representations that are adequate to build correct and stable output along the way? We examine how bidirectional encoders behave under an incremental interface, revisiting the approach proposed by Beuck et al. (2011) for POS taggers. After standard training, we modify the testing procedure so that the system sees only successively extended prefixes of the input, from which it must produce successively extended prefixes of the output, as shown in Figure 1. The evaluation metrics are described in Section 3, and the discussion is anchored on the concepts of timeliness, monotonicity, and decisiveness and their trade-off with respect to non-incremental quality (Beuck et al., 2011; Köhn, 2018). We show that it is possible to use such models as components of an incremental system (e.g. for NLU) with some trade-offs.

¹ An exception is the field of research on interactive systems, where it has been shown that incremental processing can lead to preferable timing behavior (Aist et al., 2007; Skantze and Schlangen, 2009) and work on incremental processing is ongoing (Žilka and Jurčíček, 2015; Trinh et al., 2018; Coman et al., 2019, inter alia).
Q2. How can we adapt the training regime or the real-time procedure to mitigate the negative effect that the non-availability of right context (i.e., future parts of the signal) has on non-incremental models? To tackle this question, we implement three strategies that help improve the models' incremental quality: truncated training, delayed output and prophecies (see Section 4).
Our results are relevant for incremental Natural Language Understanding, needed for the design of dialogue systems and, more generally, interactive systems, e.g. those following the incremental processing model proposed by Schlangen and Skantze (2011). These systems rely on the availability of partial results, on which fast decisions can be based. Similarly, simultaneous translation is an area where decisions need to be based on partial input with incomplete syntactic and semantic information.

Bidirectionality
Language is one of the cognitive abilities that have a temporal nature. The inaugural adoption of RNNs (Elman, 1990) in NLP pursued providing connectionist models with a dynamic memory in order to incorporate time implicitly, not as a dimension but through its effects on processing. Since then, the field has witnessed the emergence of a miscellany of neural architectures that take the temporal structure of language into account. In particular, LSTMs (Hochreiter and Schmidhuber, 1997) have been widely used for sequence-to-sequence or sequence classification tasks, which are ubiquitous in NLP.
More recently, Vaswani et al. (2017) consolidated the application of attention mechanisms to NLP tasks with Transformers, which, unlike BiLSTMs, are not constrained to two processing directions: complete sentences are accessed at once. The need for NLP neural networks to be grounded on robust language models and reliable word representations has become clear. The full right and left context of words started to play a major role, as in Peters et al. (2018), which resorts to bidirectionality to train a language model. Combining bidirectional word representations with the Transformer architecture, BERT (Devlin et al., 2019) established itself as a current state-of-the-art model, on top of which an output layer can be added to solve classification and tagging tasks.

Incremental processing
The motivation to build incremental processors, as defined by Kempen and Hoenkamp (1982) and Levelt (1989), is twofold: they are more cognitively plausible and, from the viewpoint of engineering, real-time applications such as parsing (Nivre, 2004), SRL (Konstas et al., 2014), NLU (Peldszus et al., 2012), dialog state tracking (Trinh et al., 2018), NLG and speech synthesis (Buschmeier et al., 2012) and ASR (Selfridge et al., 2011) require that the input be continually evaluated based on incoming prefixes while the output is being produced and updated.
Another advantage is a better use of computational resources, as a module does not have to wait for the completion of another one to start processing (Skantze and Schlangen, 2009). In robots, linguistic processing must also be intertwined with their perceptions and actions, happening simultaneously (Brick and Scheutz, 2007).
Research on processing and generating language incrementally was done long before the current wave of neural network models, using several different methods. For example, in ASR, a common strategy has been to process the input incrementally to produce some initial output, which was then re-scored or re-processed with a more complex model (Vergyri et al., 2003; Hwang et al., 2009). While the recent accomplishments of neural encoders are cherished, bidirectional encoders drift apart from a desirable temporally incremental approach because they are trained to learn from complete sequences.
There is some cognitive resemblance underlying RNNs in the sense that they can process sequences word by word and build intermediate representations at every time step. This feature provides a legitimate way to employ them in incremental systems. Trinh et al. (2018) and Žilka and Jurčíček (2015) explore this, for instance, using the LSTM's representations to predict dialogue states after each word. Recent work on simultaneous translation also uses RNNs as incremental decoders (Dalvi et al., 2018).
Some works have examined the incremental abilities of RNNs. Hupkes et al. (2018) use a diagnostic classifier to analyze the representations that are incrementally built by sequence-to-sequence models in disfluency detection and conclude that the semantic information is only kept encoded for a few steps after it appears in the dialogue, being soon forgotten afterwards. Ulmer et al. (2019) propose three metrics to assess the incremental encoding abilities of LSTMs and compare them with variants augmented with attention mechanisms.
According to Beuck et al. (2011) and Schlangen and Skantze (2011), incrementality is not a binary feature. Besides using inherently incremental algorithms, it is also possible to provide incremental interfaces to non-incremental algorithms. Such interfaces simply feed ever-increasing prefixes to what remains a non-incremental algorithm, providing some "housekeeping" to manage the potentially non-monotonic results.
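As a minimal sketch, such an interface can be a thin wrapper that re-runs a non-incremental tagger on every prefix; the wrapper and the toy tagger below are ours, purely illustrative:

```python
from typing import Callable, List

def incremental_interface(tagger: Callable[[List[str]], List[str]],
                          tokens: List[str]) -> List[List[str]]:
    """Feed ever-increasing prefixes to a non-incremental tagger and
    collect the (possibly non-monotonic) sequence of partial outputs."""
    return [tagger(tokens[:t]) for t in range(1, len(tokens) + 1)]

# Hypothetical toy tagger: it tentatively tags the newest word as NOUN
# and retags all earlier words as DET, so earlier labels get substituted.
def toy_tagger(prefix: List[str]) -> List[str]:
    return ["NOUN" if i == len(prefix) - 1 else "DET"
            for i in range(len(prefix))]

partial = incremental_interface(toy_tagger, ["the", "old", "man"])
```

The "housekeeping" then consists of diffing consecutive entries of `partial` to decide which labels were added and which were substituted.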
To alleviate the effect of the partiality of the input, we test the use of anticipated continuations, inspired by the mechanism of predictive processing discussed in cognitive science (Christiansen and Chater, 2016) and the idea of interactive utterance completion introduced by DeVault et al. (2011). Related strategies to predict upcoming content and to wait for more right context are also applied in recent work on simultaneous translation (Grissom II et al., 2014; Oda et al., 2015; Ma et al., 2019). The use of truncated inputs during training, discussed below, aims at making intermediate structures available during learning, an issue discussed in Köhn (2018). This is a variation of the chunked training used in Dalvi et al. (2018).

Evaluation of incremental processors
The hierarchical nature of language makes it likely that incremental processing leads to non-monotonic output due to re-analysis, as in the well-known "garden path" sentences. Incremental systems may edit the output by adding, revoking, and substituting parts of it (Baumann et al., 2011). We expect an incremental system to produce accurate output as soon as possible (Trinh et al., 2018), with a minimum number of revocations and substitutions, ideally making only correct additions, to avoid jitter that may be detrimental to subsequent processors working on partial outputs.
To assess the incremental behavior of sequence tagging and classification models, we use the evaluation metrics for incremental processors established by Schlangen et al. (2009) and Baumann et al. (2011).The latter defines three diachronic metrics: edit overhead (EO ∈ [0, 1]), the proportion of unnecessary edits (the closer to 0, the fewer edits were made); correction time (CT ∈ [0, 1]), the fraction of the utterance seen before the system commits on a final decision for a piece of the output (the closer to 0, the sooner final decisions were made); and relative correctness (RC ∈ [0, 1]), the proportion of outputs that are correct with respect to the non-incremental output (being close to 1 means the system outputs were most of the time correct prefixes of the non-incremental output).
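Under these definitions, EO and RC can be computed directly from the log of partial outputs produced by a tagger; the following is a minimal sketch (function names are ours, not from the cited works):

```python
from typing import List

def edit_overhead(partial_outputs: List[List[str]]) -> float:
    """EO = unnecessary edits / all edits. In tagging, each new label is a
    necessary addition; changing an already-output label is an unnecessary
    substitution (there are no revocations in this setting)."""
    necessary = len(partial_outputs[-1])        # one addition per token
    unnecessary = sum(
        sum(p != c for p, c in zip(prev, curr))
        for prev, curr in zip(partial_outputs, partial_outputs[1:]))
    return unnecessary / (necessary + unnecessary)

def relative_correctness(partial_outputs: List[List[str]]) -> float:
    """RC = proportion of partial outputs that are correct prefixes of the
    final (non-incremental) output."""
    final = partial_outputs[-1]
    good = sum(out == final[:len(out)] for out in partial_outputs)
    return good / len(partial_outputs)

log = [["A"], ["B", "A"], ["B", "A", "C"]]   # one substitution at step 2
```

For this toy log, there are three necessary additions and one substitution, so EO = 1/4, and two of the three partial outputs are correct prefixes of the final output, so RC = 2/3.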
The sequence tagging tasks we evaluate are massively incremental (Hildebrandt et al., 1999), meaning that a new label is always added to the output after a new word is processed. The models can also substitute any previous labels in the output sequence in the light of new input. Sequence classifiers must add one label (the sequence's class) after seeing the first word and can only substitute that single label after each new word. In both cases, additions are obligatory and substitutions should ideally be kept as low as possible, but there can be no revocations. Moreover, our data is sequential, discrete, and order-preserving (Köhn, 2018).
Given a sequence of length n, the number of necessary edits is always the number of tokens in the sequence (all additions) for sequence taggers, and we set it to 1 for sequence classifiers. All other edits (substitutions) count as unnecessary, and their number is bounded by Σ_{i=1}^{n−1} i = n(n−1)/2 for tagging and by n − 1 for classification.
We need to slightly adapt the CT measure for sequences. It is originally defined as FD − F0, the time step of a final decision minus the time step when the output first appeared. F0 is fixed for every word in a sequence (the system always outputs a new label corresponding to each new word it sees), but each label will have a different FD. In order not to penalize initial labels, which have more opportunities to be substituted than final ones, we instead sum the FD of each token and divide by the sum of the number of times each one could be modified, to get a score for the sequence as a whole. Let the sequence length be n; then CT_score = (Σ_{i=1}^{n} FD_i) / (Σ_{i=1}^{n} (n − i)). We define it to be 0 for sequences of one token. Again, 0 means every label is immediately committed; 1 means all final decisions are delayed until the last time step.
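Our reading of this adapted CT score, where token i first appears at time step i and its final decision is the last step at which its label changes, can be sketched as follows (function name is ours):

```python
from typing import List

def correction_time_score(partial_outputs: List[List[str]]) -> float:
    """CT score for a whole sequence: sum over tokens of the delay until
    their final decision, normalized by the number of chances each token
    had to be modified. Token i (0-indexed) first appears at step i."""
    n = len(partial_outputs)
    if n == 1:
        return 0.0                               # defined as 0 for singletons
    fd_sum = 0
    for i in range(n):
        fd = i                                   # committed immediately...
        for t in range(i + 1, n):                # ...unless a later step
            if partial_outputs[t][i] != partial_outputs[t - 1][i]:
                fd = t                           # revises the label
        fd_sum += fd - i                         # delay for token i
    denom = sum(n - i for i in range(1, n + 1))  # = n * (n - 1) / 2
    return fd_sum / denom
```

For the log [["A"], ["B", "A"], ["B", "A", "C"]], only the first token is revised (one step after it appeared), giving CT = 1/3.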
Figure 2 presents a concrete example of how to estimate the metrics.
Based on the trade-off between responsiveness and output quality (Skantze and Schlangen, 2009), we also estimate whether there is any improvement in the quality of the outputs if the encoder waits for some right context to appear before committing to previously generated output. For that, we use delayed EO and delayed RC (also named discounted in Baumann et al., 2011), which allow one or two words of right context to be observed before previous labels are output, named EO/RC∆1 and EO/RC∆2, respectively.
In order to concentrate on the incremental quality despite any non-incremental deficiencies, we follow the approach of Baumann et al. (2011) and evaluate intermediate outputs in comparison to the processor's final output, which may differ from the gold output but is the same as the non-incremental output. The general non-incremental correctness should be guaranteed by having a high accuracy or F1 score in the non-incremental setting.

Models
We test the behavior of five neural networks, illustrated in Figure 3, under an incremental processing interface operating on the word level and having full sentences as processing units: a) a vanilla LSTM; b) a vanilla BiLSTM; c) an LSTM with a CRF (Conditional Random Field) layer; d) a BiLSTM with a CRF layer; and e) BERT. The vanilla LSTM is the only model that works solely in the temporal direction.
We choose to use the basic forms of each model to isolate the effect of bidirectionality. They perform well enough on the tasks to enable a realistic evaluation (see Table 1). Note that state-of-the-art results are typically achieved by combining them with more sophisticated mechanisms. We use the models for both sequence tagging and classification. They use the representation at each time step to predict a corresponding label for sequence tagging, whereas for sequence classification they use the representation of the last time step (LSTM), a combination of the last forward and backward representations (BiLSTM) or, in the case of BERT, the representation at the CLS (initial) token, as suggested in Devlin et al. (2019). The two models with CRF cannot be used for classification, as there are no transition probabilities to estimate.
Sequence tagging implies a one-to-one mapping from words to labels, so that for every new word the system receives, it outputs a sequence with one extra label. In sequence classification, we map every input to a single label. In that case, the LSTM can also edit the output, since it can change the chosen label as it processes more information. Because the datasets we use are tokenized and each token has a corresponding label, we follow the instructions given by Devlin et al. (2019) for dealing with BERT's subtokenization: the scores of the first subtoken are used to predict its label, and further subtoken scores are ignored.
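The first-subtoken alignment can be sketched as below; the toy subtokenizer is purely illustrative (BERT actually uses WordPiece):

```python
from typing import Callable, List

def first_subtoken_indices(words: List[str],
                           subtokenize: Callable[[str], List[str]]) -> List[int]:
    """Positions (in the flat subtoken sequence) of each word's first
    subtoken. Only the scores at these positions are used to predict the
    word-level label; the remaining subtoken scores are ignored."""
    indices, pos = [], 0
    for word in words:
        indices.append(pos)
        pos += len(subtokenize(word))
    return indices

# Toy subtokenizer (assumption): split long words into two pieces.
sub = lambda w: [w[:4], "##" + w[4:]] if len(w) > 4 else [w]
idx = first_subtoken_indices(["incremental", "NLU"], sub)
```

Given logits over the subtoken sequence, indexing them with `idx` yields one score vector per word, restoring the one-to-one word-label mapping.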
Except for the LSTM on sequence tagging, all models' outputs are non-monotonic, i.e., they may reassign labels of previous words. The concept of timeliness is trivial here because we know exactly that the label for the t-th word will appear for the first time in the t-th version of the output, for all t. Even so, we can delay the output to allow some lookahead. In terms of decisiveness, all models commit to a single output at every time step. Figure 4 shows an example of the computation graph. BiLSTMs can recompute only the backward pass, while BERT needs a complete recomputation.

Strategies
We check the effect of three strategies: truncated training, delayed output and prophecies. In the first case, we modify the training regime by stripping off the endings of each sentence in the training set. We randomly sample a maximum length l ≤ n, where n is the original sentence length, and cut the subsequent words and labels. We expect this to encourage the model to learn how to deal with the truncated sequences that it will have to process during testing.
The second strategy involves allowing some upcoming words to be observed before outputting a label corresponding to previous words. This is a case of the lookahead described in Baumann et al. (2011), where the processor is allowed to wait for some right context before making a first decision with respect to previous time steps. We experiment with right contexts of one or two words, ∆1 and ∆2, respectively. ∆1 means the model outputs the first label for word t once it consumes word t + 1. Analogously, ∆2 means the model can observe words t + 1 and t + 2 before outputting the first label for word t. Figure 5 illustrates how to calculate EO with ∆1 delay for the same example as in Figure 2.
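Under our reading, the delayed output regime can be sketched as a post-processing step over the log of partial outputs; at the final time step, the full output is always emitted:

```python
from typing import List

def apply_delay(partial_outputs: List[List[str]], delta: int) -> List[List[str]]:
    """Delayed output: at time step t, emit only labels of tokens whose
    `delta` words of right context have already been seen; at the final
    step, emit the complete output regardless."""
    n = len(partial_outputs)
    delayed = []
    for t, out in enumerate(partial_outputs):
        visible = len(out) if t == n - 1 else max(0, len(out) - delta)
        if visible:
            delayed.append(out[:visible])
    return delayed

log = [["A"], ["B", "A"], ["B", "A", "C"]]
```

With ∆1 on this toy log, the early substitution of the first label is never surfaced, so the delayed EO drops to zero, illustrating why delay reduces EO.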
In the third strategy, we first feed each prefix as left context to the GPT-2 language model and let it generate a continuation up to the end of a sentence, creating a hypothetical full context that meets the needs of the non-incremental nature of the models (see Figure 6 for an example). Not surprisingly, the mean BLEU scores of the prophecies with respect to the real continuations of the sentences are less than 0.004 for all datasets.²
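The prophecy mechanism can be sketched with a pluggable generator; in the paper's setting, `generate_fn` would wrap HuggingFace's GPT-2 generation, but here it is left abstract so the sketch stays self-contained:

```python
from typing import Callable, List

def prophesy(prefix_tokens: List[str],
             generate_fn: Callable[[str], str]) -> List[str]:
    """Complete a prefix with a hypothetical right context ("prophecy")
    produced by a language model, so that the bidirectional encoder sees
    a full pseudo-sentence. `generate_fn` maps a string prompt to a
    continuation string (an abstraction over an LM's decoding loop)."""
    continuation = generate_fn(" ".join(prefix_tokens))
    return prefix_tokens + continuation.split()
```

The encoder then runs on the completed sequence, and only the labels for the real prefix are kept; the prophecy is regenerated at every time step.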
Chunking, NER, SRL, and slot filling use the BIO labeling scheme and are evaluated using the F1 score adapted for sequence evaluation, whereas the performance on POS tagging and classification tasks is measured by accuracy.
The models map from raw words to labels without using any intermediate annotated layer, even though they are available in some datasets.The only exception is the SRL task, for which we concatenate predicate embeddings to word embeddings following the procedure described in He et al. (2017), because a sequence can have as many label sequences as its number of predicates.

Implementation
During training, we minimize cross entropy using the Adam optimization method (Kingma and Ba, 2014). We perform hyperparameter search for the LSTM model using Comet's Bayes search algorithm,³ maximizing the task's performance measure on the validation set, and use its best hyperparameters for all other models except BERT, for which we use HuggingFace's pre-trained bert-base-cased model.
We use GloVe embeddings (Pennington et al., 2014) to initialize word embeddings for all models except BERT, which uses its own embedding mechanism. Random embeddings are used for out-of-GloVe words. We randomly replace tokens by a general <unk> token with probability 0.02 and use this token for all unknown words in the validation and test sets (Žilka and Jurčíček, 2015).
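A minimal sketch of this <unk> replacement scheme (the function name is ours):

```python
import random
from typing import Iterable, List

def mask_unknowns(tokens: List[str], vocab: Iterable[str],
                  p: float = 0.02, train: bool = True,
                  rng=random) -> List[str]:
    """Map out-of-vocabulary tokens to <unk>; during training, also
    replace in-vocabulary tokens by <unk> with probability p so the
    model learns a useful <unk> representation."""
    vocab = set(vocab)
    out = []
    for tok in tokens:
        if tok not in vocab or (train and rng.random() < p):
            out.append("<unk>")
        else:
            out.append(tok)
    return out
```

At validation and test time, `train=False` disables the random replacement, so only genuinely unknown words are mapped to <unk>.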
No parameters are kept frozen during training. Overfitting is avoided with early stopping and dropout. Our implementation uses PyTorch v.1.3.1, and prophecies are generated with HuggingFace's port of the GPT-2 language model. The evaluation of the incrementality metrics is done on the test sets.⁴

Results
The results in Table 1 (above) support the observation that, in general, bidirectional models do have a better non-incremental performance than LSTMs (except for IntentATIS and ProsCons) and that there is a considerable overall improvement from using the BERT model for all tasks. Truncated training reduces overall performance, but even so, BERT with truncated training outperforms all other models, even those with standard training, in most tasks (the exceptions being slot filling and IntentATIS).
Figure 7 presents an overview of the incremental evaluation metrics for all models and tasks. Sequence tagging has, in general, low EO and a low CT score; i.e., labels are not edited much and a final decision is reached early. That does not hold for BERT, whose CT score and EO are, in general, higher. The CT score and EO in sequence classification are also higher because the label in this case should capture a more global representation, which cannot reasonably be expected to be very good when only a small part of the sequence has been seen.
When it comes to RC (correctness relative to the final output), again BERT has worse results than the other models, especially for tagging. For sequence classification, BERT's performance is more in line with the other models. Achieving high RC is desirable because it means that, most of the time, the partial outputs are correct prefixes of the non-incremental output and can be trusted, at least to the same degree that the final result can be trusted.
This overview shows that although BERT's non-incremental performance is normally the highest, the quality of its incremental outputs is more unstable. The next step is examining the effect of the three strategies that seek to improve the quality and stability of incremental outputs. Figure 8 shows that truncated training is always beneficial, as is delayed evaluation, with both strategies reducing EO and increasing RC. The fact that delay helps in all cases indicates that most substitutions happen in the last or second-to-last label (the right frontier, given the current prefix); in other words, even a limited right context improves quality substantially.⁵ Prophecies are detrimental in classification tasks, but they help in some tagging tasks, especially for BERT. Most importantly, any of the strategies causes a great improvement in BERT's incremental performance in sequence tagging, bringing its metrics to the same level as the other models' while retaining its superior non-incremental quality.
Note that while CT and RC can only be measured once the final output is available, an estimate of EO may be evaluated on the fly if we consider the edits and additions up to the last output. Figure 9 shows how the mean EO evolves, breaking out the results for cases where the non-incremental final output will be correct and those where it will not, with respect to the gold labels. We can observe an intriguing pattern: the mean EO grows faster for cases where the final response will be wrong; this is most pronounced for the sequence classification task. It might be possible to use this observation as an indication of how much to trust the final result: if the incremental computation was more unstable than average, we should not expect the final result to be good. However, initial experiments on building a classifier based on the instability of partial outputs have so far not been successful in cashing in on that observation.

Figure 7: Comparison of evaluation metrics for all models and tasks. The incremental behavior is more stable for sequence tagging than for sequence classification. BERT takes longer to reach final decisions, and its outputs are edited more often than those of the other models, especially in sequence tagging tasks.
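The on-the-fly EO estimate described above can be sketched as a running score updated after each new partial output (function name is ours):

```python
from typing import List

def running_edit_overhead(partial_outputs: List[List[str]]) -> List[float]:
    """EO estimated on the fly: after each new partial output, the
    proportion of unnecessary edits among all edits seen so far."""
    scores: List[float] = []
    necessary = unnecessary = 0
    prev: List[str] = []
    for out in partial_outputs:
        necessary += len(out) - len(prev)                  # labels added
        unnecessary += sum(p != c for p, c in zip(prev, out))
        scores.append(unnecessary / (necessary + unnecessary))
        prev = out
    return scores
```

A monitoring component could threshold this running score to flag sequences whose final output is likely to be wrong.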

Discussion and conclusion
We show that bidirectional encoders can be adapted to work under an incremental interface without too drastic an impact on their performance. Even though the training (done on complete sequences) differs from the testing situation (which exposes the model to partial input), the incremental metrics of most models are, in general, good: in sequence tagging, edit overhead is low, final decisions are taken early, and often partial outputs are a correct prefix of the complete non-incremental output. Sequence classification is more unstable because, at the initial steps, there is a higher level of uncertainty about what is coming next. Our experiments show that the deficiencies of BERT in the incremental metrics can be mitigated with some adaptations (truncated training, or prophecies together with delay), which make its incremental quality as good as that of the other models.
Since semantic information is only kept encoded for a few steps in RNNs (Hupkes et al., 2018), this may explain why delay improves the incremental metrics so much. If long-range dependencies are not captured, only neighboring words exert much influence on the choice of a label, so after seeing two words of right context, the system rarely revises labels further back. BERT, having access to the whole sentence at every time step, is less stable because new input can more easily cause it to reassess past labels.
Besides, we also found evidence of different behavior in the instability of partial outputs between correct and incorrect output sequences, which could potentially serve as a signal of lower final quality. This could be used, for example, in dialogue systems: if edit overhead gets too high, a clarification request should be made. A follow-up idea is training a classifier that predicts more precisely how likely it is that the final labels will be accurate, based on the development of EO. However, our initial experiments on building such a classifier were not successful. We suppose this is due to the fact that, in our datasets, incorrect final output sequences still usually have more than 90% correct labels, so the learnable signal may be too weak.
The use of GPT-2 prophecies led to promising improvements for BERT in sequence tagging. We see room for improvement, e.g. resorting to domain adaptation to make the prophecies more related to each genre. A natural extension is training a language model that generates the prophecies jointly with the encoder.
Finally, we believe that using attention mechanisms to study the grounding of the edits, similarly to the ideas in Köhn (2018), can be an important step towards understanding how the preliminary representations are built and decoded; we want to test this as well in future work.
• Dropout is implemented after the embedding layer and after the encoder layer with the same value.
• PyTorch's and Numpy's manual seeds are set to 2204 for all experiments.

Figure 1: Incremental interface on a bidirectional tagging model (here for chunking). Each line represents the input and output at a time step. Necessary additions are green/bold, substitutions are yellow/underlined, and the dashed frame shows the output of the final time step, which is the same as the non-incremental model's.

Figure 2: How we estimate the evaluation metrics for the complete sequence of outputs from Figure 1.

Figure 3: Models for sequence tagging (w = word, l = label). (a) is the only inherently incremental model. (a), (b) and (e) can also be used for sequence classification if we consider only their final representation.

Figure 4: Incremental interface of a non-incremental bidirectional model, showing the input and output at time step 3. The context vector fed into the backward LSTM can be zero or initialized with a hypothetical right context generated by a language model.

Figure 5: Example of the calculation of Edit Overhead with ∆1 delay for the example in Figure 2. The first choice for each label happens once the subsequent word has been observed, except for the last token in the sentence.

Figure 6: Input throughout time steps using hypothetical right contexts generated by GPT-2, providing a full sequence for the backward direction.

Figure 8: Comparison of mean Edit Overhead and mean Relative Correctness on the baseline incremental interface and the three strategies, using observations from all tasks except SRL.

Figure 9: Evolution of the mean Edit Overhead, where correct means that all final output labels of a sentence are right and incorrect means that at least one label of the final output sequence is wrong. All models are more unstable when their non-incremental final output is incorrect with respect to the gold output.

Table 1: Non-incremental performance of all models on test sets (truncated training in parentheses). The results are not necessarily state-of-the-art because we use basic forms of each model in order to isolate the effect of bidirectionality and obtain comparable results across different tasks.

Table 3: Hyperparameter search for the LSTM model. The best configuration was also used for LSTM+CRF, BiLSTM and BiLSTM+CRF. Runtime in minutes.

Table 4: Hyperparameter search for the BERT model. Runtime in minutes.

Table 5: Number of parameters in each model.

Table 7: Non-incremental performance of all models on validation sets, for reproducibility purposes. Values in parentheses refer to using truncated samples during training.

Table 8: Sentence-level non-incremental performance of all models on test sets (the same as accuracy in sequence classification). Values in parentheses refer to using truncated samples during training.

Table 9: Mean values of Edit Overhead, Correction Time Score and Relative Correctness. Values in parentheses refer to using truncated samples during training.