Location Attention for Extrapolation to Longer Sequences

Neural networks are surprisingly good at interpolating and perform remarkably well when the training set examples resemble those in the test set. However, they are often unable to extrapolate patterns beyond the seen data, even when the abstractions required for such patterns are simple. In this paper, we first review the notion of extrapolation, why it is important and how one could hope to tackle it. We then focus on a specific type of extrapolation which is especially useful for natural language processing: generalization to sequences that are longer than the training ones. We hypothesize that models with separate content- and location-based attention are more likely to extrapolate than those with common attention mechanisms. We empirically support our claim for recurrent seq2seq models with our proposed attention on variants of the Lookup Table task. This sheds light on some striking failures of neural models for sequences and on possible methods for approaching such issues.


Introduction
It is indisputable that, in recent years, neural network research has made stunning progress on a wide variety of tasks that require processing sequential inputs, such as machine translation and speech recognition (Graves et al., 2013). However, many researchers have questioned the forms of generalization that neural networks exhibit, which significantly diverge from human-like generalization (Lake and Baroni, 2017b; Geirhos et al., 2018). This discrepancy is particularly pronounced when it comes to extrapolating "outside" the training space (DeLosh et al., 1997; Marcus, 1998). As a result, the majority of successes of neural networks only require interpolating "between" the training data points, even though extrapolation is arguably more important in human-like cognition (Marcus, 2018).
As neural networks are powerful memorizers (Zhang et al., 2016) and easily learn superficial statistical cues (Jo and Bengio, 2017), testing extrapolation and generalization to samples from the long tails of a distribution might be the only way of quantifying their capacity for abstract reasoning (Barrett et al., 2018). Methods that are capable of abstract reasoning could be key to improving sample efficiency and interpretability, as this is akin to how humans learn.
Given those benefits, one might wonder why so little work has been done on extrapolation. One possible explanation is that the probability of encountering a test example in the extrapolation setting seems low when the training set $D$ is large. However, such an argument fails to consider the possibly high cost of error in an extrapolation setting and how such a cost can be a barrier for real-world scenarios (e.g. self-driving cars). Moreover, extrapolation is still common in practical scenarios, as high-dimensional problems would typically require an exponentially large $D$ to be representative and the underlying distribution may vary over time (Hooker, 2004).
In this paper, we focus on extrapolation in sequences. More precisely, we study how to generalize sequence-to-sequence predictors to inputs of length $n^* > n_D$, where $n_D$ denotes the length of the longest sequence in the training set. Such extrapolation is crucial for language acquisition, where humans have limited learning resources to account for the unbounded nature of language. To successfully generalize, a language learner needs to process new and potentially longer sentences than previously encountered ones (Chomsky, 1956).
Accounting for this unbounded nature of language is challenging for neural networks. This issue has recently been uncovered for seq2seq models by looking at simple artificial tasks (Lake and Baroni, 2017a; Liska et al., 2018; Weber et al., 2018). Liska et al. (2018) find that seq2seq architectures have a small probability of generalizing, suggesting that neural networks could generalize but lack inductive biases that favor extrapolatable behavior.
In the following sections, we review the concepts of extrapolation and attention mechanisms. We then argue that current attention mechanisms, which are largely responsible for recent successes in natural language processing (NLP), are unlikely to extrapolate because they are based on the content of trained embeddings. This leads us to introduce a novel location-based attention that is loosely inspired by human visual attention. To avoid gaining extrapolation capabilities at the cost of expressivity, we introduce an attention mixer that can combine content- and position-based attentions. Finally, we show that recurrent models equipped with this new attention mechanism can extrapolate to longer sequences.

Extrapolation
The concept of extrapolation is often used but rarely formally defined despite the various definitions proposed by Ebert et al. (2014). In this work, we use the most restrictive one, such that models we deem extrapolatable are extrapolatable regardless of the definition used.
Given any finite training dataset $D := \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^d$, we define the interpolation domain to be the $d$-dimensional interval $I_{\mathrm{inter}} := \prod_{i=1}^{d} [\min_n x_i^{(n)}, \max_n x_i^{(n)}]$ and the extrapolation domain its complement $I_{\mathrm{extra}} := \mathbb{R}^d \setminus I_{\mathrm{inter}}$. A schematic representation is depicted in Figure 1. Throughout this paper, we assume that neural networks with inputs or temporary representations in $I_{\mathrm{extra}}$ will break. Indeed, for a given target function $t : \mathbb{R}^d \to \mathbb{R}$ to approximate, there are infinitely many predictors $f$ that satisfy $f(x) = t(x), \forall x \in I_{\mathrm{inter}} \subset \mathbb{R}^d$. Without any additional constraints, it is thus extremely unlikely that $f(x) = t(x), \forall x \in \mathbb{R}^d$. This explains why neural networks have empirically been found to break in extrapolation settings (Lohninger, 1999; Hettiarachchi et al., 2005; Mitchell et al., 2018).
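To make the distinction concrete, the following minimal Python sketch tests whether a point falls in the extrapolation domain. It assumes that the interpolation domain is the axis-aligned box spanned by the per-dimension minima and maxima of the training points; the function names `interpolation_box` and `is_extrapolation` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple

def interpolation_box(train: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Per-dimension (min, max) bounds of the training points (assumed definition)."""
    dims = range(len(train[0]))
    return [(min(x[i] for x in train), max(x[i] for x in train)) for i in dims]

def is_extrapolation(x: Sequence[float], box: List[Tuple[float, float]]) -> bool:
    """True if x leaves the box in at least one dimension."""
    return any(not (lo <= xi <= hi) for xi, (lo, hi) in zip(x, box))

train = [(0.0, 1.0), (2.0, 3.0), (1.0, 0.5)]
box = interpolation_box(train)
inside = is_extrapolation((1.0, 2.0), box)   # within the box in both dimensions
outside = is_extrapolation((1.0, 3.5), box)  # second coordinate exceeds the maximum
```

Note that a single out-of-range coordinate is enough to place a point in the extrapolation domain, which is why the definition is restrictive.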

Attention Mechanisms
An attention mechanism (or attender) takes as input a matrix of keys $K := \{k_s^\top\}_{s=1}^{n_s} \in \mathbb{R}^{n_s \times d}$ and a query $q_t \in \mathbb{R}^d$, and outputs a probability mass function called attention $\boldsymbol{\alpha}_t \in \mathbb{R}^{n_s}$ that weights a set of values $V := \{v_s^\top\}_{s=1}^{n_s} \in \mathbb{R}^{n_s \times d_v}$ to generate a glimpse vector $g_t \in \mathbb{R}^{d_v}$ used for downstream tasks. Following Graves et al. (2014), it is useful to think of the attender as a memory access module, $\boldsymbol{\alpha}_t$ as the soft address and $g_t$ as the accessed vector in memory.
$g_t := \sum_{s=1}^{n_s} v_s \, \mathrm{attender}(k_s, q_t) = V^\top \boldsymbol{\alpha}_t$ (1)
Figure 2 illustrates attention in a recurrent seq2seq model, which we will use for our experiments. Both the keys and the values correspond to the set of encoder hidden states $K = V = E := \{e_s^\top\}_{s=1}^{n_s}$, while the query corresponds to the current decoder hidden state $q_t = d_t$.
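As an illustration of the memory-access view, the glimpse is simply the attention-weighted sum of the value vectors. The sketch below uses plain Python lists; the helper name `glimpse` and the toy values are illustrative assumptions, not part of the model description.

```python
from typing import List, Sequence

def glimpse(values: Sequence[Sequence[float]], attention: Sequence[float]) -> List[float]:
    """Weighted sum of the value vectors, one weight per source position."""
    d_v = len(values[0])
    return [sum(a * v[j] for a, v in zip(attention, values)) for j in range(d_v)]

# Three source positions with 2-dimensional values.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = [0.5, 0.25, 0.25]  # a probability mass function over positions
g = glimpse(values, attention)  # [0.75, 0.5]
```

With a one-hot attention the glimpse reduces to a hard memory read of a single value vector.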

Content Attention
Most attention mechanisms compute "content based addressing" (associative memory) based on (partial) matches of the key and query. They take as input $K$ and $q_t$ and output a semantic-based attention $\boldsymbol{\gamma}_t \in \mathbb{R}^{n_s}$.
Intuitively, content attention enables the network to keep track of high-level concepts and then access the details through partial semantic matching. For example, if you were to read a paper about multiple breeds of dogs and then translate it, you might not remember the exact breeds, but you will remember that the text is about dogs. At translation time you would go back to the text and translate the exact breeds by knowing what to look for.
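A number of methods have been proposed. They usually differ in the score that quantifies the match between $k_s$ and $q_t$ and provides the multinomial logits of $\boldsymbol{\gamma}_t$:
$\mathrm{score}(k_s, q_t) := \tilde{k}_s^\top q_t$   Multiplicative (Luong et al., 2015)
$\mathrm{score}(k_s, q_t) := \frac{k_s^\top q_t}{\sqrt{d}}$   Scaled Dot Product (Vaswani et al., 2017)   (3)
where $\tilde{x}$ is a shorthand for $W x$.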
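The multiplicative and scaled dot product scores can be sketched in a few lines of Python. This is a hedged illustration rather than the implementation used in the experiments: the function names are hypothetical, and applying the learned matrix `W` to the key (rather than to the query) in the multiplicative score is an assumption.

```python
import math
from typing import Callable, List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def matvec(W: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    return [dot(row, x) for row in W]

def multiplicative_score(k, q, W) -> float:
    """Bilinear score (W k)^T q; which vector W acts on is an assumption."""
    return dot(matvec(W, k), q)

def scaled_dot_score(k, q) -> float:
    """Dot product divided by the square root of the dimension."""
    return dot(k, q) / math.sqrt(len(k))

def content_attention(keys, q, score: Callable) -> List[float]:
    """Softmax over the per-key scores (the multinomial logits)."""
    logits = [score(k, q) for k in keys]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

keys = [[1.0, 0.0], [0.0, 1.0]]
gamma = content_attention(keys, [3.0, 0.0], scaled_dot_score)
```

Here the query matches the first key, so most of the attention mass falls on position 0.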

Location Attention
A location (or position) attention mechanism computes "location based addressing" (random access memory) based on the index of the key. It takes as input $q_t$ and outputs a location attention $\boldsymbol{\lambda}_t \in \mathbb{R}^{n_s}$. Intuitively, it decides which value to retrieve based on its index. For example, at the second decoding step of a German-to-English translation, you might want to attend to the last word in the German source sentence (the verb) regardless of its identity. In language, it seems crucial to attend to words based on their positions. Yet, position-based attention mechanisms are uncommon in recurrent seq2seq models, as initial experiments in this direction were discouraging (Luong et al., 2015).
In non-recurrent NLP frameworks, successful attention mechanisms with a non-trainable position component have been proposed. The most relevant works for this paper are based on sinusoidal encodings (SE), which have been proposed to take into account the word positions while bypassing the need for recurrences in encoder-decoder frameworks. Specifically, we will consider the transformer and transformerXL (relative SE) attention (Eq. 4), where $p_t$ is a positional encoding with sinusoids of different frequencies at every dimension.

Extrapolation Desiderata
We would like a model which, contrary to the previously outlined attention mechanisms, can extrapolate to sequences longer than the longest training one, $n_D$ (Extrapolation Constraint). As previously discussed, models with inputs or temporary representations in $I_{\mathrm{extra}}$ will very likely break. To satisfy the extrapolation constraint, neural models should thus not depend on features that take values in $I_{\mathrm{extra}}$ for sequences longer than $n_D$. As a result, we posit that recurrent seq2seq models with content attention will not be extrapolatable. Indeed, such models must learn to attend to certain indices by encoding a positional embedding in the hidden states of the encoder through some sort of internal "counter". Once in the extrapolation regime, it is extremely unlikely that this counter will work, thereby breaking the model. Furthermore, our model should be able to learn very complex positional attention patterns (Positional Patterns Constraint). For example, the main issue we foresee with the sinusoidal encodings is their inability to model location patterns that depend on general word position such as "look at the $i$th word (after ...)". Indeed, the sinusoidal encoding for any fixed offset $p_{t+k}$ is linear in $p_t$ but not in $k$. Finally, although the position of words in a sentence is important, many tasks depend on their semantics. The model should thus still be able to learn content-based attention patterns (Content Patterns Constraint).
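The claim about sinusoidal encodings can be checked numerically for a single frequency. In the sketch below, which assumes a (sin, cos) pair per frequency `omega` and uses hypothetical helper names, the encoding at position t + k is obtained from the encoding at position t by a rotation: a linear map of $p_t$ whose coefficients depend on k only through sines and cosines.

```python
import math
from typing import Tuple

def encode(position: float, omega: float) -> Tuple[float, float]:
    """One (sin, cos) pair of a sinusoidal positional encoding."""
    return (math.sin(omega * position), math.cos(omega * position))

def shift(p: Tuple[float, float], k: float, omega: float) -> Tuple[float, float]:
    """Rotate p by the angle omega * k: linear in p, non-linear in k."""
    s, c = p
    cos_k, sin_k = math.cos(omega * k), math.sin(omega * k)
    return (s * cos_k + c * sin_k, c * cos_k - s * sin_k)

shifted = shift(encode(3, 0.1), 4, 0.1)  # should match the encoding of position 7
direct = encode(7, 0.1)
```

Because the rotation matrix is a trigonometric function of the offset, a pattern such as "move i words forward" for arbitrary i cannot be expressed as a linear function of i.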

Model
In this section, we first propose a location attender that can satisfy the extrapolation and positional patterns constraints. We then discuss how to incorporate content attention to satisfy the content patterns constraint.

Location Attender
We would like our position attention to be loosely reminiscent of human attention, whereby we sequentially focus on a single area of the input (e.g. words or pixels) but vaguely perceive neighboring inputs due to the eccentricity effect (Carrasco et al., 1995). The visual acuity of humans is uni-modal, symmetric and spikes at the fovea, which corresponds to a 0° retinal eccentricity. We model this visual acuity using a Gaussian Probability Density Function (PDF). At each step $t$, the Location Attender generates $\mu_t$ and $\sigma_t$, which are used to compute the location attention from the values of the PDF at the indices of the keys. We enforce the extrapolation constraint by using the relative position $\frac{s}{n_s - 1}$ of each source word, such that $\mu_t \in [0, 1]$ independently of $n_s$. This model unfortunately fails to satisfy the positional patterns constraint, as it only allows patterns of attention based on percentile positions rather than absolute positions. E.g. it can attend to the 10th-percentile word but not to the 27th word. We circumvent this problem by defining a small set of scalable building blocks $b_t$ that are weighted by $\boldsymbol{\rho}_t$ to generate a general $\mu_t$. The three building blocks are:
• $\bar{\alpha}_{t-1}$, the average position of the previous attention. $\rho_t^{(\alpha)} \in \{0, 1\}$ gates the last generated attention.
• $\frac{1}{n_s - 1}$, the relative step size between words. $\rho_t^{(1/n_s)} \in \mathbb{Z}$ dictates the additional number of steps to take.
• $1$, the bias term. $\rho_t^{(1)} \in \{0, 0.5, 1\}$ is used to attend to the start/middle/end of a sequence.
Figure 3: Proposed Location Attender. Given a resized query, the Weighter outputs the standard deviation $\sigma_t$ and $\boldsymbol{\rho}_t$, which weights the building blocks $b_t$ to compute the mean $\mu_t$. $\mu_t$ and $\sigma_t$ parametrize a Gaussian PDF used to compute the location attention $\boldsymbol{\lambda}_t$.
The weights $\boldsymbol{\rho}_t$ are generated using a Gated Recurrent Unit (GRU). $\mu_t$ is clamped to $[0, 1]$ by a linear function to yield interpretable and extrapolatable behaviour. We also force $\sigma_t > \min_\sigma$ and normalize it by $n_s$, which respectively avoids division by 0 and makes $\sigma_t$ comparable regardless of $n_s$. A graphical overview of the Location Attender can be seen in Figure 3. The formal definition uses the constants 0.001 and $\min_\sigma = 0.27$.
The activation function applied to $\boldsymbol{\rho}_t$ forces each of its three dimensions to take the correct (discrete) values. To make the model fully differentiable, we use a deterministic continuous relaxation $G$ of the required step function activation (Figure 4). The specific activations are derived from it through clamping: $G^{(\alpha)}(x) = \mathrm{clamp}(G(x), 0, 1)$, $G^{(1/n_s)} = G$, $G^{(1)}(x) = \mathrm{clamp}(G(2x), 0, 2)/2$.
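The following Python sketch illustrates how the building blocks, the clamped activations and the Gaussian could fit together. It is an illustration under several stated assumptions and not the exact formulation: hard rounding stands in for the continuous relaxation $G$, $\mu_t$ is clamped with a hard clamp, $\sigma_t$ is lower-bounded by `MIN_SIGMA` before being divided by $n_s$, and the Gaussian values are renormalized to sum to one. All function names are hypothetical.

```python
import math
from typing import List, Sequence

MIN_SIGMA = 0.27  # minimum standard deviation reported in the text

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def step(x: float) -> int:
    """Hard rounding, used here as a stand-in for the relaxation G (assumption)."""
    return math.floor(x + 0.5)

def g_alpha(x: float) -> float:   # values in {0, 1}
    return clamp(step(x), 0, 1)

def g_step(x: float) -> int:      # values in the integers
    return step(x)

def g_bias(x: float) -> float:    # values in {0, 0.5, 1}
    return clamp(step(2 * x), 0, 2) / 2

def mean_position(prev_attention: Sequence[float]) -> float:
    """Average relative position s / (n_s - 1) under the previous attention."""
    n_s = len(prev_attention)
    return sum(a * s / (n_s - 1) for s, a in enumerate(prev_attention))

def location_attention(prev_attention: Sequence[float],
                       rho: Sequence[float],
                       sigma_raw: float = 0.0) -> List[float]:
    """Gaussian attention over relative positions (requires n_s >= 2)."""
    n_s = len(prev_attention)
    blocks = (mean_position(prev_attention), 1 / (n_s - 1), 1.0)
    mu = clamp(sum(r * b for r, b in zip(rho, blocks)), 0.0, 1.0)
    sigma = max(sigma_raw, MIN_SIGMA) / n_s
    density = [math.exp(-0.5 * ((s / (n_s - 1) - mu) / sigma) ** 2)
               for s in range(n_s)]
    total = sum(density)  # renormalization is an assumption
    return [d / total for d in density]

# Previous attention on index 1 of 5; keep it (rho_alpha = 1) and take one step.
rho = (g_alpha(0.9), g_step(1.2), g_bias(0.1))
attn = location_attention([0.0, 1.0, 0.0, 0.0, 0.0], rho)
```

In this example the mean moves from relative position 0.25 to 0.5, so the attention concentrates on index 2 regardless of how the same pattern would scale to a longer source.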

Mix Attender
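We enforce the content patterns constraint by using a convex combination of content and location attention (Figure 5): $\boldsymbol{\alpha}_t := \varrho_t^{(\lambda)} \boldsymbol{\lambda}_t + (1 - \varrho_t^{(\lambda)}) \boldsymbol{\gamma}_t$, where $\varrho_t^{(\lambda)} \in [0, 1]$ is the mixing weight given to the location attention.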
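A convex combination of two attention distributions can be sketched as follows. The sketch assumes a scalar mixing weight in [0, 1] that multiplies the location attention, so that a weight close to 0 favors content attention; the function name `mix_attention` is hypothetical.

```python
from typing import List, Sequence

def mix_attention(content: Sequence[float],
                  location: Sequence[float],
                  weight_location: float) -> List[float]:
    """Convex combination of content and location attention."""
    if not 0.0 <= weight_location <= 1.0:
        raise ValueError("the mixing weight must lie in [0, 1]")
    return [weight_location * l + (1.0 - weight_location) * c
            for c, l in zip(content, location)]

content = [1.0, 0.0, 0.0]   # content attention focused on the first position
location = [0.0, 0.0, 1.0]  # location attention focused on the last position
mixed = mix_attention(content, location, 0.25)  # [0.75, 0.0, 0.25]
```

Since both inputs sum to one and the weights are non-negative and sum to one, the mixed attention remains a probability mass function.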

Datasets
The fact that humans generate and understand unbounded sentences with a finite experience is often used as proof of the principle of compositionality (Szabó, 2017). Following this argument, methods that can extrapolate to longer sequences should exhibit some kind of compositionality.
Based on this observation, we evaluate on a compositionality-specific artificial task, lookup tables (Liska et al., 2018), but extend it to better quantify extrapolation. This task is especially interesting to us, as there is a clear notion of what a good attention pattern should look like, making it easy to qualitatively and quantitatively analyze attentive models. Finally, it is a well-controlled task, which allows us to uncover challenges that prevent models from extrapolating on real-world data.
Table 1 (recovered rows):
Input | Target | Target Attention
000 t1 . | 000 110 <eos> | 0 1 2
110 t1 . | (not recovered) | (not recovered)
General extrapolatable seq2seq models should be able to "say" when to terminate by outputting an end of sentence token <eos>. We thus append <eos> to the targets and a full stop . to the inputs. At each decoding step, the target only depends on the previous output and the current lookup table. E.g. the last decoding step of 000 t1 t2 only depends on the previous output 110 = $t_1$(000) and the current table $t_2$. The network thus has to learn the lookup table mappings and use the correct one at each step. The gold standard attention therefore corresponds to the position of the current lookup table. Table 1 illustrates a longer example and its correct attention.
The various train and test sets are generated by composing 6 random lookup tables $t_1, \dots, t_6$ that have as input and output one of the $2^3 = 8$ possible 3-bit strings. Specifically, we use $k = 1, \dots, 4$ composed tables in the training set, $k = 2, \dots, 4$ for the interpolation test sets, and $k = 5, \dots, 9$ for the extrapolation test sets.
There are 5 different extrapolation test sets, depending on their additional lengths compared to the maximum training examples (long 1, . . . , long 5). We randomly select only 5000 possible examples for each of these test sets.
For the interpolation test sets, we select 3000 examples from all possible input-output pairs.
The training set contains all other possible input-output pairs, approximately 10000 examples.
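The construction of a single example can be sketched in Python as follows. This is an illustrative sketch, not the generation script used for the experiments: the helper names `make_tables` and `make_example` are hypothetical, and drawing each table as a random permutation of the 3-bit strings (i.e. a bijection) is an assumption.

```python
import random
from itertools import product
from typing import Dict, List, Sequence, Tuple

BITS = ["".join(b) for b in product("01", repeat=3)]  # the 8 possible 3-bit strings

def make_tables(n_tables: int = 6, seed: int = 0) -> Dict[str, Dict[str, str]]:
    """Random lookup tables t1..tn, each assumed to be a permutation of BITS."""
    rng = random.Random(seed)
    tables = {}
    for i in range(1, n_tables + 1):
        outputs = BITS[:]
        rng.shuffle(outputs)
        tables[f"t{i}"] = dict(zip(BITS, outputs))
    return tables

def make_example(x: str, names: Sequence[str],
                 tables: Dict[str, Dict[str, str]]) -> Tuple[List[str], List[str], List[int]]:
    """Source tokens, target tokens and gold attention for one composition."""
    source = [x] + list(names) + ["."]
    target = [x]                                 # first output copies the input
    for name in names:
        target.append(tables[name][target[-1]])  # apply the current table
    target.append("<eos>")
    attention = list(range(len(source)))         # i-th output attends to i-th token
    return source, target, attention

tables = make_tables()
source, target, attention = make_example("000", ["t1", "t2"], tables)
```

With this layout the gold attention is the identity over source indices, the final index pointing at the full stop when <eos> is produced.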

Reversed Lookup Tables
To test whether the attention can generate more complex patterns (investigating the Positional Patterns Constraint), we also introduce a dataset which reverses the order of the inputs in the previous dataset. E.g. the last example in Table 1 would be written as t2 t1 t1 000 ., the target would not change, and the attention pattern should be 3 2 1 0 4 (attend to . when outputting <eos>). Although the change seems minor, we hypothesize that such a setting will be much more complicated, as the attention pattern is not monotonic and follows neither the encoding nor the decoding steps. Indeed, in the previous task, the model only needs to learn to match the $i$th decoding step with the $i$th encoding step.
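The reversal can be expressed as a small transformation of a (source, target, attention) triple. The sketch below assumes that everything except the final full stop is reversed; the helper name `reverse_example` and the target tokens used in the example are illustrative.

```python
from typing import List, Sequence, Tuple

def reverse_example(source: Sequence[str], target: Sequence[str],
                    attention: Sequence[int]) -> Tuple[List[str], List[str], List[int]]:
    """Reverse all source tokens except the final '.', remapping the attention."""
    body = list(source[:-1])
    last = len(body) - 1
    rev_source = body[::-1] + ["."]
    # Every step but the last is mirrored; <eos> still attends to the full stop.
    rev_attention = [last - a for a in attention[:-1]] + [len(source) - 1]
    return rev_source, list(target), rev_attention

rev = reverse_example(["000", "t1", "t1", "t2", "."],
                      ["000", "110", "110", "100", "<eos>"],
                      [0, 1, 2, 3, 4])
```

The resulting pattern first decreases and then jumps to the last index, which is the non-monotonic behavior discussed above.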
Finally, we introduce another variant that also requires content attention (investigating the Content Patterns Constraint). To do so, we augment each training example with a start token "!" between the input and the tables in the source sequence. We then add $m \sim U\{0, 10\}$ tables $t_i$ before the start token. The target outputs are not modified and are thus independent of the added tables. Solving this task requires first attending to the input, then to the token that follows "!" (content attention), and finally proceeding with incremental location attention. Examples of the training data are given in Table 2.
Table 2 (recovered rows):
Input | Target | Target Attention
(not recovered) | 110 100 <eos> | 0 2 3
000 t6 t3 ! t1 t1 t2 . | 000 110 110 100 <eos> | 0 3 4 5 6 7
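The noisy-start augmentation can be sketched as follows, following the textual description: attend to the input, then to the token after "!", then proceed incrementally. The helper name `add_noisy_start` is hypothetical, and sampling the noise tables uniformly with replacement is an assumption.

```python
import random
from typing import List, Sequence, Tuple

def add_noisy_start(source: Sequence[str], attention: Sequence[int],
                    table_names: Sequence[str], rng: random.Random,
                    max_noise: int = 10) -> Tuple[List[str], List[int]]:
    """Insert m ~ U{0, max_noise} noise tables and '!' after the input token."""
    m = rng.randint(0, max_noise)  # inclusive on both ends
    noise = [rng.choice(table_names) for _ in range(m)]
    noisy_source = [source[0]] + noise + ["!"] + list(source[1:])
    offset = m + 1                 # the noise tables plus the start token
    noisy_attention = [attention[0]] + [a + offset for a in attention[1:]]
    return noisy_source, noisy_attention

rng = random.Random(1)
noisy_source, noisy_attention = add_noisy_start(
    ["000", "t1", "t2", "."], [0, 1, 2, 3], ["t1", "t2", "t3"], rng)
```

The target sequence is left untouched, so only the source and the gold attention need to be recomputed.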

Metrics
The main metric is sequence accuracy (seqAcc), which corresponds to the accuracy of predicting the entire sequence correctly (including its length).
To gain insight into how the model works, we also use two other metrics.
Sequence Accuracy Before Eos (seqAccBE) only evaluates the accuracy of the subsequence generated before the model outputs <eos>.
Attention Loss (attnLoss) quantifies the quality of the attention pattern before <eos>. It is computed as the mean squared error between the predicted and gold standard attention. The attention loss indicates how far the model is from the ideal attention patterns required to solve the sequence.
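The three metrics can be sketched for a single example as below. The exact definitions used in the experiments may differ: comparing the predicted prefix with the gold prefix of the same length for seqAccBE, and computing the attention loss against one-hot gold distributions, are both assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence

def seq_acc(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """1.0 if the whole sequence (including its length) is correct."""
    return float(list(predicted) == list(gold))

def seq_acc_before_eos(predicted: Sequence[str], gold: Sequence[str],
                       eos: str = "<eos>") -> float:
    """1.0 if everything predicted before the first <eos> matches the gold prefix."""
    predicted = list(predicted)
    cut = predicted.index(eos) if eos in predicted else len(predicted)
    return float(predicted[:cut] == list(gold)[:cut])

def attn_loss(attention: Sequence[Sequence[float]],
              gold_positions: Sequence[int]) -> float:
    """Mean squared error against one-hot gold attention (assumed form)."""
    errors: List[float] = []
    for step_attention, gold in zip(attention, gold_positions):
        for s, a in enumerate(step_attention):
            errors.append((a - (1.0 if s == gold else 0.0)) ** 2)
    return sum(errors) / len(errors)

gold = ["000", "110", "100", "<eos>"]
early_stop = ["000", "110", "<eos>"]  # correct outputs, but terminated too soon
```

An early-terminating prediction such as `early_stop` scores 0 on seqAcc but 1 on seqAccBE, which is what makes the second metric useful for diagnosing premature <eos> outputs.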

Architecture and Baselines
Concerning baselines, we use three content attention mechanisms: additive, multiplicative, and scaled dot product (Eq. 3). We also have two mixed content-location attention baselines: Transformer and TransformerXL (Eq. 4).
To focus on the attention mechanisms, our model and the baselines all use a smaller version of the best performing recurrent seq2seq architecture on the lookup table task (Hupkes et al., 2018). The model was never modified during our experiments and is schematized in Figure 2. The embeddings are of dimension 64, the recurrent network is a GRU with a hidden size of 128, 50% dropout (Srivastava et al., 2014) is applied on the encoder-decoder bottleneck, and a residual connection is used between the inputs (embeddings) and outputs of the encoder. Training consists of 50 epochs with the Adam (Kingma and Ba, 2015) optimizer.

Interpolation
As a sanity check, we tested all the baselines and our models (with and without attention mix) on the interpolation setting of the three tasks. Our models and the best baseline (transformer attention) achieved 100% sequence accuracy (seqAcc).

Extrapolation Constraint
The major desired property of our model is to be able to extrapolate. We tested the extrapolation capacity of our location attender by evaluating its seqAcc on the long lookup table extrapolation test sets. Figure 6 shows the seqAcc of the location attender against the strongest baseline (transformer attention).
As hypothesized, the transformer attention has some extrapolation capacity, but our location attender substantially outperforms it in this simple task. Importantly, the loss in performance in the extrapolation setting for the best baseline is abrupt and goes from 100% to 0% by adding only 3 tokens to the inputs. This suggests that commonly used models are brittle and cannot even extrapolate by a small amount.
Although the previous results are encouraging, we would like to understand what is holding back our model from perfectly extrapolating (Figure 6).
To do so, we computed the sequence accuracy before <eos> (seqAccBE). Figure 7 shows that the model's outputs are always correct but that it often terminates decoding too soon, which we will refer to as the <eos> problem. This suggests that the decoder keeps an internal "counter" to increase the probability of outputting <eos> when the decoding step is greater than the ones seen at training time. The model learns this heuristic, which is always correct during training time and can be thought of as a form of metric hacking. Importantly, it is not a "hard" boundary: the model is often able to extrapolate a couple of steps but usually stops before the correct number of steps.

Positional and Content Patterns Constraint
Having shown that our model can extrapolate well on a simple task, we would like to investigate whether it can do so for tasks that require more complicated attention patterns, such as the reversed and noisy tasks. Although the Mix Attender outperformed all baselines on both tasks, it was not able to get more than 40% and 5% sequence accuracy for long 1 and long 2 respectively. Figure 8 shows that when considering seqAccBE, the Mix Attender is able to extrapolate well in the noisy setting and a little in the reversed setting. This suggests that it is not able to extrapolate well when considering sequence accuracy because it strongly suffers from the <eos> problem. This is a recurrent problem in our experiments and is more likely to happen in harder tasks and larger models.

Attention Pattern
As previously discussed, variants of the lookup table task are especially interesting as we know the gold standard attention pattern. This enables evaluation of attention patterns through the MSE attention loss (attnLoss).

Table 3 shows the attention loss averaged over the three tasks. Although not perfect, the Mix Attender performs on average the best across all settings, even though some baselines outperformed it in the interpolation settings of specific tasks, namely the additive attention in the reversed task and the transformer attention in the noisy task. Crucially, it performs similarly in the interpolation setting and the simple extrapolation setting (long 1), while all baselines perform significantly worse after adding a single token. Even in long 2, it is competitive with all other attention mechanisms in their interpolation domain. This indicates that the model is indeed able to extrapolate by being more precise with its attention pattern.
In addition to enabling extrapolation, the temporary variables such as the weight given to each building block are very helpful for debugging the model and improving interpretability. Figure 9 shows an example output by the Mix Attender for the lookup tables with noisy start task. The input was sampled from the long 4 test set. The top left image shows the final attention, and the top right table shows the value of some interpretable variables at every decoding step. The bottom images correspond to the content and location attention.

Qualitative Analysis
The first decoding step uses location attention to attend to the first input. For the next three steps, the model outputs a mixing weight $\varrho^{(\lambda)} \approx 0$ to focus on content attention. The content attention successfully finds the first non-noisy table (after !); a single step of content attention should be sufficient, but the model seems to consistently use three steps. It then goes back to using the location attention with $\rho^{(\alpha)} = 1$ and $\rho^{(1/n_s)} = 1$ to generate a diagonal attention. Finally, it predicts <eos> when attending to the end of the input ".".
At each step, $\sigma = \min_\sigma$, as the model does not need to attend to neighboring words for this task. The mixing weight $\varrho^{(\lambda)}$ is never exactly 0 or 1, such that the model can easily learn to switch between content and location attention, as it does not collapse to using a single form of attention.

Conclusion and Future Work
Extrapolation is a hard setting that has been investigated surprisingly little by the machine learning community. As current methods that memorize and learn superficial cues are unable to extrapolate while humans are, we posit that such a setting might help (and force) the field to come up with more human-like computational models that are capable of abstract reasoning.
In this paper, we focused on one type of extrapolation which is especially important in NLP: generalizing to longer sequences.
We show that recurrent seq2seq models with common attention mechanisms are unable to extrapolate well. To overcome this issue, we propose a new location-based attention, and show that it can extrapolate well in simple tasks and learn various attention patterns.
Despite the promising initial results, our model is still unable to extrapolate perfectly for harder tasks. By analyzing its behavior, we uncovered an interesting heuristic used by seq2seq models, namely that they keep track of a decoding "counter" to know when to output the <eos> token. This is a bottleneck for extrapolation, suggesting that removing this heuristic is key to reaching perfect extrapolation and should be investigated in future work.
Once the <eos> problem is solved, we could test the model on real-world datasets. It would also be interesting to test such attention mechanisms in self-attentive seq2seq models without recurrence. Finally, as the location attender is not model dependent, it could be pretrained on complex location patterns and incorporated as a plug-and-play module to get extrapolatable position attention.