Learning attention for historical text normalization by learning to pronounce

Automated processing of historical texts often relies on pre-normalization to modern word forms. Training encoder-decoder architectures to solve such problems typically requires large amounts of training data, which is not available for this task. We address this problem with several novel encoder-decoder architectures, including a multi-task learning (MTL) architecture that uses a grapheme-to-phoneme dictionary as auxiliary data, pushing the state-of-the-art by an absolute 2% increase in performance. We analyze the induced models across 44 different texts from Early New High German. Interestingly, we observe that, as previously conjectured, multi-task learning can learn to focus attention during decoding, in ways remarkably similar to recently proposed attention mechanisms. This, we believe, is an important step toward understanding how MTL works.


Introduction
There is a growing interest in automated processing of historical documents, as evidenced by the growing field of digital humanities and the increasing number of digitally available collections of historical documents. A common approach to dealing with the high degree of spelling variance often found in this type of data is to perform spelling normalization (Piotrowski, 2012), i.e., the mapping of historical spelling variants to standardized/modernized forms (e.g. vnd → und 'and').
Training data for supervised learning of historical text normalization is typically scarce, making it a challenging task for neural architectures, which generally require large amounts of labeled data. Nevertheless, we explore framing the spelling normalization task as a character-based sequence-to-sequence transduction problem, and use encoder-decoder recurrent neural networks (RNNs) to induce our transduction models. This is similar to models that have been proposed for neural machine translation (e.g., Cho et al. (2014)), so our approach can essentially be considered a specific case of character-based neural machine translation. By basing our model on individual characters as input, we keep the vocabulary size small, which in turn reduces the model's complexity and the amount of data required to train it effectively. Using an encoder-decoder architecture removes the need for an explicit character alignment between historical and modern wordforms. Furthermore, we explore using an auxiliary task for which data is more readily available, namely grapheme-to-phoneme mapping (word pronunciation), to regularize the induction of the normalization models.
We propose several architectures, including multi-task learning architectures taking advantage of the auxiliary data, and evaluate them across 44 small datasets from Early New High German.
Contributions Our contributions are as follows:
• We are, to the best of our knowledge, the first to propose and evaluate encoder-decoder architectures for historical text normalization.
• We evaluate several such architectures across 44 datasets of Early New High German.
• We show that such architectures benefit from bidirectional encoding, beam search, and attention.
• We also show that MTL with pronunciation as an auxiliary task improves the performance of architectures without attention.
• We analyze the above architectures and show that the MTL architecture learns attention from the auxiliary task, making the attention mechanism largely redundant.
In sum, we both push the state-of-the-art in historical text normalization and present an analysis that, we believe, brings us a step further in understanding the benefits of multi-task learning.

Datasets
Normalization For the normalization task, we use a total of 44 texts from the Anselm corpus (Dipper and Schultz-Balluff, 2013) of Early New High German. 1 The corpus is a collection of manuscripts and prints of the same core text, a religious treatise. Although the texts are semi-parallel and share some vocabulary, they were written in different time periods (between the 14th and 16th century) and in different dialectal regions, and show quite diverse spelling characteristics. For example, the modern German word Frau 'woman' can be spelled as fraw/vraw (Me), frawe (N2), frauwe (St), fraüwe (B2), frow (Stu), vrowe (Ka), vorwe (Sa), or vrouwe (B), among others. 2 All texts in the Anselm corpus are manually annotated with gold-standard normalizations following the guidelines described in Krasselt et al. (2015). For our experiments, we excluded texts from the corpus that are shorter than 4,000 tokens, as well as a few for which annotations were not yet available at the time of writing (mostly Low German and Dutch versions). Nonetheless, the remaining 44 texts are still quite short by machine-learning standards, ranging from about 4,200 to 13,200 tokens, with an average length of 7,350 tokens.
For all texts, we removed tokens consisting solely of punctuation characters. We also lowercased all characters, since this helps keep the vocabulary size small, and capitalization is usually not very consistent in historical texts. Tokenization was not an issue when pre-processing these texts, since modern token boundaries had already been marked by the transcribers.
1 https://www.linguistics.rub.de/anselm/
2 We refer to individual texts using the same internal IDs that are found in the Anselm corpus (cf. the website).
Grapheme-to-phoneme mappings We use learning to pronounce as our auxiliary task. This task consists of learning mappings from sequences of graphemes to the corresponding sequences of phonemes. We use the German part of the CELEX lexical database (Baayen et al., 1995), particularly the database of phonetic transcriptions of German wordforms. The database contains a total of 365,530 wordforms with transcriptions in DISC format, which assigns one character to each distinct phonological segment (including affricates and diphthongs). For example, the word Jungfrau 'virgin' is represented as 'jUN-frB.

Base model
We propose several architectures that are extensions of a base neural network architecture, closely following the sequence-to-sequence model proposed by Sutskever et al. (2014). It consists of the following:
• an embedding layer that maps one-hot input vectors to dense vectors;
• an encoder RNN that transforms the input sequence into an intermediate vector of fixed dimensionality;
• a decoder RNN whose hidden state is initialized with the intermediate vector, and which is fed the output prediction of one timestep as the input for the next one; and
• a final dense layer with a softmax activation that takes the decoder's output and generates a probability distribution over the output classes at each timestep.
For the encoder/decoder RNNs, we use long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997). LSTMs are designed to allow recurrent networks to better learn long-term dependencies, and have proven advantageous over standard RNNs on many tasks. We found no significant advantage from stacking multiple LSTM layers for our task, so we use the simplest competitive model, with a single LSTM layer for both the encoder and the decoder.
By using this encoder-decoder model, we avoid the need to generate explicit alignments between the input and output sequences, which would bring up the question of how to deal with input/output pairs of different lengths. Another important property is that the model does not start to generate any output until it has seen the full input sequence, which in theory allows it to learn from any part of the input, without being restricted to fixed context windows. An example illustration of the unrolled network is shown in Fig. 1.
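As an illustration, the following is a minimal sketch of such a character-level encoder-decoder. We show it in PyTorch for brevity, although our actual implementation uses Keras (see the Implementation section); all class and variable names, as well as the exact layer composition, are illustrative only.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal character-level encoder-decoder without attention (sketch)."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)   # embedding layer
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)          # final dense layer

    def forward(self, src, tgt):
        # Encode the historical word form into a fixed-size state.
        _, state = self.encoder(self.src_emb(src))
        # Decode, initializing the decoder with the encoder's final state;
        # `tgt` holds the characters fed to the decoder at each timestep.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)   # per-timestep logits over output characters
```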

Training
During training, the encoder inputs are the historical wordforms, while the decoder inputs correspond to the correct modern target wordforms. We then train each model by minimizing the cross-entropy loss across all output characters; i.e., if $y = (y_1, \dots, y_n)$ is the correct output word (as a list of one-hot vectors of output characters) and $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)$ is the model's output, we minimize the mean loss $-\sum_{i=1}^{n} y_i \log \hat{y}_i$ over all training samples. For the optimization, we use the Adam algorithm (Kingma and Ba, 2015) with a learning rate of 0.003.
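A corresponding training step might look as follows; this is a sketch building on the Seq2Seq class above, with batching and padding details omitted.

```python
import torch
import torch.nn as nn

model = Seq2Seq(src_vocab=60, tgt_vocab=60)        # vocabulary sizes are illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()                    # cross-entropy over output characters

def train_step(src, tgt_in, tgt_out):
    """src, tgt_in, tgt_out: LongTensors of character ids, shape (batch, length).
    tgt_in is the gold target shifted right; tgt_out is the gold target itself."""
    optimizer.zero_grad()
    logits = model(src, tgt_in)                    # (batch, length, tgt_vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```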
To reduce computational complexity, we also set a maximum word length of 14, and filter all training samples where either the input or output word is longer than 14 characters. This only affects 172 samples across the whole dataset, and is only done during training. In other words, we evaluate our models across all the test examples.

Decoding
For prediction, our base model generates output character sequences in a greedy fashion, selecting the character with the highest probability at each timestep. This works fairly well, but the greedy approach can yield globally suboptimal sequences, in which each individual character is sensibly derived from the input, but the overall word is nonsensical. We therefore also experiment with beam search decoding, setting the beam size to 5.
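Greedy decoding amounts to re-running the decoder on its own growing output prefix and always taking the most probable character; a sketch, again based on the Seq2Seq class above, with hypothetical char2id/id2char mappings and <s>/</s> boundary symbols:

```python
import torch

def greedy_decode(model, src, char2id, id2char, max_len=14):
    """Pick the most probable character at each timestep (sketch)."""
    seq = [char2id["<s>"]]
    for _ in range(max_len):
        logits = model(src, torch.tensor([seq]))   # (1, len(seq), vocab)
        next_id = logits[0, -1].argmax().item()
        if next_id == char2id["</s>"]:
            break
        seq.append(next_id)
    return "".join(id2char[i] for i in seq[1:])
```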
Finally, we also experiment with using a lexical filter during the decoding step. Here, before picking the next 5 most likely characters during beam search, we remove all characters that would lead to a string not covered by the lexicon. This is again intended to reduce the occurrence of nonsensical outputs. For the lexicon, we use all word forms from CELEX (cf. Sec. 2) plus the target word forms from the training set. 3
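Beam search with the lexical filter can be sketched as follows; `lexicon_prefixes` is assumed to be a set containing every prefix of every allowed word form, and all other names are again ours.

```python
import torch

def beam_search(model, src, char2id, id2char, lexicon_prefixes=None,
                beam_size=5, max_len=14):
    """Character-level beam search with an optional lexical prefix filter (sketch)."""
    sos, eos = char2id["<s>"], char2id["</s>"]
    beams = [([sos], 0.0)]                         # (character ids, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                     # keep finished hypotheses as they are
                candidates.append((seq, score))
                continue
            logits = model(src, torch.tensor([seq]))
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            for c in torch.topk(log_probs, beam_size).indices.tolist():
                prefix = "".join(id2char[i] for i in seq[1:] + [c] if i != eos)
                # Lexical filter: prune characters leading to strings not in the lexicon.
                if lexicon_prefixes is not None and prefix not in lexicon_prefixes:
                    continue
                candidates.append((seq + [c], score + log_probs[c].item()))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    best_seq, _ = max(beams, key=lambda x: x[1])
    return "".join(id2char[i] for i in best_seq if i not in (sos, eos))
```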

Attention
In our base architecture, we assume that we can decode from a single vector encoding of the input sequence. This is a strong assumption, especially with long input sequences. Attention mechanisms give us more flexibility. The idea is that instead of encoding the entire input sequence into a fixed-length vector, we allow the decoder to "attend" to different parts of the input character sequence at each time step of the output generation. Importantly, we let the model learn what to attend to based on the input sequence and what it has produced so far.
Our implementation is identical to the decoder with soft attention described by Xu et al. (2015). If $a = (a_1, \dots, a_n)$ is the encoder's output and $h_t$ is the decoder's hidden state at timestep $t$, we first calculate a context vector $\hat{z}_t$ as a weighted combination of the output vectors $a_i$:

$$\hat{z}_t = \sum_{i=1}^{n} \alpha_i \, a_i \qquad (1)$$

The weights $\alpha_i$ are derived by feeding the encoder's output and the decoder's hidden state from the previous timestep into a multilayer perceptron, called the attention model ($f_{att}$):

$$\alpha_i = \frac{\exp\left(f_{att}(a_i, h_{t-1})\right)}{\sum_{j=1}^{n} \exp\left(f_{att}(a_j, h_{t-1})\right)} \qquad (2)$$

We then modify the decoder by conditioning its internal states not only on the previous hidden state $h_{t-1}$ and the previously predicted output character $y_{t-1}$, but also on the context vector $\hat{z}_t$:

$$
\begin{aligned}
i_t &= \sigma\!\left(W_i [h_{t-1}; y_{t-1}; \hat{z}_t] + b_i\right) \\
f_t &= \sigma\!\left(W_f [h_{t-1}; y_{t-1}; \hat{z}_t] + b_f\right) \\
o_t &= \sigma\!\left(W_o [h_{t-1}; y_{t-1}; \hat{z}_t] + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_c [h_{t-1}; y_{t-1}; \hat{z}_t] + b_c\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \qquad (3)
$$

In Eq. 3, we follow the traditional LSTM description consisting of input gate $i_t$, forget gate $f_t$, output gate $o_t$, cell state $c_t$ and hidden state $h_t$, where $W$ and $b$ are trainable parameters.
For all experiments including an attentional decoder, we use a bi-directional encoder, consisting of one LSTM layer that reads the input sequence normally and another that reads it backwards; we attend over the concatenated outputs of these two layers.
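The following sketch shows one decoder step with this soft attention mechanism; it assumes 256-dimensional bidirectional encoder outputs (two concatenated 128-dimensional LSTMs) and uses our own names rather than those of the actual implementation.

```python
import torch
import torch.nn as nn

class AttentionDecoderCell(nn.Module):
    """One LSTM decoder step with MLP ('soft') attention over the encoder outputs,
    in the spirit of Xu et al. (2015). Sketch only."""
    def __init__(self, emb_dim=128, hid_dim=128, enc_dim=256, att_dim=128):
        super().__init__()
        self.f_att = nn.Sequential(                       # attention MLP f_att
            nn.Linear(enc_dim + hid_dim, att_dim), nn.Tanh(),
            nn.Linear(att_dim, 1))
        self.cell = nn.LSTMCell(emb_dim + enc_dim, hid_dim)

    def forward(self, y_prev_emb, h_prev, c_prev, enc_outputs):
        # enc_outputs: (batch, src_len, enc_dim) from the bidirectional encoder.
        src_len = enc_outputs.size(1)
        h_rep = h_prev.unsqueeze(1).expand(-1, src_len, -1)
        scores = self.f_att(torch.cat([enc_outputs, h_rep], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)             # attention weights (Eq. 2)
        z = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)  # context vector (Eq. 1)
        # Condition the LSTM step on the previous output character and the context (Eq. 3).
        h, c = self.cell(torch.cat([y_prev_emb, z], dim=-1), (h_prev, c_prev))
        return h, c, alpha
```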
While a precise alignment of input and output sequences is sometimes difficult, the sequences mostly align monotonically, which an attentional component can exploit.

Multi-task learning
Finally, we introduce a variant of the base architecture, with or without beam search, that does multi-task learning (Caruana, 1993). The multi-task architecture differs from the base architecture only in having two classifier functions at the output layer, one for each of our two tasks. Our auxiliary task is to predict a sequence of phonemes as the correct pronunciation of an input sequence of graphemes. This choice is motivated by the relationship between phonology and orthography, in particular the observation that spelling variation often stems from phonological variation.
We train our multi-task learning architecture by alternating between the two tasks, sampling one instance of the auxiliary task for each training sample of the main task. We use the encoder-decoder to generate a corresponding output sequence, whether a modern word form or a pronunciation. Doing so, we suffer a loss with respect to the true output sequence and update the model parameters. The update for a sample from a specific task affects the parameters of the corresponding classifier function, as well as all the parameters of the shared hidden layers.
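A sketch of the multi-task setup and the alternating training loop is given below. The separate target embeddings per task and all names are our own assumptions; only the two task-specific classifier functions on top of the shared encoder-decoder correspond directly to the description above.

```python
import torch
import torch.nn as nn

class MultiTaskSeq2Seq(nn.Module):
    """Shared encoder/decoder with one output classifier per task (sketch)."""
    def __init__(self, src_vocab, norm_vocab, phon_vocab, emb_dim=128, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.norm_emb = nn.Embedding(norm_vocab, emb_dim)   # assumed per-task target embeddings
        self.phon_emb = nn.Embedding(phon_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.norm_out = nn.Linear(hid_dim, norm_vocab)      # main-task classifier
        self.phon_out = nn.Linear(hid_dim, phon_vocab)      # auxiliary-task classifier

    def forward(self, src, tgt, task):
        _, state = self.encoder(self.src_emb(src))
        emb = self.norm_emb if task == "norm" else self.phon_emb
        dec_out, _ = self.decoder(emb(tgt), state)
        head = self.norm_out if task == "norm" else self.phon_out
        return head(dec_out)

def mtl_epoch(model, optimizer, loss_fn, norm_batches, g2p_batches):
    """Alternate between the two tasks: one auxiliary batch per main-task batch.
    Each update changes the shared layers, but only the classifier of its own task."""
    for norm_batch, g2p_batch in zip(norm_batches, g2p_batches):
        for task, (src, tgt_in, tgt_out) in (("norm", norm_batch), ("g2p", g2p_batch)):
            optimizer.zero_grad()
            logits = model(src, tgt_in, task)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
            loss.backward()
            optimizer.step()
```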

Hyperparameters
We used a single manuscript (B) for manually evaluating and setting the hyperparameters. This manuscript is left out of the averages reported below. We believe that using a single manuscript for development, and using the same hyperparameters across all manuscripts, is more realistic, as we often do not have enough data in historical text normalization to reliably tune hyperparameters.
For the final evaluation, we set the size of the embedding and the recurrent LSTM layers to 128, applied a dropout of 0.3 to the input of each recurrent layer, and trained the model on mini-batches with 50 samples each for a total of 50 epochs (in the multi-task learning setup, mini-batches contain 50 samples of each task, and epochs are counted by the size of the training set for the main task only). All these parameters were set on the B manuscript alone.

Implementation
We implemented all of the models in Keras (Chollet, 2015). Any parameters not explicitly described here were left at their default values in Keras v1.0.8.

Evaluation
We split up each text into three parts, using 1,000 tokens each for a test set and a development set (that is not currently used), and the remainder of the text (between 2,000 and 11,000 tokens) for training. We then train and evaluate on each of the 43 texts (excluding the B text that was used for hyper-parameter tuning) individually.
Baselines We compare our architectures to several competitive baselines. Our first baseline is an averaged perceptron model trained to predict output character n-grams for each input character, after using Levenshtein alignment with generated segment distances (Wieling et al., 2009, Sec. 3.3) to align input and output characters. Our second baseline uses the same alignment, but trains a deep bi-LSTM sequential tagger, following Bollmann and Søgaard (2016). We evaluate this tagger using both standard and multi-task learning. Finally, we compare our model to the rule-based and Levenshtein-based algorithms provided by the Norma tool (Bollmann, 2012). 4

Word accuracy
We use word-level accuracy as our evaluation metric. While we also measured character-level metrics, minor differences at the character level can still cause large differences in downstream applications, so we believe that perfectly matching the output sequences is the more useful measure. Average scores across all 43 texts are presented in Table 1 (see Appendix A for individual scores).
We first see that almost all our encoder-decoder architectures perform significantly better than the four state-of-the-art baselines. All our architectures perform better than Norma and the averaged perceptron, and all the MTL architectures outperform Bollmann and Søgaard (2016).
We also see that beam search, filtering, and attention lead to cumulative gains in the context of the single-task architecture, with the best architecture outperforming the state-of-the-art by almost 3% in absolute terms. For our multi-task architecture, we also observe gains when we add beam search and filtering, but importantly, adding attention does not help. In fact, attention hurts the performance of our multi-task architecture quite significantly. Also note that the multi-task architecture without attention performs on par with the single-task architecture with attention.
We hypothesize that the reason for this pattern, which is not only observed in the average scores in Table 1, but also quite consistent across the individual results in Appendix A, is that our multi-task learning already learns how to focus attention. This is the hypothesis that we will try to validate in Sec. 5: That multi-task learning can induce strategies for focusing attention comparable to attention strategies for recurrent neural networks.
Sample predictions A small selection of predictions from our models is shown in Table 2. They serve to illustrate the effects of the various settings; e.g., the base model with greedy search tends to produce more nonsense words (ters, ünsget) than the others. Using a lexical filter helps the most in this regard: the base model with filtering correctly normalizes ergieng to erging '(he) fared', while decoding without a filter produces the non-word erbiggen. Even for herczenlichen (modern herzlichen 'heartfelt'), where no model finds the correct target form, only the model with filtering produces a somewhat reasonable alternative (herzgeliebtes 'heartily loved').
In some cases (such as gewarnet 'warned'),  only the models with attention or multi-task learning produce the correct normalization, but even when they are wrong, they often agree on the prediction (e.g. dicke, herzel). We will investigate this property further in Sec. 5.

Learned vector representations
To gain further insights into our model, we created t-SNE projections (Maaten and Hinton, 2008) of vector representations learned on the M4 text. Fig. 2 shows the learned character embeddings. In the representations from the base model (Fig. 2a), characters that are often normalized to the same target character are indeed grouped closely together: e.g., historical <v> and <u> (and, to a smaller extent, <f>) are often used interchangeably in the M4 text. Note the wide separation of <n> and <m>, which is a feature of M4 that does not hold true for all of the texts, as these do not always display a clear distinction between nasals. On the other hand, the MTL model shows a better generalization of the training data (Fig. 2b): here, <u> is grouped closer to other vowel characters and far away from <v>/<f>. Also, <n> and <m> are now in close proximity.
We can also visualize the internal word representations that are produced by the encoder (Fig. 3). Here, we chose words that demonstrate the interchangeable use of <u> and <v>. Historical vnd, vns, vmb become modern und, uns, um, changing the <v> to <u>. However, the representation of vmb learned by the base model is closer to forms like von, vor, uor, all starting with <v> in the target normalization. In the MTL model, however, these examples are indeed clustered together.
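The projections can be reproduced along the following lines; this is a sketch using scikit-learn's t-SNE implementation and assuming the model and id2char mapping from the sketches above.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the learned character embeddings to 2D (sketch; `model` and `id2char`
# are assumed to come from the training sketches above).
emb = model.src_emb.weight.detach().cpu().numpy()      # (n_chars, 128)
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)

plt.figure(figsize=(6, 6))
for i, (x, y) in enumerate(coords):
    plt.annotate(id2char[i], (x, y))                   # label each point with its character
plt.xlim(coords[:, 0].min() - 5, coords[:, 0].max() + 5)
plt.ylim(coords[:, 1].min() - 5, coords[:, 1].max() + 5)
plt.show()
```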
Analysis: Multi-task learning helps focus attention

Table 1 shows that models which employ either an attention mechanism or multi-task learning obtain similar improvements in word accuracy. However, we observe a decline in word accuracy for models that combine multi-task learning with attention.
A possible interpretation of this counterintuitive pattern might be that attention and MTL, to some degree, learn similar functions of the input data, a conjecture by Caruana (1998). We put this hypothesis to the test by closely investigating properties of the individual models below.

Model parameters
First, we are interested in the weight parameters of the final layer that transforms the decoder output into class probabilities. We consider these parameters for our standard encoder-decoder model and compare them to the weights learned by the attention and multi-task models, respectively. 5 Note that hidden-layer parameters are not necessarily comparable across models, but with a fixed seed, differences in parameters relative to a reference model may be (and are, in our case). With a fixed seed, and iterating over data points in the same order, it is conceivable that the two non-baseline models end up in roughly the same alternative local optimum (or at least take comparable routes there).
We observe that the weight differences between the standard and the attention model correlate with the differences between the standard and the multi-task model by a Pearson's r of 0.346, averaged across datasets, with a standard deviation of 0.315; on individual datasets, the correlation coefficient is as high as 0.96. Figure 4 illustrates these highly parallel weight changes for the different models when trained on the N4 dataset.
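The comparison can be sketched as follows, where the three arguments are the final dense-layer weight matrices of the base, attention, and MTL models trained from the same seed (names are ours):

```python
from scipy.stats import pearsonr

def weight_diff_correlation(w_base, w_att, w_mtl):
    """Correlate how attention and MTL move the final-layer weights away from
    the base model. Arguments are numpy arrays of identical shape (sketch)."""
    diff_att = (w_att - w_base).ravel()
    diff_mtl = (w_mtl - w_base).ravel()
    r, _ = pearsonr(diff_att, diff_mtl)
    return r
```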

Final output
Next, we compare the effect that employing either an attention mechanism or multi-task learning has on the actual output of our system. We find that out of the 210.9 word errors that the base model produces on average across all test sets (comprising 1,000 tokens each), attention resolves 47.7, while multi-task learning resolves an average of 45.4 errors. Crucially, the overlap of errors that are resolved by both the attention and the MTL model amounts to 27.7 on average. Attention and multi-task also introduce new errors compared to the base model (26.6 and 29.5 per test set, respectively), and again we can observe a relatively high agreement of the models (11.8 word errors are introduced by both models).
Finally, the attention and multi-task models display a word-level agreement of κ=0.834 (Cohen's kappa), while either of these models agrees less strongly with the base model (κ=0.817 for attention and κ=0.814 for multi-task learning).
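Both the error-overlap and the agreement figures can be computed as in the following sketch, where `gold` and the `preds_*` lists are assumed to hold one word form per test token:

```python
from sklearn.metrics import cohen_kappa_score

def resolved_errors(gold, base_pred, other_pred):
    """Indices of tokens the base model gets wrong but the other model gets right."""
    return {i for i, (g, b, o) in enumerate(zip(gold, base_pred, other_pred))
            if b != g and o == g}

res_att = resolved_errors(gold, preds_base, preds_att)
res_mtl = resolved_errors(gold, preds_base, preds_mtl)
overlap = len(res_att & res_mtl)          # errors resolved by both attention and MTL

# Word-level agreement, treating each predicted word form as a categorical label.
kappa = cohen_kappa_score(preds_att, preds_mtl)
```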

Saliency analysis
Our last analysis concerns the saliency of the input timesteps with respect to the predictions of our models. We follow Li et al. (2016) in calculating first-derivative saliency for given input/output pairs and compare the scores from the different models. The higher the saliency of an input timestep, the more important it is in determining the model's prediction at a given output timestep. Therefore, if two models produce similar saliency matrices for a given input/output pair, they have learned to focus on similar parts of the input during prediction. Our hypothesis is that the attentional and the multi-task learning model should be more similar in terms of saliency scores than either of them is to the base model. Figure 5 shows a plot of the saliency matrices generated for the word pair czeychen → zeichen 'sign'. Here, the scores for the attentional and the MTL model indeed correlate by ρ = 0.615, while those for the base model do not correlate with either of them. A systematic analysis across 19,000 word pairs (where all models agree on the output) shows that this effect only holds for longer input sequences (≥ 7 characters), with a mean ρ = 0.303 (±0.177) for the attentional vs. the MTL model, while the base model correlates with either of them by ρ < 0.21.
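First-derivative saliency can be computed roughly as follows; the sketch reuses the Seq2Seq class from above, and all names are ours.

```python
import torch

def saliency_matrix(model, src, tgt_in, tgt_out):
    """For every output timestep, the gradient magnitude of the produced character's
    score with respect to each input character's embedding (cf. Li et al., 2016)."""
    src_emb = model.src_emb(src).detach().requires_grad_(True)   # (1, src_len, emb_dim)
    _, state = model.encoder(src_emb)
    dec_out, _ = model.decoder(model.tgt_emb(tgt_in), state)
    logits = model.out(dec_out)                                  # (1, tgt_len, vocab)
    rows = []
    for t in range(logits.size(1)):
        if src_emb.grad is not None:
            src_emb.grad.zero_()
        # Back-propagate the score of the character produced at timestep t.
        logits[0, t, tgt_out[0, t]].backward(retain_graph=True)
        # One saliency value per input position: L2 norm of the gradient.
        rows.append(src_emb.grad[0].norm(dim=-1).clone())
    return torch.stack(rows)        # (tgt_len, src_len) saliency matrix
```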

Related Work
Many traditional approaches to spelling normalization of historical texts use edit distances or some form of character-level rewrite rules, either handcrafted (Baron and Rayson, 2008) or learned automatically (Bollmann, 2013; Porta et al., 2013).
A more recent approach applies character-based statistical machine translation to historical text (Pettersson et al., 2013; Sánchez-Martínez et al., 2013; Scherrer and Erjavec, 2013) or dialectal data (Scherrer and Ljubešić, 2016). This is conceptually very similar to our approach, except that we substitute neural networks for the classical SMT algorithms. Indeed, our models can be seen as a form of character-based neural MT (Cho et al., 2014).
Neural networks have rarely been applied to historical spelling normalization so far. Azawi et al. (2013) normalize old Bible text using bidirectional LSTMs with a layer that performs alignment between input and output wordforms. Bollmann and Søgaard (2016) also use bi-LSTMs to frame spelling normalization as a character-based sequence labelling task, performing character alignment as a preprocessing step. Multi-task learning has been shown to be effective for a variety of NLP tasks, such as POS tagging, chunking, named entity recognition (Collobert et al., 2011) or sentence compression (Klerke et al., 2016). It has also been used in encoder-decoder architectures, typically for machine translation (Dong et al., 2015; Luong et al., 2016), though so far not with attentional decoders.

Conclusion and Future Work
We presented an approach to historical spelling normalization using neural networks with an encoder-decoder architecture, and showed that it consistently outperforms several existing baselines. Encouragingly, our work proves to be fully competitive with the sequence-labeling approach by Bollmann and Søgaard (2016), without requiring a prior character alignment.
Specifically, we demonstrated the ability of multi-task learning to mitigate the shortage of training data for this task. We included a multi-faceted analysis of the effects that MTL has on our models and the resemblance it bears to attention mechanisms. We believe that this analysis is a valuable contribution to the understanding of MTL approaches beyond spelling normalization as well, and we are confident that our observations will stimulate further research into the relationship between MTL and attention.
Finally, many improvements to the presented approach are conceivable, most notably introducing some form of token context into the model. Currently, we only consider word forms in isolation, which is problematic for ambiguous cases (such as jn, which can normalize to in 'in' or ihn 'him') and conceivably makes the task harder for other word forms as well. Reranking the predictions with a language model could be one possible way to improve on this. Previous work, for example, has experimented with segment-based normalization, using a character-based SMT model whose character input is derived from segments (essentially, token n-grams) instead of single tokens, which also introduces context. Such an approach could also deal with the issue of tokenization differences between the historical and the modern text, which is another challenge often found in datasets of historical texts.

A Supplementary Material
For interested parties, we provide our full evaluation results for each single text in our dataset. Table 3 shows token counts, a rough classification of each text's dialectal region, and the results for the baseline methods.