Multi-task learning for historical text normalization: Size matters

Historical text normalization suffers from small datasets that exhibit high variance, and previous work has shown that multi-task learning can be used to leverage data from related problems in order to obtain more robust models. However, previous work has been limited to datasets from a specific language and a specific historical period, and it is not clear whether results generalize. It therefore remains an open question when historical text normalization benefits from multi-task learning. We explore the benefits of multi-task learning across 10 different datasets, representing different languages and periods. Our main finding, contrary to what has been observed for other NLP tasks, is that multi-task learning mainly works when target task data is very scarce.


Introduction
Historical text normalization is the problem of translating historical documents written in the absence of modern spelling conventions and making them amenable to search by today's scholars, processable by natural language processing models, and readable to laypeople. In other words, historical text normalization is a text-to-text generation task, where the input is a text written centuries ago, and the output is a text with the same contents, but using the orthography of the modern-day language. In this paper, we limit ourselves to word-by-word normalization, ignoring the syntactic differences between modern-day languages and their historical predecessors.
Resources for historical text normalization are scarce. Even for major languages like English and German, we have very little training data for inducing normalization models, and the models we induce may be very specific to these datasets and not scale to writings from other historic periods, or even just writings from another monastery or by another author. Bollmann and Søgaard (2016) and Bollmann et al. (2017) recently showed that we can obtain more robust historical text normalization models by exploiting synergies across historical text normalization datasets and with related tasks. Specifically, Bollmann et al. (2017) showed that multi-task learning with German grapheme-to-phoneme translation as an auxiliary task improves a state-of-the-art sequence-to-sequence model for historical text normalization of medieval German manuscripts.

Contributions
We study when multi-task learning leads to improvements in historical text normalization. Specifically, we evaluate a state-of-the-art approach to historical text normalization (Bollmann et al., 2017) with and without various auxiliary tasks, across 10 historical text normalization datasets. We also include an experiment in English historical text normalization using data from Twitter and a grammatical error correction corpus (FCE) as auxiliary datasets. Across the board, we find that, unlike what has been observed for other NLP tasks, multi-task learning only helps when target task data is scarce.

Datasets
We consider 10 datasets from 8 different languages: German, using the Anselm dataset and texts from the RIDGES corpus (Odebrecht et al., 2016); for Anselm, we follow Bollmann et al. (2017) to obtain a single dataset. For RIDGES, we use 16 texts and randomly sample 70% of all sentences from each text for the training set, and 15% each for the dev and test sets. The Spanish and Portuguese datasets consist of manually normalized subsets of the Post Scriptum corpus; here, we randomly sample 80% (train) and 10% (dev/test) of all sentences per century represented in the corpus. Dataset splits for the other languages are taken from Pettersson (2016) and Ljubešić et al. (2016).
We preprocessed all datasets to remove punctuation, perform Unicode normalization, replace digits that do not require normalization with a dummy symbol, and lowercase all tokens. Table 1 gives an overview of all historical datasets, the approximate time period of the historical texts they cover, and the size of the dataset splits. Note that, to the best of our knowledge, the Spanish, Portuguese, and German RIDGES datasets have not previously been used in the context of automatic historical text normalization. Table 2 additionally gives examples of historical word forms and their gold-standard normalizations from each of these datasets.
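A minimal sketch of these preprocessing steps in Python; the Unicode normalization form (NFC), the punctuation test, and the dummy symbol `<NUM>` are assumptions, since the exact choices are not specified here:

```python
import unicodedata

def preprocess(tokens, digit_dummy="<NUM>"):
    """Apply the preprocessing described above to a list of tokens:
    Unicode normalization, punctuation removal, digit replacement,
    and lowercasing. NFC and the dummy symbol are assumptions."""
    cleaned = []
    for tok in tokens:
        tok = unicodedata.normalize("NFC", tok).lower()
        # drop tokens that consist only of punctuation characters
        if all(unicodedata.category(c).startswith("P") for c in tok):
            continue
        # digit strings that require no normalization become a dummy symbol
        if tok.isdigit():
            tok = digit_dummy
        cleaned.append(tok)
    return cleaned

print(preprocess(["Vnd", ",", "1597", "GOtt"]))  # → ['vnd', '<NUM>', 'gott']
```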

Experimental setup
Model We use the same encoder-decoder architecture with attention as described in Bollmann et al. (2017). This is a fairly standard model consisting of one bidirectional LSTM unit in the encoder and one (unidirectional) LSTM unit in the decoder. The input to the encoder is a single historical word form represented as a sequence of characters and padded with word boundary symbols; i.e., we only input single tokens in isolation, not full sentences. The decoder attends over the encoder's outputs and generates the normalized output characters.
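The encoder input, a single word form as a character sequence padded with boundary symbols, could be prepared along these lines; the boundary symbols `<w>`/`</w>` and the index mapping are illustrative assumptions, not details taken from the model:

```python
def encode_token(token, char2idx, bos="<w>", eos="</w>"):
    """Turn a single historical word form into the padded character-index
    sequence fed to the character-level encoder. Unseen characters are
    assigned fresh indices in char2idx."""
    chars = [bos] + list(token) + [eos]
    return [char2idx.setdefault(c, len(char2idx)) for c in chars]

char2idx = {"<w>": 0, "</w>": 1}
print(encode_token("vnd", char2idx))  # → [0, 2, 3, 4, 1]
```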
Hyperparameters We use the same hyperparameters across all our experiments: the dimensionality of the embedding layer is 60, the size of the LSTM layers is set to 300, and we use a dropout rate of 0.2. We use the Adam optimizer (Kingma and Ba, 2014) with a character-wise cross-entropy loss. Training is done on mini-batches of 50 samples with early stopping based on validation on the individual development sets. The hyperparameters were set on a randomly selected subset of 50,000 tokens from each of the following datasets: English, German (Anselm), Hungarian, Icelandic, and Slovene (Gaj). Bollmann et al. (2017) also describe a multi-task learning (MTL) scenario where the encoder-decoder model is trained on two datasets in parallel. We perform similar experiments on pairwise combinations of our datasets. The questions we ask here are whether training on pairs of datasets can improve over training on each dataset individually, which pairings yield the best results, and what properties of the datasets are most predictive of this. In other words, we are interested in when multi-task learning works.

Multi-task learning
In the multi-task learning setting, the two datasets, or "tasks", share all parts of the encoder-decoder model except for the final prediction layer, which is specific to each dataset. This way, most parts of the model are forced to learn language-independent representations. This is different from Luong et al. (2015) and related work in machine translation, where typically only the encoder or the decoder is shared. We do not explore these alternatives here.
During training, we iterate over both datasets in parallel in random order, with each parameter update now being based on 50 samples from each dataset. Since the datasets are of different sizes, we define an epoch to be a fixed number of 50,000 samples. Validation is performed for both datasets after each epoch, and the model state is saved independently for each dataset whenever its validation accuracy improves. This means that even if the ideal number of epochs differs between the datasets, only the best state for each dataset will be used in the end. Training ends only after the validation accuracy for every dataset has stopped improving.
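The training schedule described above can be sketched as follows. `train_step` and `validate` are stand-ins for the real encoder-decoder, and the `patience` budget is an assumption, since the text only states that training ends once validation accuracy has stopped improving for every dataset:

```python
import random

EPOCH_SIZE = 50_000   # an epoch is a fixed number of samples, not one pass
BATCH_SIZE = 50       # per dataset, so each update sees 50 samples from each

def train_mtl(datasets, train_step, validate, patience=5):
    """Multi-task training loop: draw batches from both datasets in
    parallel, validate each dataset after every epoch, and checkpoint
    each dataset's best model state independently."""
    best = {name: (0.0, None) for name in datasets}      # name -> (acc, state)
    epochs_without_gain = {name: 0 for name in datasets}
    while any(n < patience for n in epochs_without_gain.values()):
        for _ in range(EPOCH_SIZE // BATCH_SIZE):
            # one randomly drawn batch per dataset for each parameter update
            batches = {name: random.sample(data, min(BATCH_SIZE, len(data)))
                       for name, data in datasets.items()}
            state = train_step(batches)
        for name in datasets:
            acc = validate(name, state)
            if acc > best[name][0]:
                best[name] = (acc, state)   # save this dataset's best state
                epochs_without_gain[name] = 0
            else:
                epochs_without_gain[name] += 1
    return best
```

The per-dataset checkpoints ensure that even when the two tasks reach their best validation accuracy at different epochs, each ends up with its own optimal state.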

Sparse data scenario
The training sets in our experiments range from ca. 25,000 to 230,000 tokens. Generally, historical text normalization suffers from scarce resources, and our biggest datasets are considered huge compared to what scholars typically have access to. Creating gold-standard normalizations is cumbersome and expensive, and for many languages and historic periods, it is not feasible to obtain big datasets. Therefore, we also present experiments on reduced datasets; instead of taking the full training sets, we only use the first 5,000 tokens from each one.
In this case, for multi-task learning, we combine the small target datasets with the full auxiliary datasets. This procedure mimics a realistic scenario: if a researcher is interested in normalizing a language for which no manually normalized resource exists, they could conceivably create a small batch of manual normalizations for this language and then leverage an existing corpus in another language using multi-task learning.

Table 3: Normalization accuracy (in percent) using the full or sparse training sets, both for the single-task setup and the best-performing multi-task (MTL) setup.

Results
We evaluate our models using normalization accuracy, i.e., the percentage of correctly normalized word forms. Table 3 compares the accuracy scores of our single-task baseline models and our multi-task models, in both the full and the sparse data scenario. For multi-task learning, we report the test set performance of the best target-auxiliary task pair, as evaluated on development data. Figure 1 visualizes the results for all pairwise combinations of the multi-task models; here, we show the error reduction of multi-task learning over our single-task baseline to better highlight by how much the MTL setup changes the performance.
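Both evaluation quantities, word-level accuracy and the error reduction shown in Figure 1, amount to the following (a small self-contained sketch; the example values are illustrative):

```python
def accuracy(predictions, gold):
    """Word-level normalization accuracy: the fraction of word forms
    normalized exactly to their gold-standard form."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def error_reduction(acc_single, acc_mtl):
    """Relative reduction of the single-task error rate achieved by MTL;
    negative values mean MTL made things worse."""
    err_single, err_mtl = 1 - acc_single, 1 - acc_mtl
    return (err_single - err_mtl) / err_single

print(round(error_reduction(0.80, 0.90), 2))  # → 0.5
```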

Full datasets
We make two observations about the results for the full data scenario (the left side of Fig. 1): (i) the usefulness of multi-task learning depends more on the dataset being evaluated than on the one it is trained together with; and (ii) for most datasets, multi-task learning is detrimental rather than beneficial. One hypothesis about multi-task learning is that its usefulness correlates with either synergistic or complementary properties of the datasets. In other words, it is conceivable that performance on one dataset improves most in an MTL setup when it is paired with another dataset that is either (i) very similar, or (ii) provides an additional signal that is useful for, but not covered in, the first dataset. The results in Figure 1 show that there can indeed be considerable variation depending on the exact dataset combination; e.g., the error reduction on Slovene (Bohorič) ranges from 5% (when paired with the Gaj dataset) to 33.2% (when paired with Swedish). At the same time, the question of whether multi-task learning helps at all seems to depend mostly on the dataset being evaluated: with few exceptions, the error rate for a given dataset either consistently improves or consistently worsens, independently of the auxiliary task.
Considering the dataset statistics in Table 1, it appears that the size of the training corpus is the most important factor for these results. The four corpora that consistently benefit from MTL, namely German (RIDGES), Icelandic, Slovene (Bohorič), and Swedish, also have the smallest training sets, with about 50,000 tokens or less. For other tasks, different patterns have been observed (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017); see Sec. 5.

Sparse datasets
In the sparse data scenario where only 5,000 tokens are used for training (right side of Fig. 1), MTL almost always leads to improvements over the single-task training setup. This further confirms the hypothesis that multi-task learning is beneficial for historical text normalization when the target task dataset is small.

Related work and conclusion
There has been considerable work on multi-task sequence-to-sequence models for other tasks (Dong et al., 2015; Luong et al., 2015; Elliott and Kádár, 2017). There is a wide range of design questions and sharing strategies that we ignore here, focusing instead on under what circumstances the approach advocated by Bollmann et al. (2017) works.
Our main observation, that the size of the target dataset is most predictive of multi-task learning gains, runs counter to previous findings for other NLP tasks (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017). Martínez Alonso and Plank (2017) find that the label entropy of the auxiliary dataset is more predictive; Bingel and Søgaard (2017) find that the relative difference in the steepness of the two single-task loss curves is more predictive. Both papers consider sequence tagging problems with a small number of labels, and it is probably not a surprise that their findings do not seem to scale to the case of historical text normalization.