Semi-supervised Contextual Historical Text Normalization

Historical text normalization, the task of mapping historical word forms to their modern counterparts, has recently attracted a lot of interest (Bollmann, 2019; Tang et al., 2018; Lusetti et al., 2018; Bollmann et al., 2018; Robertson and Goldwater, 2018; Bollmann et al., 2017; Korchagina, 2017). Yet virtually all approaches suffer from two limitations: 1) they consider a fully supervised setup, often with impractically large manually normalized datasets; and 2) normalization happens on words in isolation. By utilizing a simple generative normalization model and obtaining powerful contextualization from the target-side language model, we train accurate models with unlabeled historical data. In realistic training scenarios, our approach often leads to a substantial reduction in the amount of manually normalized data required at the same accuracy levels.


Introduction
Text normalization is the task of mapping texts written in some non-standard variety of language L (a dialect or an earlier diachronic form) to some standardized form, typically the official modern standard variety of L (Table 1). Examples include the normalization of informal English-language tweets (Han and Baldwin, 2011); quasi-phonetic transcriptions of dialectal Swiss German (Samardžić et al., 2015); and historical documents such as religious texts in 15th-century Icelandic (Bollmann et al., 2011; Pettersson et al., 2013b; Ljubešić et al., 2016, inter alia).
Text can to a large extent be normalized by replacing non-standard words with their standard counterparts. Because of this often-made assumption, the task is also known as "lexical normalization" or "spelling normalization" (Han and Baldwin, 2011; Tang et al., 2018).
There has been a lot of interest in historical and dialectal text normalization over the past years. Earlier works approach type-level normalization as a search for standardized words (Pettersson et al., 2013a; Bollmann, 2012). More recently, the dominant approach casts the problem as probabilistic type-level character transduction. Most commonly, a fully supervised machine translation system transduces words in isolation (Bollmann, 2019). The use of context is limited to employing a target-side language model for improved, contextualized decoding (Etxeberria et al., 2016; Jurish, 2010).
In this paper, we develop simple approaches to semi-supervised contextualized text normalization. Using historical text normalization as our example, we show that one can reduce the amount of supervision by leveraging unlabeled historical text and utilizing context at training time. Our methods build on familiar techniques for semi-supervised learning, such as generative modeling and expectation-maximization, and unify previous work (search, noisy channel, contextualized decoding, neural character-level transduction) in a simple setup.
We experimentally validate the strength of our models on a suite of historical datasets. In addition to the token-level supervision scenario, we show the benefits of a more economical form of supervision: a word-type normalization dictionary.

Historical text normalization
Most normalization approaches attempt to learn a function from historical to modern word types without taking context into consideration. This is based on the observation that morpho-syntactic differences between the non-standard and standard varieties (e.g. in word order or grammatical case distinctions) are negligible and that normalization ambiguity is often not very high. Some earlier works cast text normalization as search over standardized word forms (Pettersson et al., 2013a; Bollmann, 2012). Hand-crafted rules or a string-distance metric (Levenshtein, 1966) with parameters estimated from the labeled data are used to retrieve the best-matching standard candidates.

Table 1: Historical text normalization. An excerpt from the RIDGES corpus of 15th-17th century German scientific writing (Odebrecht et al., 2017). Top = Early New High German, middle = Modern Standard German, bottom = English gloss.

    wermuttsafft in die ohren getropfft tödtet die würme darinnen
    wermutsaft in die ohren getropft tötet die würmer darin
    wormwood juice in the ears dripped kills the worms inside
Another line of work follows a principled probabilistic solution: the noisy channel model (Shannon, 1948), which consists of a channel p(x | y) and a language model p(y) (Jurish, 2010; Pettersson et al., 2013b; Samardžić et al., 2015; Etxeberria et al., 2016; Scherrer and Ljubešić, 2016). The channel model operates at the character level and takes the form of either a character alignment model (Brown et al., 1993) or a weighted finite-state transducer (WFST; Mohri, 1997). Channel model parameters are estimated from a manually normalized corpus. The language model is often trained on external target-side data. Some works perform normalization of words in context: Jurish (2010) and Etxeberria et al. (2016) decode sequences of historical word tokens by combining a character-level channel with a word-level language model p(y_1:m), and Scherrer and Ljubešić (2016) learn a character alignment model directly over untokenized segments of historical texts.
Numerous neural approaches to text normalization (Tang et al., 2018; Lusetti et al., 2018; Robertson and Goldwater, 2018; Bollmann et al., 2017; Korchagina, 2017) learn a discriminative model p(y | x), parameterized with some generic encoder-decoder neural network, that performs the traditional character-level transduction of isolated words. The models are trained in a supervised fashion on large amounts of manually labeled data. For example, Tang et al. (2018) train on tens of thousands of labeled pairs, including for varieties that share more than 75% of their vocabularies. Except for Lusetti et al. (2018), who use a target-side language model to rerank base-model hypotheses in context, no approach in this group uses context in any way.

The role of context
If non-standard language exhibits normalization ambiguity, one would expect contextualization to reduce it. For example, historical German "desz" in the RIDGES corpus (Odebrecht et al., 2017) normalizes to three modern word types: "das" and "des" (various forms of the definite article) and "dessen" (relative pronoun whose). Knowing the context (e.g. whether the historical word occurs clause-initially or before a neuter noun) would help normalize "desz" correctly. As has been suggested, the accuracy of an oracle that normalizes words in isolation by always selecting their most frequent normalization upper-bounds the accuracy of non-contextual systems.
Many historical normalization corpora do not exhibit high normalization ambiguity (Table 3). The upper bound on accuracy for non-contextual normalization is 97.0 on average (±0.02) and is above 92.4 for every historical language that we study here, indicating that lexical normalization is a very reasonable strategy.
Even if context may sometimes not be necessary for adequately solving the task in a fully supervised manner, we would expect contextualization to lead to more accurate unsupervised and semi-supervised generative models.

Contextualized generative model
We start off with a generative model in the form of a noisy channel over sequences of words. The channel model factorizes over non-standard words, and a non-standard word x_i depends only on the corresponding standardized word y_i. The simple structure of our model follows from the lexical normalization assumption:

    p(x_1:m, y_1:m) = p(y_1:m) ∏_{i=1}^m p(x_i | y_i)    (1)

Compared to a discriminative model, which would directly capture the mapping from non-standard word sequences x_1:m to standardized sequences y_1:m without having to account for how the non-standard data arise, this model offers some important advantages. First, it can be trained by maximizing the marginal likelihood p(x_1:m), which enables semi-supervised learning. Second, we can use a language model estimated from arbitrary external text.
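The factorization in Eq. 1 can be sketched as a scoring function in log space. The toy language model and channel scorers below are hypothetical stand-ins for the trained components, only meant to show how the two log-probabilities combine:

```python
def log_joint(x_words, y_words, channel, lm):
    """Log-probability under the generative model (Eq. 1):
    log p(y_1:m) + sum_i log p(x_i | y_i)."""
    assert len(x_words) == len(y_words)
    score = lm(y_words)  # target-side language model: log p(y_1:m)
    for x, y in zip(x_words, y_words):
        score += channel(x, y)  # channel: log p(x_i | y_i)
    return score

# Hypothetical toy scorers standing in for trained models:
toy_lm = lambda ys: -1.0 * len(ys)
toy_channel = lambda x, y: 0.0 if x == y else -2.0
```

Any candidate normalization y_1:m can be ranked by this joint score, which is what the decoding procedures in §5 maximize.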
The only model parameters are the parameters θ of the channel model p θ (x i | y i ). The parameters of the language model p(y 1:m ) are held fixed.

Neural channel
The channel model p(x i | y i ) stochastically maps standardized words to non-standard words. Any type-level normalization model from §2 can be applied here (in the reverse direction from the normalization task).
For our experiments, we use the neural transducer of Makarov and Clematide (2018), as it has been shown to perform strongly on morphological character transduction tasks. Parameterized with a recurrent encoder and decoder, the model defines a conditional distribution over edits

    p_θ(x, a | y) = ∏_{j=1}^{|a|} p_θ(a_j | a_1:j−1, y),

where y = y_1, ..., y_|y| is a standardized word as a character sequence and a = a_1, ..., a_|a| is an edit action sequence. Using this model as a channel requires computing the marginal likelihood p_θ(x | y), which is intractable due to the recurrent decoder. We approximate p_θ(x | y) by p_θ(x, a* | y), where a* is a minimum-cost edit action sequence from y to x. This works well in practice, as the network produces a highly peaked distribution with most probability mass placed on minimum-cost edit action sequences.
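A minimum-cost edit action sequence a* can be recovered from the standard Levenshtein dynamic program. The sketch below uses a simple copy/substitute/insert/delete action inventory as an illustration; the actual transducer's action set may differ:

```python
def min_edit_script(y, x):
    """One minimum-cost sequence of edit actions transducing y into x,
    recovered by backtracing the Levenshtein DP table. Such a script a*
    is used to approximate the channel marginal p(x | y) by p(x, a* | y)."""
    n, m = len(y), len(x)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (y[i - 1] != x[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # Backtrace one optimal action sequence.
    actions, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (y[i - 1] != x[j - 1]):
            actions.append(("COPY", y[i - 1]) if y[i - 1] == x[j - 1] else ("SUB", x[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            actions.append(("DEL", y[i - 1]))
            i -= 1
        else:
            actions.append(("INS", x[j - 1]))
            j -= 1
    return actions[::-1]
```

When several scripts attain the minimum cost, the backtrace above returns one of them; any minimum-cost script serves the approximation equally well.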

Language model
We consider two language model factorizations, which lead to different learning approaches.
Neural HMM. If the language model is an n-gram language model, the overall generative model has the form of an n-gram hidden Markov model (HMM), with transition probabilities given by the language model and emission probabilities by the channel. HMMs have been proposed for this problem before, but with different parameterizations (Jurish, 2010; Etxeberria et al., 2016). For simplicity, we use count-based language models in the experiments. A fully neural parameterization can be achieved with n-gram feedforward neural language models (Bengio et al., 2003).
RNN LM-based model. Our second language model is a word-level recurrent neural network language model (RNN-LM; Mikolov, 2012). It does not make any independence assumptions, which increases expressive power yet precludes exact inference in the generative model.

Expectation-maximization
Let U be a set of unlabeled non-standard sentences, V_x the set of non-standard word types in U, and V_y the vocabulary of the standardized variety. In the unsupervised case, we train by maximizing the marginal likelihood of U with respect to the channel parameters θ:

    θ* = argmax_θ ∑_{x_1:m ∈ U} log p_θ(x_1:m)    (2)

For an n-gram neural HMM, this can be solved using generalized expectation-maximization (GEM; Neal and Hinton, 1998; Berg-Kirkpatrick et al., 2010). We compute the E-step with the forward-backward algorithm. In the M-step, given the posterior p(y | x) for each non-standard word type x, we maximize the following objective with respect to θ using a variant of stochastic gradient ascent:

    ∑_{x ∈ V_x} ∑_{y} p(y | x) log p_θ(x | y)    (4)

GEM provably increases the marginal likelihood. We train the RNN LM-based model with hard expectation-maximization (hard EM; Samdani et al., 2012), a simple alternative to approximate inference. Hard EM is not guaranteed to increase the marginal likelihood but often works well in practice (Spitkovsky et al., 2010). The difference from GEM is the E-step. To compute it, we decode U with beam search. Let B = ∪_{x_1:m ∈ U} {y_1:m ∈ V_y^m | y_1:m is in the beam for x_1:m}. We set the posterior p(Y = y | X = x) to be proportional to the sum of the probabilities of sentence-wise normalizations from B in which x gets normalized as y.
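The hard-EM E-step described above can be sketched as follows: given, for each unlabeled sentence, the beam of normalization hypotheses with their probabilities, the type-level posterior p(y | x) is set proportional to the summed probability of hypotheses in which x is normalized as y. A minimal sketch, assuming hypotheses arrive as (sentence, probability) pairs:

```python
from collections import defaultdict

def hard_em_posteriors(decoded_beams):
    """Hard-EM E-step: decoded_beams is a list of
    (x_words, [(y_words, probability), ...]) pairs, one beam per
    unlabeled sentence. Returns the normalized type-level posterior
    p(y | x) over all (x, y) pairs seen in the beams."""
    scores = defaultdict(float)
    for x_words, beam in decoded_beams:
        for y_words, prob in beam:
            for x, y in zip(x_words, y_words):
                scores[(x, y)] += prob
    # Normalize per non-standard type x.
    totals = defaultdict(float)
    for (x, _), s in scores.items():
        totals[x] += s
    return {(x, y): s / totals[x] for (x, y), s in scores.items()}
```

The resulting posterior feeds directly into the M-step objective over the channel parameters.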
Semi-supervised training. We linearly combine the maximum likelihood (MLE) objective on the set S = {(x^(i), y^(i))}_{i=1}^n of labeled normalization pairs with the marginal likelihood of U (Eq. 3):

    ∑_{(x,y) ∈ S} log p_θ(x | y) + λ ∑_{x_1:m ∈ U} log p_θ(x_1:m)    (3)

λ ≥ 0 controls how much information from U flows into the parameter update. The difference from the unsupervised case is that the M-step maximizes Eq. 4 scaled by λ plus the MLE objective on S.
In practice, we initially pretrain the channel on the labeled data and then move to full semi-supervised training with some non-zero λ that stays fixed for the rest of training.

Proposal of standardized candidates
Candidate set heuristic. Performing EM with the full modern vocabulary V_y as the set of possible normalization candidates is vastly impractical: the forward-backward algorithm runs in O(m|V_y|^n) time. In related tasks, this has led to training heuristics such as iterative EM (Ravi and Knight, 2011). To keep this computation manageable, we propose generating a candidate set C(x) of k modern words for each non-standard word x. To this end, we use approximate nearest-neighbor search with edit distance (Hulden, 2009). The algorithm efficiently searches through an FST, which encodes a part of the vocabulary, with A* search. We encode different word-frequency bands of the vocabulary as separate FSTs, which we search in parallel. We rerank the candidates taking into account the non-standard and standardized words' relative frequencies (see Appendix). Thus, all summations and maximizations over V_y are performed over the reduced set C(x).
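Conceptually, the heuristic retrieves the k modern words closest to x in edit distance. The brute-force sketch below illustrates the same candidate proposal on a small vocabulary; it is only a stand-in for the efficient FST-based A* search of Hulden (2009) and omits the frequency-based reranking:

```python
def candidate_set(x, vocabulary, k=5):
    """Toy stand-in for the candidate heuristic: return the k words in
    `vocabulary` closest to x in Levenshtein distance (ties broken
    alphabetically for determinism)."""
    def dist(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]
    return sorted(vocabulary, key=lambda w: (dist(x, w), w))[:k]
```

For a realistic modern vocabulary this linear scan is far too slow, which is exactly why the FST encoding and per-frequency-band parallel search are needed.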
Our heuristic embodies normalization by search (§2) and could be replaced with a more informed search and reranking algorithm (Bollmann, 2012; Baron and Rayson, 2008).
Candidate generation with direct model. The candidate set heuristic is too restrictive. It is hard to achieve perfect coverage at manageable candidate set sizes (e.g. if x and the target y have no characters in common, as for historical German "sy" → "sie" (they)). Worse still, this approach fails completely if the target y does not appear in the corpus. This could be because the corpus is small (e.g. most Wikipedias); because rich morphology or orthographic conventions lead to a vast number of word types (e.g. Hungarian); or because the target word is not even attested in the standardized variety (e.g. "czuhant" → *"zehant" (immediately) in the Anselm historical German corpus (Krasselt et al., 2015)). We therefore also consider candidate generation with a direct model q_φ(y | x).

Algorithm 1: GEM training (§4.4). Input: unlabeled dataset U, labeled dataset S, development set, number of modern candidates k to generate, number of EM epochs K, mixture parameter λ combining the unsupervised and supervised objectives. In every epoch t: (E-step) for each non-standard word sequence x_1:m ∈ U, run forward-backward or beam search (§4.4) to compute each word's posterior p(Y_i | x_1:m), and aggregate the type-level posteriors p(y | x) for all x ∈ V_x and y ∈ C(x) (use uniform scores if t = 1 and S = ∅); (M-step) start training from θ^(t−1) using the posteriors p(y | x). Return the θ^(t) leading to the best accuracy on the development set. The full version uses restarts and candidate pruning (see Appendix).
We bootstrap a direct model from a contextualized generative model, fitting it by minimizing the cross-entropy of the direct model relative to the posterior p(y | x) of the generative model. For a semi-supervised generative model, this is combined with the MLE objective on the labeled set S (κ ≥ 0):

    ∑_{x ∈ V_x} ∑_{y} p(y | x) log q_φ(y | x) + κ ∑_{(x,y) ∈ S} log q_φ(y | x)

Any type-level normalization model from prior work could be used here. We choose the direct model to be a neural transducer, like the channel. It generates candidates using beam search.
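The bootstrapping objective can be sketched as a loss function: cross-entropy of the direct model against the generative posterior, plus κ times the MLE term on the labeled pairs. The function below is a schematic sketch; `q_logprob` stands in for the direct model's log-probability q_φ(y | x):

```python
import math

def distillation_loss(posterior, labeled, q_logprob, kappa=1.0):
    """Negative bootstrapping objective for the direct model (lower is
    better): cross-entropy against the generative posterior p(y | x)
    plus kappa times the negative log-likelihood on the labeled set S."""
    loss = 0.0
    for (x, y), p in posterior.items():
        loss -= p * q_logprob(y, x)  # distillation term
    for x, y in labeled:
        loss -= kappa * q_logprob(y, x)  # supervised MLE term
    return loss
```

In training, this loss would be minimized with respect to the direct model's parameters φ while the generative posterior is held fixed.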

Prediction and reranking
We consider two ways of sentence-wise decoding with our generative models. The first uses the maximum a posteriori (MAP) decision rule, which finds a normalization that maximizes the posterior p(y_1:m | x_1:m). Depending on the factorization of the language model, we solve this either exactly (with the Viterbi algorithm) or approximately (with beam search).
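For the bigram neural HMM, exact MAP decoding over the candidate sets is standard Viterbi. A minimal sketch, assuming per-word candidate sets and log-score callables for the channel and the bigram language model:

```python
def viterbi(x_words, candidates, log_channel, log_bigram, bos="<s>"):
    """Exact MAP decoding for a bigram HMM over candidate sets:
    argmax over y_1:m of sum_i [log p(y_i | y_{i-1}) + log p(x_i | y_i)]."""
    # chart[i] maps each candidate y_i to (best score, best previous y).
    chart = [{}]
    for y in candidates[x_words[0]]:
        chart[0][y] = (log_bigram(bos, y) + log_channel(x_words[0], y), None)
    for i in range(1, len(x_words)):
        chart.append({})
        for y in candidates[x_words[i]]:
            best = max(
                ((s + log_bigram(prev, y), prev) for prev, (s, _) in chart[i - 1].items()),
                key=lambda t: t[0])
            chart[i][y] = (best[0] + log_channel(x_words[i], y), best[1])
    # Backtrace the best path.
    y, (score, _) = max(chart[-1].items(), key=lambda t: t[1][0])
    path = [y]
    for i in range(len(x_words) - 1, 0, -1):
        path.append(chart[i][path[-1]][1])
    return path[::-1], score
```

With candidate sets C(x) of size k, this runs in O(m·k²) time per sentence, which is what makes the candidate restriction of §4.5 necessary.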
The other approach is to learn a reranker model on the development set. The reranker rescores sentence hypotheses ŷ_1:m generated by the base model (with k-best Viterbi or beam search). It uses rich non-local features (the hypothesis' scores under word- and character-level RNN language models) as well as the length-normalized base model score, the mean out-of-vocabulary rate, and the edit distance from x_1:m (see Appendix). We implement a PRO reranker (Hopkins and May, 2011) that uses Hamming loss.

Experiments
For our experiments, we use eight datasets compiled by various researchers from historical corpora (Pettersson, 2016, among others; Tables 2 and 3).[1] Seven languages are Indo-European: Germanic (English, German, Icelandic, and Swedish), Romance (Portuguese and Spanish), and Slavic (Slovene). Additionally, we experiment with Hungarian, a Finno-Ugric language. From the Slovene dataset, we use only the collection of the older and more challenging texts in the Bohorič alphabet.
The data span different genres (letters, religious and scientific writings). The earliest texts are in 14th-century Middle English. In many datasets, the proportion of identity normalizations is substantial: the smallest word overlap is in the Hungarian data (18%), the largest in English (75%).
All corpora are tokenized and aligned at the segment and token level. For some datasets, either the segments do not coincide with grammatical sentences, or the data have no segment boundaries at all (e.g. Hungarian or Icelandic). In such cases, to make the input amenable to training with context, we resort to sentence splitting on punctuation marks. We also split very long segments to enforce a maximum segment length of fifty words. Token alignment is largely one-to-one, with rare exceptions. Cliticization and set phrases (e.g. German "mu u" → "musst du" (you must), "aller handt" → "allerhand" (every kind of)) are common causes of many-to-one alignments, which our models fail to capture properly.

[1] The datasets are featured in the large-scale study of Bollmann (2019), who conveniently provides most data in a unified format at https://github.com/coastalcph/histnorm/.
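The preprocessing described above (splitting token streams on punctuation and capping segment length at fifty words) can be sketched as:

```python
def split_segments(tokens, max_len=50, punct=(".", "!", "?", ";", ":")):
    """Split a flat token stream into training segments: break after
    punctuation and enforce a maximum segment length (a sketch of the
    preprocessing used for corpora without segment boundaries)."""
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in punct or len(current) >= max_len:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

The exact punctuation inventory is an assumption here; what matters is that every resulting segment is short enough for forward-backward decoding.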
State-of-the-art. We compare our models to the strongest models for historical text normalization:
• the Norma tool (Bollmann, 2012), which implements search over standardized candidates; and
• the character-level statistical machine translation model (cSMT; Ljubešić et al., 2016), which uses the Moses toolkit (Koehn et al., 2007). This approach estimates a character n-gram language model on external data and fits a MERT reranker (Och, 2003) on the development set.
According to Bollmann (2019), Norma performs best in the low-resource setting (≤ 500 labeled tokens), and cSMT should be preferred in all other data conditions. Norma's strong performance in the low-resource scenario derives from the fact that searching for candidates can be fairly easy for some languages, e.g. English. The reranker trained on the development set is key to cSMT's strength.
Realistic low-resource setting. Our contextualized models are particularly appealing when labeled data are limited to at most a couple of thousand annotated word pairs. This would be the most common application scenario in practice, and approaches requiring tens of thousands of training samples would be ruled out as unrealistic. We therefore experiment with small labeled training set sizes n ranging from 500 to 5K. Additionally, we consider the unsupervised scenario (n = 0), which might be less relevant practically (even a small amount of labeled data might lead to substantial improvement) but allows us to demonstrate the advantage of our approach most directly.

[Table 2, fragment: ... Vaamonde (2017); hu 1440-1541, 134.0, 16.7, Simon (2014); sl 1750-1840s, 50.0, 5.8, Erjavec (2012); sv 1527-1812, 24.5, 2.2, Fiebranz et al. (2011)]
To keep the experiments close to a real-life application scenario (Kann et al., 2019), we additionally cap the size of the development set at 2K tokens. Otherwise, we require that the development set contain 500 tokens more than the labeled set S, to ensure that we validate on not too small a number of word types (e.g. at 1K tokens, we get only about 600 word types on average).
Finally, the unlabeled set U comprises the non-standard word sequences from all the remaining non-test data. Our sampled development sets are much smaller than the original development set from the official data split. So as not to waste any data, we also include the historical side of the rest of the original development set in the unlabeled set U. The labeled training set S is sampled uniformly at random from U together with its targets.
Semi-supervised training with a type-level normalization dictionary. Supervision by a type-level dictionary (as opposed to token-level annotations) is a simple and effective way of reducing the amount of manually labeled data (Garrette et al., 2013). We simulate type-level normalization dictionary construction by selecting the d most frequent non-standard word types from the original training set. We build a labeled set S by pairing each of them with the most frequent standard word type that it normalizes to. We experiment on German and Slovene, using a development set of 500 tokens.
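The simulated dictionary construction can be sketched directly from the token-level training pairs:

```python
from collections import Counter

def build_type_dictionary(token_pairs, d):
    """Simulated type-level supervision: select the d most frequent
    non-standard types and pair each with its most frequent
    normalization in the token-level training data."""
    type_freq = Counter(x for x, _ in token_pairs)
    norm_freq = Counter(token_pairs)
    dictionary = {}
    for x, _ in type_freq.most_common(d):
        # Most frequent normalization of x (ties broken alphabetically).
        best = max((count, y) for (xx, y), count in norm_freq.items() if xx == x)
        dictionary[x] = best[1]
    return dictionary
```

Each dictionary entry then contributes one (x, y) pair to the labeled set S, regardless of how many tokens of x the corpus contains.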
Experimental setup. We use Wikipedia dumps for training language models and for the candidate set heuristic. For the neural HMM, we fit count-based bigram language models using KenLM (Heafield et al., 2013). All RNN language models are parameterized with a long short-term memory cell (Hochreiter and Schmidhuber, 1997) and use dropout regularization (Zaremba et al., 2014). The HMMs use candidate sets C(x) of 150 candidates; the RNN LM-based models use 50 candidates.
We set the beam size of the RNN LM-based models to four, both for final decoding and for the E-step. For reranking, the base HMMs output the 150-best sentence hypotheses and the RNN LM-based models output the beam. The reranker models are trained with the perceptron algorithm.
The direct models are trained with AdaDelta. We decode them with beam search and rerank the beam with a PRO reranker using the channel and direct model scores and relative frequency as features. We use the top two reranked candidates as the new candidate set. We refer the reader to the Appendix for further details on training.
We train Norma and cSMT on our data splits using the training settings suggested by the authors.

Discussion
The semi-supervised contextualized approach results in consistent improvements for most languages and labeled data sizes (Tables 4 and 5). Compared to cSMT, the average error reduction ranges from 19% (n = 500) to almost 3% (n = 5K), or 8% excluding Hungarian, the language on which the models perform worst. Reranking provides an important boost (an almost 5% error reduction over the base model, and almost 8% for the neural HMMs), and bootstrapping direct-model candidates results in even better performance (an almost 14% error reduction).
Unsupervised case. Remarkably, with no labeled training data (and only a 500-token labeled development set), the best configuration achieves 88.4% of the top scores reported for fully supervised models (Table 2 of Bollmann (2019)). It outperforms the Norma baseline trained on n = 1K labeled samples, reducing its error by almost 4%.
Effects of unlabeled dataset size. We typically see strong performance for languages where the unlabeled dataset U is large (approximately the official training and development sets together, Table 2). This includes English, which shows little ambiguity (Table 3) and so would be expected to profit less from contextualization.
Effects of the modern corpus and preprocessing. The size and coverage of the Wikipedia dump (Table 3) for Icelandic and particularly Hungarian degrade the models' performance and are likely the key reason why cSMT outperforms all contextual models for Hungarian as the labeled dataset grows (n = 2.5K and n = 5K), despite the large amount of unlabeled Hungarian text. The RNN LM-based models are hit worst due to the poorest coverage. The lack of original segment boundaries (Table 3; Icelandic is only partially segmented) further degrades performance. Remarkably, the overall approach works despite the language models and candidate sets relying on out-of-domain standardized data. Leveraging in-domain data, such as collections of literary works from the time period of the source historical text, could lead to even better performance (cf. Berg-Kirkpatrick et al., 2013).

Candidate generation with direct model. Generating candidates with the direct model leads to large gains for languages with poor coverage (the Icelandic and Hungarian RNN LM-based models see average error reductions of over 20% and 14%, respectively). At larger labeled dataset sizes (Table 5), bootstrapping a direct model and reranking its output without context becomes an effective strategy (Icelandic, Portuguese).
Normalization ambiguity. We would expect languages with higher normalization ambiguity to profit more from contextualization. German, Portuguese, and Spanish gain even in the most competitive semi-supervised 5K condition, consistent with the amount of ambiguity they exhibit (Table 3). Losses and modest gains are observed for the languages with the lowest ambiguity rates (Slovene, Swedish). We also look separately at the accuracies on unambiguous and ambiguous normalizations (Figure 1). The contextual model consistently outperforms cSMT on ambiguous tokens, often by a wide margin and even when cSMT is better overall (Slovene). An extreme case is German at n = 5K, where the two approaches perform similarly on unambiguous tokens, yet cSMT suffers considerably on ambiguous ones (a 38% error reduction by the neural HMM). German ranks second by normalization ambiguity (Table 3).
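The ambiguity criterion used in this analysis (a historical type is ambiguous if it normalizes to more than one standard type in the training data) can be sketched as:

```python
from collections import defaultdict

def ambiguous_types(token_pairs):
    """Return the set of historical word types that map to more than one
    standard word type in the (historical, modern) training tokens."""
    norms = defaultdict(set)
    for x, y in token_pairs:
        norms[x].add(y)
    return {x for x, ys in norms.items() if len(ys) > 1}
```

Token-level accuracy can then be reported separately on tokens whose type falls inside versus outside this set.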
Type-level normalization dictionary. We observe gains equivalent to using a token-level dataset of at least double the dictionary size (Table 6). Slovene profits a lot from dictionary supervision, with the 1K-type model performing close to the 5K-token model.

Shortcomings of the approach. The general problem of our approach, as with most of the approaches we build on, is the reliance on gold tokenization. Overall, we have faced only minor issues with tokenization (one notable example is Swedish, where 0.6% of the target-side test data are words with a colon for which we fail to retrieve candidates from Wikipedia). Tokenization remains a challenge for the normalization of unpreprocessed non-standard data (Berg-Kirkpatrick et al., 2013).

Figure 1: Test set performance breakdown by unambiguous and ambiguous tokens in the n = 1K (top) and n = 5K (bottom) semi-supervised conditions. Comparisons are between the neural HMM (+direct+rerank, green) and cSMT (Ljubešić et al., 2016, violet). Ambiguity (whether a historical word normalizes to more than one standard word type) is computed on the official training data.

Future work
Clearly, one can use both methods of candidate generation (§4.5) simultaneously. We leave it for future work to verify whether this leads to improved performance.
Computing the posterior p(y | x) in both generative models is hard, which is why we are forced to reduce the number of admissible candidates y and, in the case of the RNN LM-based model, approximate the posterior with maximization. This problem can be addressed in a principled way by using variational inference (Jordan et al., 1999), a framework for approximate inference that deals with intractable distributions. We leave it for future work to validate its effectiveness for this problem.
As noted earlier, it is a simplification to assume that non-standard text is tokenized. Being able to normalize across token boundaries (by merging multiple non-standard tokens or splitting one into multiple standardized ones) is crucial for tackling real-world text normalization tasks and related problems such as optical character recognition error correction. An appealing direction for future work would be developing a joint model for text tokenization and normalization. One family of latent-variable models that would be suitable for this task are segmental recurrent neural networks (SRNNs, Kong et al., 2016). SRNNs explicitly model input segmentation and have been successfully applied to online handwriting recognition, Chinese word segmentation, joint word segmentation and part-of-speech tagging (Kong et al., 2016;Kawakami et al., 2019).

Conclusion
This paper proposes semi-supervised contextual normalization of non-standard text. We focus on historical data, which have gained attention in the digital humanities community over the past years. We develop simple contextualized generative neural models that we train with expectation-maximization. By leveraging unlabeled data and accessing context at training time, we train accurate models with fewer manually normalized training samples. No labeled training data are necessary to achieve 88.4% of the best published performance that uses full training sets. Strong gains are observed for most of the considered languages across realistic low-resource settings (up to 5K labeled training tokens). The techniques developed here readily apply to other types of normalization data (e.g. informal, dialectal). We will make our implementation publicly available.[2]

Appendix

Candidate set heuristic. […] modern candidates for frequent historical words. Using the smallest supervised development set (500 tokens), required by all our experiments, we compute coverage over all languages and set the log base for squeezing the frequency ratio to 200. Finally, a penalty for rare modern forms based on their frequency is added. The score for reranking the candidate list is:

    s_{h,m} = ED(h, m) + max(log_200(f_{h,m}), 0) + 1/#(m)

Neural channel training. In every M-step (the maximization of Eq. 4), we start training from the previous EM iteration's best parameters θ^(t−1) and train for 25 epochs with 15 epochs of patience. We optimize the parameters with AdaDelta (Zeiler, 2012) using mini-batches of size 20. We do not update on candidates whose posterior probability is below ε = 0.01 · λ^(−1). If the development set score ∑_{(x,y) ∈ dev} p_θ(x | y) does not increase compared to the previous EM iteration, we restart training from randomly initialized parameters, using the type-level posterior probabilities from the best generative model found so far.
That model, decoded with MAP decoding, is the one that has so far produced the highest normalization accuracy on the development set. In the semi-supervised scenario, we initially pretrain the channel for 50 epochs with 15 epochs of patience using mini-batches of size 1, as suggested by Makarov and Clematide (2018).
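The appendix candidate score s_{h,m} = ED(h, m) + max(log_200(f_{h,m}), 0) + 1/#(m) can be sketched as a function over precomputed statistics; here f_{h,m} denotes the historical/modern frequency ratio and #(m) the modern form's corpus count (lower scores rank higher):

```python
import math

def rerank_score(edit_dist, freq_ratio, modern_count, log_base=200):
    """Candidate reranking score s_{h,m}: edit distance, plus a squeezed
    frequency-ratio term (clipped at zero), plus a rarity penalty 1/#(m)."""
    return edit_dist + max(math.log(freq_ratio, log_base), 0.0) + 1.0 / modern_count
```

How the frequency statistics are computed from the Wikipedia dump is an assumption of this sketch; the score's shape follows the appendix formula.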
Sentence-wise reranker. Table 7 shows the features used in the sentence-wise PRO reranker (Hopkins and May, 2011). We learn the reranker parameters on the development set using perceptron as our binary classification learning algorithm (we also experimented with different losses and a stochastic gradient learner from the sklearn library (Pedregosa et al., 2011), but this did not produce any gains).
Direct model training. We optimize the direct models with AdaDelta using mini-batches of size 10. We train for 60 epochs with 15 epochs of patience and decode with beam search with a beam width of eight. We learn a PRO reranker on the development set using hypotheses from the beam. To represent hypotheses, we use features such as the direct model probability of the hypothesis q_φ(ŷ | x), its channel model probability p_θ(x | ŷ), its unigram probability, the relative frequency of the (historical word, hypothesis) pair in the training data, and the edit distance between the hypothesis and the historical input word (Table 8). We rank hypotheses with a combination of normalized edit distance (NED) and accuracy:

    ∆(y, ŷ) = 1{y = ŷ} − NED(y, ŷ)    (7)

Thus, a hypothesis ŷ attains the highest score of +1 if it is identical to the target y of a development set sample, and the lowest score of −1 if the number of edits from ŷ to y equals the maximum of their lengths.

Table 7: Features used in the sentence-wise reranker: p_WORD-RNN-LM(ŷ_1:m)/m; p_CHAR-RNN-LM(ŷ_1:m)/m; p_WORD-TRIGRAM-LM(ŷ_1:m)/m; length m; p(x_1:m, ŷ_1:m); […]

Table 8: Features used to rerank hypotheses generated from the direct model, where x is a historical word and ŷ a modern-language hypothesis: p_TRAIN(x, ŷ); (x, ŷ) ∈ TRAIN?; same-prefix_k(x, ŷ)?; same-suffix_k(x, ŷ)?; subsequence(x, ŷ)?; subsequence(ŷ, x)?
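The ranking score ∆(y, ŷ) of Eq. 7, exact-match accuracy minus normalized edit distance, can be sketched as:

```python
def delta(y, y_hat):
    """Hypothesis ranking score (Eq. 7): 1{y = y_hat} - NED(y, y_hat),
    bounded in [-1, +1]."""
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]
    ned = edit_distance(y, y_hat) / max(len(y), len(y_hat))
    return float(y == y_hat) - ned
```

Ranking beam hypotheses by this score gives the PRO reranker a target ordering that rewards exact matches while still preferring near-misses over distant hypotheses.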