Learning to Capitalize with Character-Level Recurrent Neural Networks: An Empirical Study

In this paper, we investigate case restoration for text without case information. Previous such work operates at the word level. We propose an approach using character-level recurrent neural networks (RNN), which performs competitively compared to language modeling and conditional random fields (CRF) approaches. We further provide quantitative and qualitative analysis of how RNN helps improve truecasing.


Introduction
Natural language texts (e.g., automatic speech transcripts or social media data) often come in nonstandard forms, and normalization would typically improve the performance of downstream natural language processing (NLP) applications. This paper investigates a particular sub-task in text normalization: case restoration or truecasing. Truecasing refers to the task of restoring case information (uppercase or lowercase) of characters in a text corpus. Case information is important for certain NLP tasks. For example, Chieu and Ng (2002) used unlabeled mixed case text to improve named entity recognition (NER) on uppercase text.
The task often presents ambiguity: consider the word "apple" in the sentences "he bought an apple" and "he works at apple". While the former refers to a fruit (hence, it should be in lowercase), the latter refers to a company name (hence, it should be capitalized). Moreover, we often need to recover the case information for words that are previously unseen by the system.
In this paper, we propose the use of character-level recurrent neural networks for truecasing. Previous approaches to truecasing operate at the word level, assigning each word one of the following labels: all lowercase, all uppercase, initial capital, and mixed case. For mixed case words, an additional effort has to be made to decipher exactly how the case is mixed (e.g., MacKenzie). In contrast, we propose a generative, character-based recurrent neural network (RNN) model, which predicts exactly how cases are mixed in such words.
Our main contributions are: (i) we show that character-level approaches are viable compared to word-level approaches, (ii) we show that character-level RNN performs competitively with character-level CRF, and (iii) we provide quantitative and qualitative analysis of how RNN helps improve truecasing.

Related Work
Word-based truecasing The most widely used approach works at the word level. The simplest approach converts each word to its most frequently seen form in the training data. One popular approach uses HMM-based tagging with an N-gram language model, as in (Lita et al., 2003; Nebhi et al., 2015). Others used a discriminative tagger, such as MEMM (Chelba and Acero, 2006) or CRF (Wang et al., 2006). Another approach uses statistical machine translation to translate uncased text into cased text. Interestingly, no previous work operated at the character level. Nebhi et al. (2015) investigated truecasing in tweets, where truecased corpora are less available.
Recurrent neural networks Recent years have seen a resurgence of interest in RNN, particularly variants with long short-term memory (Hochreiter and Schmidhuber, 1997) or gated recurrent units (Cho et al., 2014). RNN has shown impressive performance in various NLP tasks, such as machine translation (Cho et al., 2014; Luong et al., 2015), language modeling (Mikolov et al., 2010; Kim et al., 2016), and constituency parsing (Vinyals et al., 2015). Nonetheless, the mechanisms behind these successful applications are rarely studied. In this work, we take a closer look at our trained model to interpret its internal mechanism.

The Truecasing Systems
In this section, we describe the truecasing systems that we develop for our empirical study.

Word-Level Approach
A word-level approach truecases one word at a time.
The first system is an HMM-based tagger (Stolcke, 2002) that translates an uncased word sequence into the corresponding cased sequence. An N-gram language model trained on a cased corpus is used for scoring candidate sequences. For decoding, the Viterbi algorithm (Rabiner, 1989) computes the highest-scoring sequence.
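As a minimal sketch of this decoding step, the following simplifies to a bigram model and uses a hypothetical `candidates` gazetteer that maps each lowercased word to its possible cased forms; it is illustrative only, not the SRILM-based system we use:

```python
def viterbi_truecase(words, candidates, bigram_logp):
    """Recover the highest-scoring cased sequence under a bigram LM.

    words: lowercased input tokens.
    candidates: dict mapping a lowercased word to its possible cased
        forms (a hypothetical gazetteer collected from training data).
    bigram_logp(prev, cur): log P(cur | prev); '<s>' marks sentence start.
    """
    # One column of (score, backpointer) entries per word position.
    prev_col = {"<s>": (0.0, None)}
    cols = []
    for w in words:
        col = {}
        for form in candidates.get(w, [w]):
            col[form] = max(
                ((s + bigram_logp(p, form), p) for p, (s, _) in prev_col.items()),
                key=lambda x: x[0],
            )
        cols.append(col)
        prev_col = col
    # Follow backpointers from the best final state.
    form = max(prev_col, key=lambda f: prev_col[f][0])
    out = []
    for col in reversed(cols):
        out.append(form)
        form = col[form][1]
    return list(reversed(out))
```

A full implementation would use larger N-grams and beam pruning, but the dynamic program is the same.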
The second approach is a discriminative classifier based on a linear-chain CRF (Lafferty et al., 2001). In this approach, truecasing is treated as a sequence labeling task, labeling each word with one of the following labels: all lowercase, all uppercase, initial capital, and mixed case. For our experiments, we used the truecaser in Stanford's NLP pipeline (Manning et al., 2014). Their model includes a rich set of features (Finkel et al., 2005), such as surrounding words, character N-grams, and word shape.
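Applying a predicted label to a lowercased word is straightforward except for the mixed case label, which needs an external gazetteer. A small illustrative helper (label names are our own, not the tagger's):

```python
def apply_label(word, label, mixed_case_map=None):
    """Apply a word-level truecasing label to a lowercased word.

    mixed_case_map is a hypothetical gazetteer needed only for the
    MIXED label (e.g. {"mackenzie": "MacKenzie"}).
    """
    if label == "LOWER":
        return word
    if label == "UPPER":
        return word.upper()
    if label == "INIT_CAP":
        return word[:1].upper() + word[1:]
    if label == "MIXED":
        # Fall back to the input form when the gazetteer has no entry.
        return (mixed_case_map or {}).get(word, word)
    raise ValueError(f"unknown label: {label}")
```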
Dealing with mixed case Both approaches require a separate treatment for mixed case words. In particular, we need a gazetteer that maps each word to its mixed case form, either manually created or statistically collected from the training data. This motivates the character-level approach: instead of treating mixed case words as a special case, we train our model to capitalize a word character by character.

Character-Level Approach
A character-level approach converts each character to either uppercase or lowercase. In this approach, mixed case forms are naturally taken care of, and moreover, such models would generalize better to unseen words. Our third system is a linear chain CRF that makes character-level predictions. Similar to the word-based CRF, it includes surrounding words and character N-grams as features.
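As an illustration, a feature extractor of this kind might look as follows; the feature names and the `char_to_word` index are our own illustrative choices, not CRFsuite's conventions:

```python
def char_features(chars, words, char_to_word, t, max_n=3):
    """Character n-gram and word context features at position t.

    chars: the lowercased character sequence; words: its tokenization;
    char_to_word: index of the word containing each character.
    """
    feats = {}
    # All character n-grams (n <= max_n) that contain position t.
    for n in range(1, max_n + 1):
        for start in range(t - n + 1, t + 1):
            if 0 <= start and start + n <= len(chars):
                feats[f"c[{start - t}:{start - t + n}]"] = "".join(chars[start:start + n])
    # The current word and its immediate neighbors.
    wi = char_to_word[t]
    for off in (-1, 0, 1):
        if 0 <= wi + off < len(words):
            feats[f"w[{off}]"] = words[wi + off]
    # Sentence boundary indicators.
    feats["bos"] = (t == 0)
    feats["eos"] = (t == len(chars) - 1)
    return feats
```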
Finally, we propose a character-level approach using an RNN language model. RNN is particularly useful for modeling sequential data. At each time step t, it takes an input vector x_t and the previous hidden state h_{t−1}, and produces the next hidden state h_t. Different recurrence formulations lead to different RNN models, which we describe below.
Long short-term memory (LSTM) is an architecture proposed by Hochreiter and Schmidhuber (1997). It augments an RNN with a memory cell vector c_t in order to address learning long-range dependencies. The content of the memory cell is updated additively, mitigating the vanishing gradient problem in vanilla RNNs (Bengio et al., 1994). Read, write, and reset operations to the memory cell are controlled by the input gate i, output gate o, and forget gate f. The hidden state is computed as:

i_t = σ(W_i x_t + U_i h_{t−1})
o_t = σ(W_o x_t + U_o h_{t−1})
f_t = σ(W_f x_t + U_f h_{t−1})
g_t = tanh(W_g x_t + U_g h_{t−1})
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where σ and tanh are element-wise sigmoid and hyperbolic tangent functions, ⊙ denotes element-wise multiplication, and W_j and U_j are parameters of the LSTM for j ∈ {i, o, f, g}.

Gated recurrent unit (GRU) is a gating mechanism in RNN that was introduced by Cho et al. (2014). They proposed a hidden state computation with reset and update gates r_t and z_t, resulting in a simpler LSTM variant:

r_t = σ(W_r x_t + U_r h_{t−1})
z_t = σ(W_z x_t + U_z h_{t−1})
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}))
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

Table 2: Truecasing performance in terms of precision (P), recall (R), and F1. All improvements of the best performing character-based systems (bold) over the best performing word-based systems (underlined) are statistically significant using the sign test (p < 0.01). All improvements of the best performing RNN systems (italicized) over CRF-CHAR are statistically significant using the sign test (p < 0.01).
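A single LSTM step can be sketched directly in NumPy (biases omitted for brevity; the dictionary keys for the gate parameters are illustrative):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U):
    """One LSTM time step.

    W[j] and U[j] for j in {"i", "o", "f", "g"} are the input and
    recurrent weight matrices of each gate.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev)   # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev)   # output gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev)   # forget gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev)   # candidate update
    c = f * c_prev + i * g                      # additive memory update
    h = o * np.tanh(c)                          # new hidden state
    return h, c
```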
At each time step, the conditional probability distribution over next characters is computed by a linear projection of h_t followed by a softmax:

P(x_{t+1} = k | x_{1:t}) = exp(w_k h_t) / Σ_{k'} exp(w_{k'} h_t)

where w_k is the k-th row vector of a weight matrix W. The probability of a sequence of characters x_{1:T} is then:

P(x_{1:T}) = P(x_1) ∏_{t=1}^{T−1} P(x_{t+1} | x_{1:t})

Similar to the N-gram language modeling approach described previously, we decode the most probable cased sequence by maximizing this sequence probability. Instead of Viterbi decoding, we approximate the search with a beam search.
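A minimal sketch of this beam search, with a stand-in `char_logp` scoring function in place of the RNN's softmax output:

```python
def beam_search_truecase(text, char_logp, beam_size=10):
    """Approximate arg-max casing of text under a character LM.

    char_logp(prefix, ch) -> log P(ch | prefix); each alphabetic
    character may be emitted in lower or upper case, and all other
    characters are copied through unchanged.
    """
    beams = [("", 0.0)]  # (cased prefix, cumulative log-prob)
    for ch in text:
        options = {ch.lower(), ch.upper()} if ch.isalpha() else {ch}
        expanded = [
            (prefix + o, score + char_logp(prefix, o))
            for prefix, score in beams
            for o in options
        ]
        # Keep only the beam_size highest-scoring hypotheses.
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_size]
    return beams[0][0]
```

In the real system the prefix score is carried by the RNN hidden state rather than recomputed from the string.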

Datasets and Tools
Our approach is evaluated on English and German datasets. For English, we use a Wikipedia corpus (Coster and Kauchak, 2011), the WSJ corpus (Paul and Baker, 1992), and the Reuters corpus from the CoNLL-2003 shared task on named entity recognition (Tjong Kim Sang and De Meulder, 2003). For German, we use the ECI Multilingual Text Corpus from the same shared task. Each corpus is tokenized, and news headlines, which are all in uppercase, are discarded. The input test data is lowercased. Table 1 shows the statistics of each corpus split into training, development, and test sets. We use SRILM (Stolcke, 2002) for N-gram language model training (N ∈ {3, 5}) and HMM decoding. The word-based CRF models are trained using the CRF implementation in Stanford's CoreNLP 3.6.0 (Finkel et al., 2005), with the recommended configuration for training the truecaser. We use CRFsuite version 0.12 (Okazaki, 2007) to train the character-based CRF model. Our feature set includes character N-grams (N ∈ {1, 2, 3}) and word N-grams (N ∈ {1, 2}) surrounding the current character. We tune the ℓ2 regularization parameter λ using a grid search over λ ∈ {0.01, 0.1, 1, 10}.
We use an open-source character RNN implementation. We train a SMALL model with 2 layers and 300 hidden nodes, and a LARGE model with 3 layers and 700 hidden nodes. We also vary the hidden unit type (LSTM/GRU). The network is trained using truncated backpropagation through time for 50 time steps. We use mini-batch stochastic gradient descent with batch size 100 and the RMSprop update (Tieleman and Hinton, 2012). We use dropout regularization (Srivastava et al., 2014) with probability 0.25. We choose the model with the smallest validation loss after 30 epochs. For decoding, we set the beam size to 10. The experimental settings are reported in more depth in the supplementary materials. Our system and code are publicly available at http://statnlp.org/research/ta/.

Results
Table 2 shows the experiment results in terms of precision, recall, and F1. Most previous work did not evaluate their approaches on the same dataset. We compare our work to Chelba and Acero (2006) using the same WSJ sections for training and evaluation on the 2M-word training data. Chelba and Acero only reported error rate, and all our RNN and CRF approaches outperform their results in terms of error rate.
First, the word-based CRF approach gives up to 8% relative F1 improvement over the LM approach. Except on WSJ, moving to the character level further improves CRF by 1.1-3.7%, most notably on the German dataset. Long compound nouns are common in German, which generates many out-of-vocabulary words; we thus hypothesize that the character-based approach generalizes better. Finally, the best F1 score on each dataset is achieved by an RNN variant: 93.19% on EN-Wiki, 92.43% on EN-WSJ, 93.79% on EN-Reuters, and 98.01% on DE-ECI.
We highlight that different features are used in CRF-WORD and CRF-CHAR. CRF-CHAR includes only simple features, namely character and word N-grams and sentence boundary indicators. In contrast, CRF-WORD uses the richer feature set predefined in Stanford's truecaser. For instance, it includes word shape in addition to neighboring words and character N-grams, as well as more feature combinations, such as the concatenation of word shape, current label, and previous label. Nonetheless, CRF-CHAR generally performs better than CRF-WORD. CRF-CHAR could potentially be improved further with larger N-grams; we chose simple features to optimize training speed, which allowed us to dedicate more time to tuning the regularization weight.
Training a larger RNN model generally improves performance, although not always, possibly due to overfitting. LSTM seems to work better than GRU on this task, even though the GRU models have 25% fewer parameters. In terms of training time, it took 12 hours to train the largest RNN model on a single Titan X GPU. For comparison, the longest training time for a single CRF-CHAR model is 16 hours, while training LM and CRF-WORD is much faster: 30 seconds and 5.5 hours, respectively. There is thus a speed-accuracy trade-off.

Visualizing LSTM Cells
An interesting component of LSTM is its memory cells, which are supposed to store long-range dependency information. Many of these memory cells are not human-interpretable, but after introspecting our trained model, we find a few memory cells that are sensitive to case information. In Figure 1, we plot the memory cell activations at each time step (i.e., tanh(c_t)). We can see that these cells activate differently depending on the case information of a word (towards -1 for uppercase and +1 for lowercase).

Case Category and OOV Performance
In this section, we analyze the system performance on each case category. First, we report the percentage distribution of the case categories in each test set in Table 3. For both languages, the most frequent case category is lowercase, followed by capitalization, which generally applies to the first word in the sentence and to proper nouns. The uppercase form, which is often found in abbreviations, occurs more frequently than mixed case in English, but the other way around in German.
Figure 2 (a) shows system accuracy on mixed case words. We choose the best performing LM and RNN for each dataset. Character-based approaches handle mixed case words better than word-based approaches, and RNN generally performs better than CRF. In CRF-WORD, surface forms are generated only after label prediction, which is more rigid than the LM approach, where surface forms are considered during decoding.
In addition, we report system accuracy on capitalized words (first letter uppercase) and uppercase words in Figure 2 (b) and (c), respectively. RNN performs best on capitalized words, while CRF-WORD performs best on uppercase words. We believe this is related to the rarity of uppercase words during training, as shown in Table 3; although mixed case words are even rarer in general, they offer strong clues, such as character prefixes. CRF-CHAR and RNN have comparable performance on uppercase words: for instance, there are only 2 uppercase words in WSJ that CRF-CHAR and RNN predicted differently. All systems perform equally well (∼99% accuracy) on lowercase. Overall, RNN has the best performance.
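For reference, the four case categories used throughout this analysis can be assigned with a simple helper (a sketch; the label names are illustrative):

```python
def case_category(word):
    """Assign a word to one of the four case categories."""
    if word.islower():
        return "LOWER"
    if word.isupper():
        return "UPPER"      # e.g. abbreviations such as "IBM"
    if word == word.capitalize():
        return "INIT_CAP"   # first letter uppercase, rest lowercase
    return "MIXED"          # e.g. "MacKenzie"
```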
Last, we present results on out-of-vocabulary (OOV) words with respect to the training set. The statistics of OOV words are given in Table 3. The system performance across datasets is reported in Figure 2 (d). We observe that RNN consistently performs better than the other systems, which shows that it generalizes better to unseen words.

Conclusion
In this work, we conduct an empirical investigation of truecasing approaches. We have shown that character-level approaches work well for truecasing, and that RNN performs competitively compared to language modeling and CRF. Future work includes applications in informal texts, such as tweets and short messages (Muis and Lu, 2016).