Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models

Text normalization is an important enabling technology for several NLP tasks. Recently, neural-network-based approaches have outperformed well-established models in this task. However, in languages other than English, there has been little exploration in this direction. Both the scarcity of annotated data and the complexity of the language increase the difficulty of the problem. To address these challenges, we use a sequence-to-sequence model with character-based attention, which in addition to its self-learned character embeddings, uses word embeddings pre-trained with an approach that also models subword information. This provides the neural model with access to more linguistic information especially suitable for text normalization, without large parallel corpora. We show that providing the model with word-level features bridges the gap for the neural network approach to achieve a state-of-the-art F1 score on a standard Arabic language correction shared task dataset.


Introduction
Text normalization systems have many potential applications, from assisting native speakers and language learners with their writing to supporting NLP applications with sparsity reduction by cleaning large textual corpora. This can help improve results across many NLP tasks.
In recent years, neural encoder-decoder models have shown promising results in language tasks like translation, part-of-speech tagging, and text normalization, especially with the use of an attention mechanism. In text normalization, however, previous state-of-the-art results rely on developing many other pipelines on top of the neural model. Furthermore, such neural approaches have barely been explored for this task in Arabic, where previous state-of-the-art systems rely on combining various statistical and rule-based approaches.
We experiment with both character embeddings and pre-trained word embeddings, using several embedding models, and we achieve a state-of-the-art F1 score on an Arabic spelling correction task.

Related Work
The encoder-decoder neural architecture has shown promising results in text normalization tasks, particularly in character-level models (Xie et al., 2016; Ikeda et al., 2016). More recently, augmenting this neural architecture with the attention mechanism (Luong et al., 2015) has dramatically increased the quality of results across most NLP tasks. However, in text normalization, state-of-the-art results involving attention (e.g., Xie et al., 2016) also rely on several other models during inference, such as language models and classifiers to filter suggested edits. Neural architectures at the word level inherently rely on multiple models to align and separately handle out-of-vocabulary (OOV) words (Yuan and Briscoe, 2016).
In the context of Arabic, we are only aware of one attempt to use a neural model for end-to-end text normalization (Ahmadi, 2017), but it fails to beat all baselines reported later in this paper. Arabic diacritization, which can be considered a form of text normalization, has received a number of neural efforts (Belinkov and Glass, 2015; Abandah et al., 2015). However, state-of-the-art approaches for end-to-end text normalization are hybrid models that rely on several additional models and rule-based approaches (Pasha et al., 2014; Nawar, 2015; Zalmout and Habash, 2017). These introduce direct human knowledge into the system, but they are limited to correcting specific mistakes and require expert knowledge to develop.

Approach
Many common mistakes addressed by text normalization occur fundamentally at the character level. Moreover, the input data tends to be too noisy for a word-level neural model to be an end-to-end solution due to the high number of OOV words. In Arabic, particularly, mistakes may range from simple orthographic errors (e.g., positioning of Hamzas) and keyboard errors to dialectal code switching and spelling variations, making the task more challenging than a generic language correction task. We opt for a character-level neural approach to capture these highly diverse mistakes. While this method is less parallelizable due to the long sequence lengths, it is still more efficient due to the small vocabulary size, making inference and beam search computationally feasible.

Neural Network Architecture
Given an input sentence x and its corrected version y, the objective is to model P(y|x). The vocabulary can consist of any number of unique tokens, as long as the following are included: a padding token to make input batches have equal length, the two canonical start-of-sentence and end-of-sentence tokens of the encoder-decoder architecture, and an OOV token to replace any character outside the training data during inference. Each character x_i in the source sentence x is mapped to the corresponding d_ce-dimensional row vector c_i of a learnable d_voc × d_ce embedding matrix, initialized with a random uniform distribution with mean 0 and variance 1. For the encoder, we learn d-dimensional representations for the sentence with two gated recurrent unit (GRU) layers, making only the first layer bidirectional following Wu et al. (2016). Like long short-term memory (Hochreiter and Schmidhuber, 1997), GRU layers are well known to improve the performance of recurrent neural networks (RNN), but are slightly more computationally efficient than the former.
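The vocabulary requirements above can be illustrated with a minimal sketch. The token names (`<pad>`, `<s>`, `</s>`, `<oov>`), the helper functions, and the choice to wrap the source sentence in start and end tokens are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical special-token names; only their roles come from the text.
PAD, SOS, EOS, OOV = "<pad>", "<s>", "</s>", "<oov>"

def build_vocab(sentences):
    """Assign an integer id to the four special tokens and to every training character."""
    vocab = {tok: i for i, tok in enumerate((PAD, SOS, EOS, OOV))}
    for sentence in sentences:
        for ch in sentence:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Wrap a sentence in start/end tokens, map unseen characters to OOV,
    and pad the result to max_len + 2 ids so a batch has equal length."""
    ids = [vocab[SOS]] + [vocab.get(ch, vocab[OOV]) for ch in sentence] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len + 2 - len(ids))

vocab = build_vocab(["ab", "ba c"])
# "z" never appears in the training sentences, so it is mapped to the OOV id.
batch = [encode(s, vocab, max_len=4) for s in ["ab", "abz"]]
```

Each id in `batch` would then index a row of the d_voc × d_ce embedding matrix, with d_voc equal to `len(vocab)`.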
For the decoder, we use two GRU layers along with the attention mechanism proposed by Luong et al. (2015) over the encoder outputs h_i. The initial states for the decoder layers are learned with a fully-connected tanh layer, which we compute from the first encoder output. During training, we use scheduled sampling (Bengio et al., 2015) and feed the d_ce-dimensional character embeddings at every time step, but with a constant sampling probability. While tuning scheduled sampling, we found that introducing a sampling probability provided better results than relying only on the ground truth, i.e., teacher forcing (Williams and Zipser, 1989). However, introducing a schedule did not yield any improvement over keeping the sampling probability constant, and it unnecessarily complicates the hyperparameter search. We also use dropout (Srivastava et al., 2014) on the non-recurrent connections of both the encoder and decoder layers during training.
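A minimal sketch of sampling with a constant probability follows. It assumes that the sampling probability is the chance of feeding the decoder its own previous prediction instead of the gold character; `decoder_inputs` and `predict_next` are hypothetical names, and `predict_next` stands in for one decoder step.

```python
import random

def decoder_inputs(gold_ids, predict_next, sampling_prob=0.35, rng=random):
    """Build the decoder input sequence one step at a time.

    gold_ids[0] is the start-of-sentence id. At every later position the input
    is the model's own prediction with probability sampling_prob, and the gold
    token otherwise. The probability stays constant (no schedule).
    """
    inputs = [gold_ids[0]]
    for t in range(1, len(gold_ids)):
        predicted = predict_next(inputs)          # model's guess for position t
        use_model = rng.random() < sampling_prob  # constant-probability coin flip
        inputs.append(predicted if use_model else gold_ids[t])
    return inputs

# Toy stand-in for a decoder step: always predicts id 9.
always_nine = lambda prefix: 9
teacher_forced = decoder_inputs([1, 4, 5], always_nine, sampling_prob=0.0)
fully_sampled = decoder_inputs([1, 4, 5], always_nine, sampling_prob=1.0)
```

Setting `sampling_prob=0.0` recovers teacher forcing, which makes the two regimes easy to compare during tuning.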
The decoder outputs are fed to a final softmax layer that maps the vectors to dimension d_voc to yield an output sequence y. The loss function is the canonical cross-entropy loss per time step, averaged over the y_i.
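As a sketch of this loss, the following computes a softmax over d_voc scores at each time step and averages the negative log-likelihood of the gold characters. The function names are illustrative, and masking of padding positions is omitted for brevity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vector of d_voc scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_loss(step_logits, target_ids):
    """Cross-entropy per time step, averaged over the target characters y_i."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)

# With uniform scores over 4 characters, every step contributes log(4).
uniform_loss = sequence_loss([[0.0] * 4, [1.0] * 4], [2, 0])
```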

Word Embeddings
To address the challenge posed by the small amount of training data, we propose adding pre-trained word-level information to each character embedding. To learn these word representations, we use FastText (Bojanowski et al., 2016), which extends Word2Vec (Mikolov et al., 2013) by adding subword information to the word vector. This is very suitable for this task, not only because many mistakes occur at the character level, but also because FastText handles almost all OOVs by omitting the Word2Vec representation and simply using the subword-based representation. It is possible, yet extremely rare, that FastText cannot handle a word; this can occur if the word contains an OOV n-gram or character that did not appear in the data used to train the embeddings. It should also be noted that these features are only fed to the encoder layer; the decoder layer only receives d_ce-dimensional character embeddings as inputs, and the softmax layer has a d_voc-dimensional output. Each character embedding c_i is replaced by the concatenation [c_i ; w_j] before being fed to the encoder-decoder model, where w_j is the d_we-dimensional word embedding for the word in which c_i appears. This effectively handles almost all cases except white spaces, for which we always append a d_we-dimensional vector w_ initialized with a random uniform distribution of mean 0 and variance 1. For OOVs, we append the whitespace embedding w_ to the word's characters.
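The concatenation [c_i ; w_j] can be sketched as follows, using plain Python lists as vectors. The dictionaries `char_vecs` and `word_vecs` are toy stand-ins for the learned character embeddings and the FastText vectors, and the function name is hypothetical; in practice the word lookup would rarely miss, since most unseen words still receive a subword-based vector.

```python
def char_features(sentence, char_vecs, word_vecs, space_vec):
    """Return one [c_i ; w_j] vector per character of the sentence.

    Characters inside a word get that word's embedding appended. White spaces
    and words without an embedding get the shared vector space_vec (w_).
    """
    features = []
    for k, word in enumerate(sentence.split(" ")):
        if k > 0:
            features.append(char_vecs[" "] + space_vec)  # the white space itself
        w = word_vecs.get(word, space_vec)               # OOV word falls back to w_
        for ch in word:
            features.append(char_vecs[ch] + w)           # list concatenation
    return features

# Toy vectors with d_ce = 1 and d_we = 2.
char_vecs = {"a": [1.0], "b": [2.0], " ": [0.0]}
word_vecs = {"ab": [0.5, 0.5]}
space_vec = [9.0, 9.0]
feats = char_features("ab ba", char_vecs, word_vecs, space_vec)
```

The output has exactly one vector per input character, so the encoder sequence length is unchanged and only the feature dimension grows from d_ce to d_ce + d_we.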

Inference
During inference, the decoder layer uses a beam search, keeping a fixed number (i.e., the beam width) of prediction candidates with the highest log-likelihood at each step. Whenever an "end-of-sentence" token is produced in a beam, the decoder stops predicting further tokens for it. We then pick the individual beam with the highest overall log-likelihood as our prediction. As a final step, we reduce text sequences that are repeated six or more times to a threshold of 5 repetitions (e.g., "abababababab" → "ababababab"). This helps address rare cases where the decoder misbehaves and produces non-stop repetitions of text, and also helps avoid extreme running times for the NUS MaxMatch scorer (Dahlmeier and Ng, 2012), which we use for evaluation and comparison purposes.
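The two inference steps can be sketched as below. Here `step_log_probs` is a hypothetical callback standing in for one decoder step, the `max_len` cutoff is an added safeguard, and the regular expression is one possible reading of the repetition rule (it also collapses runs of a single repeated character); none of these details are specified in the text.

```python
import math
import re

def beam_search(step_log_probs, sos, eos, beam_width=5, max_len=50):
    """step_log_probs(prefix) returns {token: log-probability} for the next token."""
    beams = [([sos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:                 # finished beams are kept unchanged
                candidates.append((tokens, score))
                continue
            for tok, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep the beam_width candidates with the highest log-likelihood.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return max(beams, key=lambda b: b[1])[0]

def collapse_repetitions(text, max_repeats=5):
    """Reduce any unit repeated more than max_repeats times in a row to max_repeats copies."""
    pattern = re.compile(r"(.+?)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

def toy_model(prefix):
    """Toy next-token distribution used only to exercise the search."""
    if len(prefix) == 1:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"</s>": math.log(0.55), "a": math.log(0.45)}
    return {"</s>": math.log(0.9), "a": math.log(0.1)}

best = beam_search(toy_model, "<s>", "</s>", beam_width=2)
greedy = beam_search(toy_model, "<s>", "</s>", beam_width=1)
cleaned = collapse_repetitions("abababababab")
```

In this toy setting the width-2 search finds the sequence ending in "b" (probability 0.36), whereas a width of 1 commits to "a" early and ends with probability 0.33, which illustrates why a beam is kept.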

Data
We tested the proposed approach on the QALB dataset, a corpus for Arabic language correction and the subject of two shared tasks (Rozovskaya et al., 2015). Following the guidelines of both shared tasks, we only used the training data of the QALB 2014 shared task corpus (19,411 sentences). Similarly, the validation dataset used is only that of the QALB 2014 shared task, consisting of 1,017 sentences. We use two blind tests, one from each year. During training, we only kept sentences of up to 400 characters in length, resulting in the loss of 172 sentences.

Metric
As in the QALB shared tasks, we use the MaxMatch scorer to compute the optimal word-level edits that map each source sentence to its respective corrected sentence. We report the F1 score of these edits against those provided in the gold data by the same tool. We compare against the best reported system in the QALB 2014 shared task test set (CLMB), as well as the best in the QALB 2014 shared task development and the QALB 2015 shared task test sets (CUFE) (Nawar, 2015).
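For reference, precision, recall, and F1 over edits can be sketched as follows. This is a simplification and not the MaxMatch scorer itself: it assumes the system and gold edits have already been extracted, represents each edit as a hypothetical (start, end, replacement) tuple, and returns 0 when a denominator is empty.

```python
def edit_f1(system_edits, gold_edits):
    """Return (precision, recall, F1) of system edits against gold edits."""
    system, gold = set(system_edits), set(gold_edits)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy edits: (start word index, end word index, replacement text).
system = [(0, 1, "x"), (2, 3, "y"), (4, 5, "z")]
gold = [(0, 1, "x"), (2, 3, "y"), (6, 7, "w"), (8, 9, "v")]
p, r, f1 = edit_f1(system, gold)
```

Here two of three proposed edits are correct (precision 2/3) and two of four gold edits are recovered (recall 1/2), giving F1 = 4/7.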

Baselines
We consider two different baselines. The first is the output from MADAMIRA (Pasha et al., 2014), a tool for morphological analysis and disambiguation of Arabic. The second uses maximum likelihood estimation (MLE) based on the counts of the MaxMatch gold edits from the training data; that is, each word or phrase is either replaced or kept as is, depending on the most common action in the training data. We found that, contrary to what Eskander et al. (2013) suggested, first using MADAMIRA and then MLE yields better results than composing these in the reverse order. The results are presented in Table 1.
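The MLE baseline can be sketched with simple counts. The version below is a simplification that handles single tokens only (the baseline described above also covers phrases) and assumes the gold edits have already been turned into aligned (source, corrected) token pairs; the function names and the Latin-script toy tokens are illustrative.

```python
from collections import Counter, defaultdict

def train_mle(aligned_pairs):
    """Count, for each source token, how often it maps to each corrected token.

    A pair with identical source and corrected tokens counts as 'keep as is'.
    Returns the most frequent outcome per source token.
    """
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

def apply_mle(tokens, table):
    """Replace each token by its most common training outcome; unseen tokens are kept."""
    return [table.get(tok, tok) for tok in tokens]

pairs = [("teh", "the"), ("teh", "the"), ("teh", "teh"), ("cat", "cat")]
table = train_mle(pairs)
corrected = apply_mle(["teh", "cat", "dog"], table)
```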

Model Settings
In all experiments, we used batch and character embedding sizes of b = d_ce = 128, a hidden layer size of d = 256, a dropout probability of 0.1, a decoder sampling probability of 0.35, and gradient clipping with a maximum norm of 10. When running all the trained models during inference, we used a beam width of 5. We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0005, ε = 1×10^−8, β_1 = 0.9, and β_2 = 0.999, and trained the model for 30 epochs. We report three different setups with FastText word embeddings: narrow, wide, and the concatenation of both. For each of these, we report results on two separately trained models: one without preprocessing, and one with MADAMIRA and then MLE preprocessing applied to the inputs. We also report an ablation study where we choose the best of these six trained models and compare it against two separately trained models with identical setups, but using Word2Vec and no word-level features, respectively. All the word embeddings used are of dimension d_we = 300 and were trained with a single epoch over the Arabic Gigaword corpus (Parker et al., 2011). In the experiments including preprocessing, the respective word vectors were obtained from Gigaword preprocessed with MADAMIRA. The narrow and wide word embeddings were trained with context windows of sizes 2 and 5, respectively. All other hyperparameters were kept at the default FastText values, except the minimum n-gram size, which was reduced from 3 to 2 to compensate for single-character prefixes and suffixes that appear in Arabic when omitting the short vowels.

Results
Development set results are shown in Tables 2 and 3, and test set results in Table 4. In all models, training without preprocessing consistently yielded better results than the analogous setups with the inputs first passed through MADAMIRA and then MLE. All the FastText embedding setups with no preprocessing outperformed the previous state-of-the-art results on the development dataset. We hypothesize that this occurs because the model has access to more examples of errors to normalize during training, thereby increasing performance. The best performing model was that with the narrow word embeddings, consistent with prior results showing the superior performance of narrow word embeddings over both wide embeddings and the concatenation of both. This is supported by Goldberg (2015) and Trask et al. (2015), who illustrate that wider word embeddings tend to capture more semantic information, while narrower word embeddings model more syntactic phenomena.
In our ablation study, we compared the performance of the narrow FastText embeddings against narrow Word2Vec embeddings trained over the same Arabic Gigaword corpus with the same hyperparameters, as well as against no word-level embeddings at all. The results, displayed in Table 3, show that using only Word2Vec slightly increases precision but significantly hurts recall. This highlights the effectiveness of using FastText for text normalization, as well as the need to handle OOVs in a noisy context for word-level representations to help in this particular problem. Although OOV cases can help the model by indicating that a word is likely erroneous, they do not provide linguistic information the way FastText does. The narrow FastText embeddings with no preprocessing setup achieved state-of-the-art results in all three datasets, beating all systems in both the 2014 and 2015 QALB shared tasks in F1 score.
[Table 4, rows as recovered from extraction: (row label missing) 67.91, -; CUFE (Nawar, 2015) 66.68, 72.87; Narrow embeds 70.39, 73.19]

Error Analysis
We conducted a detailed error analysis of 101 sentences from the development set (6,370 words). The sample contained 1,594 erroneous words (25%). The errors were manually classified into a number of categories, which are presented in Table 5. The table indicates the percentage of each error type in the whole set of errors, as well as the error-specific F1 and an example. Some common problems, such as Hamza (glottal stop) and Ta Marbuta (feminine ending), are well handled in our best system. This is due to their commonality in the training data. Other types are less common: dialectal constructions, consonantal switches, and Mood/Case. Punctuation is very common; however, it is also very idiosyncratic. We also note the presence of a small percentage (0.5%) of gold errors. For more information on Arabic language orthography issues from a computational perspective, see Buckwalter (2007), Habash (2010), and Habash et al. (2012).

Conclusion and Future Work
We propose a novel approach to text normalization by enhancing character embeddings with word-level features that model subword information and syntactic phenomena. This significantly improves the neural model's recall, allowing the correction of more complex and diverse errors. Our approach achieves state-of-the-art results on the QALB dataset, despite it being small and seemingly unsuited for a neural model. Moreover, our neural model is sophisticated enough to not benefit from preprocessing techniques that reduce the number of errors in the data. Our approach is general enough to be implemented for any other text normalization task and does not rely on domain-specific knowledge to develop. Future directions include expanding the number of training pairs via synthetic data generation, where generative models can potentially add human-like errors to a large, unannotated corpus. Different sequence-to-sequence architectures, such as the Transformer (Vaswani et al., 2017), could also be explored more exhaustively. The word-level features provided by FastText could also be replaced by separately trained neural approaches that generate word embeddings from a word's characters (e.g., ELMo embeddings; Peters et al., 2018), and could also be fine-tuned towards specific applications. Another interesting direction is hybrid word-character architectures, where the encoder receives word-level features while the decoder operates at the character level. We are also interested in applying our approach to other languages and dialects.