Neural text normalization leveraging similarities of strings and sounds

We propose neural models that can normalize text by considering the similarities of word strings and sounds. We experimentally compared a model that considers the similarities of both word strings and sounds, models that consider only the similarity of word strings or only that of sounds, and a model without these similarities as a baseline. The results showed that leveraging word string similarity succeeded in dealing with misspellings and abbreviations, while taking sound similarity into account succeeded in dealing with phonetic substitutions and emphasized characters. As a result, the proposed models achieved higher F1 scores than the baseline.


Introduction
Non-standard words such as misspellings and abbreviations are often used in social media such as Twitter. It is difficult to understand the meaning of such words without extra knowledge, and it is challenging to perform natural language processing over sentences that include them. Text normalization plays an important role in dealing with such broken texts by correcting a non-standard sentence into a standard one; for example, the non-standard word 'homeee' is normalized to 'home'.
Even though the similarities of character surfaces and phonemes are important for humans to understand non-standard words, current neural network-based text normalization methods do not consider this information. We assume that text normalization that is more intuitive to humans is possible by explicitly considering such features in a neural network-based method. Based on this assumption, in this work, we propose neural text normalization models that leverage both string and sound similarities. Experimental results show that our proposed models outperformed a baseline and achieved state-of-the-art results on the text normalization track of WNUT-2015.

Figure 1: How to introduce the character string feature into Seq2Seq.
Related work

A previous system achieved strong performance on the WNUT-2015 task with a method that generates candidates based on the training data. van der Goot and van Noord (2017) extended this work by leveraging additional resources from Twitter and Wikipedia data. However, their method does not take contextual information into consideration.
To solve this problem, Sequence-to-Sequence (Seq2Seq) models (Sutskever et al., 2014) have been used for text normalization. Lourentzou et al. (2019) performed highly accurate and fluent text normalization by using attention-based Seq2Seq (Bahdanau et al., 2014). Mani et al. (2020) used Seq2Seq for automatic speech recognition error correction, a task similar to text normalization. Taking these trends into account, in this work, we propose a neural text normalization model that leverages both string and sound similarities.
Methodology

Figure 1 shows an overview of our proposed method. In this study, we perform text normalization based on the method of Lourentzou et al. (2019), which incorporates token embeddings as an input to the encoder. Our method extends their work by utilizing features related to character strings and sounds.


Deep Levenshtein

Moon et al. (2018) proposed Deep Levenshtein, a network that captures character-string features based on the Levenshtein edit distance (Levenshtein, 1966) in order to correct character-level variations of named entities in text. In this study, we incorporate this mechanism into the Seq2Seq model to make it more robust to broken text. Deep Levenshtein is a neural network that takes two words x and y as input and outputs hidden representations of their character strings. The two words x and y are fed into a word embedding layer to obtain word embeddings $e_x$ and $e_y$, respectively. Each word embedding is then input to a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to obtain hidden representations $c_x$ and $c_y$ that capture the character-string features. Here, $\overrightarrow{h}_x$ and $\overleftarrow{h}_x$ denote the forward and backward hidden states of the bidirectional LSTM, and $c_x$ is obtained by concatenating them:

$c_x = [\overrightarrow{h}_x ; \overleftarrow{h}_x]$

where ; indicates a concatenation operation. Deep Levenshtein learns the representations $c_x$ and $c_y$ so that the cosine-based similarity between them approximates the edit-distance-based similarity of the input character strings, by minimizing the difference between the two similarities, where

$s(x, y) = 1 - \dfrac{d(x, y)}{\max(\mathrm{length}(x), \mathrm{length}(y))}$

is the edit-distance-based similarity between the input words x and y and $d(x, y)$ is their Levenshtein edit distance.

Because Deep Levenshtein can predict the character-string similarity between two words from their distance in vector space, the resulting vectors capture the character-string features.
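To make this concrete, the following is a minimal sketch of a Deep Levenshtein-style network in PyTorch. The character vocabulary, dimensions, and single training step are our own illustrative choices rather than the authors' exact configuration; the loss pushes the rescaled cosine similarity of $c_x$ and $c_y$ toward the edit-distance-based similarity $s(x, y)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance d(x, y)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_similarity(x: str, y: str) -> float:
    """s(x, y) = 1 - d(x, y) / max(length(x), length(y))."""
    return 1.0 - levenshtein(x, y) / max(len(x), len(y))

class CharBiLSTMEncoder(nn.Module):
    """Encodes a word into c = [h_forward ; h_backward]."""
    def __init__(self, vocab_size: int, emb_dim: int = 32, hidden: int = 20):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(self.emb(char_ids))
        return torch.cat([h[0], h[1]], dim=-1)  # concatenate both directions

# Hypothetical character vocabulary for this example.
vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def encode(word: str) -> torch.Tensor:
    return torch.tensor([[vocab[c] for c in word]])

enc = CharBiLSTMEncoder(vocab_size=len(vocab) + 1)
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)

# One training step on a single word pair from the paper's examples.
x, y = "homeee", "home"
target = torch.tensor(edit_similarity(x, y))
c_x, c_y = enc(encode(x)), enc(encode(y))
# Rescale cosine similarity from [-1, 1] to [0, 1] before matching s(x, y).
pred = 0.5 * (F.cosine_similarity(c_x, c_y) + 1.0)
loss = F.mse_loss(pred.squeeze(), target)
opt.zero_grad()
loss.backward()
opt.step()
```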

Deep Metaphone
Raghuvanshi et al. (2019) revealed that relying only on surface text similarities cannot capture phonetic differences between words. Furthermore, Han et al. (2013) showed that sound-related features are effective in text normalization. In this work, we propose Deep Metaphone, which captures sound features for text normalization by learning the phonetic edit distance. Deep Metaphone has the same network structure as Deep Levenshtein; the difference between them is the training data they use. Deep Metaphone learns the phonetic edit distance, i.e., the edit distance between the strings obtained from Double Metaphone (Philips, 2000). As an example, applying the Double Metaphone algorithm to the words 'yeeeees' and 'yes' yields nearly identical phonetic codes, so their phonetic edit distance is small even though their surface strings differ. By using Double Metaphone, Deep Metaphone, like Deep Levenshtein, can predict the sound similarity between two words, and we can therefore use the resulting vector that captures the sound features.
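The conversion can be reproduced with, for example, the third-party `metaphone` Python package (an assumption on our part; the paper does not name an implementation). The phonetic edit distance that Deep Metaphone learns is then just the edit distance over the resulting codes:

```python
# pip install metaphone  (a third-party Double Metaphone implementation)
from metaphone import doublemetaphone

for word in ("yeeeees", "yes"):
    primary, secondary = doublemetaphone(word)  # returns (primary, alternate)
    print(word, "->", primary, secondary or "(no alternate code)")

# The phonetic edit distance is the edit distance between the codes,
# e.g., levenshtein(doublemetaphone("yeeeees")[0], doublemetaphone("yes")[0])
# using the levenshtein() function from the previous sketch; for words that
# sound alike it is small even when the surface strings differ considerably.
```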

Incorporating new features to Seq2Seq
In this study, we use a bi-directional LSTM for both the encoder and the decoder. As illustrated in Figure 1, the encoder input at each timestep is the token embedding $e^{token}_{x_t}$ concatenated with the Deep Levenshtein representation $c^{leven}_{x_t}$ and the Deep Metaphone representation $c^{phone}_{x_t}$ of the input token $x_t$.
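A minimal sketch of such an encoder follows; the wiring reflects our reading of Figure 1, the embedding and feature sizes follow the experimental settings below (token embeddings of size 100, feature sizes of 10 when both are used), and the hidden size is an illustrative choice:

```python
import torch
import torch.nn as nn

class FeatureAugmentedEncoder(nn.Module):
    """BiLSTM encoder over [e_token ; c_leven ; c_phone] at each timestep."""
    def __init__(self, vocab_size: int, token_dim: int = 100,
                 leven_dim: int = 10, phone_dim: int = 10, hidden: int = 256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, token_dim)
        self.lstm = nn.LSTM(token_dim + leven_dim + phone_dim, hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids, c_leven, c_phone):
        # token_ids: (batch, seq_len)
        # c_leven:   (batch, seq_len, leven_dim)  from Deep Levenshtein
        # c_phone:   (batch, seq_len, phone_dim)  from Deep Metaphone
        x = torch.cat([self.tok_emb(token_ids), c_leven, c_phone], dim=-1)
        outputs, state = self.lstm(x)
        return outputs, state  # consumed by an attention-based decoder
```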

Experimental settings
We used the WNUT-2015 Shared Task 2 dataset (Baldwin et al., 2015), a task of normalizing social media texts, for our evaluation. The official dataset consists of 4,917 tweets with 373 non-standard words. The dataset was randomly split by Lourentzou et al. (2019) in a 60:40 ratio, into 2,950 tweets for training and 1,967 tweets for testing. For evaluation metrics, we used precision, recall, and F1-score.
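For reference, with $P$ and $R$ denoting precision and recall over the normalization decisions (our reading of the standard evaluation for this task), the F1-score is their harmonic mean:

$F_1 = \dfrac{2PR}{P + R}$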
To train Deep Levenshtein, we applied the noise generator of Lourentzou et al. (2019) to the words in the training data of WNUT-2015 Shared Task 2. The noise generator outputs character strings by executing one of the processes in Table 1. We used both the generated words and the original words to train Deep Levenshtein. For training Deep Metaphone, we first created training data for Deep Levenshtein as above and then applied the Double Metaphone algorithm to the words to yield the training data.
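As a sketch of this data-construction step: Table 1 is not reproduced here, so the noise function below uses generic character-level perturbations (deletion, duplication, swap) purely as stand-ins for the actual operations; `doublemetaphone` and `levenshtein` are as in the earlier sketches.

```python
import random
from metaphone import doublemetaphone  # as in the earlier sketch

def add_noise(word: str) -> str:
    """Stand-in noise generator; Table 1's actual operations are not shown."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word))
    op = random.choice(("delete", "duplicate", "swap"))
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "duplicate":
        return word[:i] + word[i] + word[i:]
    j = min(i + 1, len(word) - 1)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def metaphone_pair(word: str):
    """One Deep Metaphone training example: a word pair and its target
    phonetic similarity, computed over the Double Metaphone codes."""
    noisy = add_noise(word)
    px, py = doublemetaphone(word)[0], doublemetaphone(noisy)[0]
    sim = 1.0 - levenshtein(px, py) / max(len(px), len(py), 1)
    return word, noisy, sim
```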
Parameters, including the size of the token embeddings in the baseline model, were set to the same values as in Lourentzou et al. (2019), where the token embedding size was 100. The sizes of the hidden layers $c^{leven}_{x_t}$ and $c^{phone}_{x_t}$ of the LSTMs for Deep Levenshtein and Deep Metaphone were tuned over {10, 20, 30, 40, 50} on the validation data, 100 sentences randomly extracted from the training data. When only Deep Levenshtein was used, 20 was selected; when only Deep Metaphone was used, 50 was selected; when both were used, 10 was selected for each. The reason we searched over values from 10 to 50, smaller than the token embedding size of 100, is based on the finding of Sennrich and Haddow (2016), who reported that the size of secondary embeddings, i.e., $c^{leven}_{x_t}$ and $c^{phone}_{x_t}$ in our case, should be smaller than that of the primary embeddings, i.e., $e^{token}_{x_t}$ in our case.

Compared models
In the experiments, we compared a baseline model and our models, which are listed below.
Two-stage Seq2Seq (baseline): A model proposed by Lourentzou et al. (2019). We reimplemented this model and report its performance in addition to the scores reported in their paper.

Two-stage Seq2Seq + LS: The baseline model augmented with the character-string representations from Deep Levenshtein (LS).

Two-stage Seq2Seq + MP: The baseline model augmented with the sound representations from Deep Metaphone (MP).

Two-stage Seq2Seq + LS + MP: The baseline model augmented with both representations.

Results and analysis
We report the performance of the baseline model and our models in Table 2. As shown, our models, i.e., Two-stage Seq2Seq + LS, + MP, and + LS + MP, outperformed the baseline model in terms of F1 score. Table 3 shows example outputs from the Two-stage Seq2Seq model and our models. From the table, we can see that the Two-stage Seq2Seq + LS and + LS + MP models corrected the misspelled 'wen' to 'when'. The Two-stage Seq2Seq + LS and + MP models corrected the abbreviation 'diss' to 'disrespect'. Only the Two-stage Seq2Seq + MP model corrected the phonetic substitution 'd' to 'the'. All of our models corrected the emphasized 'homeee' to 'home'. These results show that the Two-stage Seq2Seq + LS model works well for misspellings and abbreviations and that the Two-stage Seq2Seq + MP model works well for phonetic substitutions and emphasized characters.
In contrast, the weakness of the Two-stage Seq2Seq + LS model is that it fails to deal with typos that require correcting a character to another character that is far away on the keyboard layout, such as 'thang' to 'thing'. The weakness of the Two-stage Seq2Seq + MP model is that it tends to mistakenly correct a word to another word with a similar sound, such as 'nah' to 'no'. When considering both features, we also observed several error cases, such as 'favor' to 'favorite', where the model wrongly changed a word. We speculate that these are due to the small search range for the sizes of the hidden layers ($c^{leven}_{x_t}$ and $c^{phone}_{x_t}$). Furthermore, none of the models were able to handle the correction of emoji, such as a thumbs-up emoji. This is because the knowledge in the current models is based only on the training data, and we do not utilize specific knowledge of the connection between linguistic information and other types of information, including visual information, which we leave for future work.

Conclusion and future work
In this paper, we proposed a method that takes into account the similarities of word strings and sounds as features for text normalization. Our evaluation results showed that incorporating such features into Seq2Seq improves the performance of text normalization over the baseline method in terms of F1 score. We also found that the proposed methods managed to effectively consider both surface character and phonetic similarities.
For future work, we will try data augmentation based on the finding of Grundkiewicz et al. (2019), who reported that performance can be improved by generating synthetic data in which commonly confused words are substituted for each other. In addition, we will use the dataset proposed by Sproat and Jaitly (2016), which is much larger than the WNUT-2015 Shared Task 2 dataset, to gain deeper empirical insights. We would also like to apply ELMo (Peters et al., 2018), the Transformer (Vaswani et al., 2017), and BERT (Devlin et al., 2019) to Seq2Seq for better performance.