Noisy Uyghur Text Normalization

Uyghur is the second largest and most actively used social media language in China. However, a non-negligible and growing share of the Uyghur text on social media is written unsystematically with the Latin alphabet. Uyghur text in this format is incomprehensible and ambiguous even to native Uyghur speakers. In addition, such text cannot readily be used to advance NLP tasks related to the Uyghur language. Restoring noisy Uyghur text written with unsystematic Latin alphabets, and preventing more of it from being produced, is essential for protecting the Uyghur language and for improving the accuracy of Uyghur NLP tasks. To this end, we propose and compare two normalization methods: the noisy channel model and the neural encoder-decoder model.


Introduction
Uyghur is an alphabetic language whose alphabet has 32 letters. Currently, Uyghur is written with Perso-Arabic, Latin or Cyrillic-based scripts. The most widely used Uyghur alphabet is the modified Perso-Arabic script. However, in some situations, especially on social media, users adopt Latin letters to overcome certain limitations of the Perso-Arabic script. A major problem is that Latin letters are used irregularly as alternatives to the Perso-Arabic script, because the mapping between the Perso-Arabic script and the Latin alphabet is not trivial. For example, "X", "SH" and "Ş" are all used as alternative representations of the Perso-Arabic character ش (phoneme [ʃ]). Table 1, which is based on the results of a survey we conducted, shows that 15 out of 32 letters have two to four alternatives. Although the unsystematic usage of Latin-based alphabets is a well-discussed problem within Uyghur society, to the best of our knowledge it has not been studied in the literature; it is only described in (Duval and Janbaz, 2006) as "unsystematic transliterations". In this paper, we refer to this issue as unsystematic usage of Latin alphabets (UULA).
The UULA problem is similar to text normalization, which has received attention recently (Sproat et al., 2001; Ikeda et al., 2016) because of the large amount of unnormalized text on social media. In this work, with respect to the smallest text element, we divide the text normalization problem into two sub-categories: word-based and character-based normalization. Word-based normalization (Sproat et al., 2001; Ikeda et al., 2016) turns non-standard words such as slang, acronyms and phonetic substitutions into standard dictionary words. Character-based normalization, on the other hand, transforms the raw text by substituting irregularly used characters with the proper ones. Character-level normalization includes problems such as diacritic restoration (DR) (Mihalcea and Nastase, 2002), de-ASCIIfication (Arslan, 2015) and so on.
UULA normalization is a character-level normalization, yet it is harder than other character-level normalization problems. It is a many-to-many mapping problem, while most other types of character-level normalization are one-to-many. As mentioned above, Table 1 shows that 15 of 32 characters have 2 to 4 alternatives. Besides that, UULA texts suffer from heavy ambiguity as well. For instance, the sentence "I gave a Yuan", written in UULA as "Men bir koy berdim", may be read as either "I gave a sheep" or "I gave a Yuan". Table 2 shows some other cases of ambiguity. In short, the UULA restoration addressed in this paper is a non-trivial problem. UULA restoration techniques are critical for processing non-standard Uyghur text and for developing a new type of input method editor (IME) that automatically suggests correctly written words and thus reduces the amount of UULA text. Figure 1 and Table 3 show several real examples of the increasing amount of UULA text on social media and the Internet. In this study we aim to 1) process and standardize the UULA text on the web so that it can be used for other NLP tasks such as information retrieval, and 2) help to create IMEs equipped with UULA restoration techniques that will prevent the generation of more non-standard text. Furthermore, although UULA restoration is a problem specific to the Uyghur language, the results will be useful for other character-level normalization problems and may be applied to languages with similar mapping issues.
The rest of the paper is organized as follows: we first discuss the background and related work in Sections 2 and 3. Then, the methods for UULA restoration are described in Section 4. The experimental setup is given in Section 5, followed by the results and discussion. Finally, we present the conclusion and future work.

Uyghur Alphabets
Uyghur is the native language of more than 15 million Uyghur people. Currently, the modern Uyghur Perso-Arabic alphabet (UPAA) is the most used and the official script of the Xinjiang Uyghur Autonomous Region of China. In the last century, due to cultural and political reasons (Duval and Janbaz, 2006), Uyghurs have witnessed several reforms of the Uyghur writing system. Each of them had adverse effects on Uyghur culture and society, such as creating generation gaps, raising the illiteracy rate, and causing the loss of materials written in previous scripts. As a result, Uyghur society tends to reject any new alternative to the currently used UPAA. This social atmosphere has also hindered the spread of an authentic Uyghur Latin alphabet system: the Uyghur Latin alphabet (ULA), a project launched by Xinjiang University in July 2001 (Duval and Janbaz, 2006). Many Uyghur people have not adopted or even learned this system yet.
In the digital information age, Uyghur people, especially the young generation, have started to use Latin letters to bypass the limitations of the UPAA on social media and the Internet. The UPAA has both intrinsic and extrinsic limitations. The intrinsic limitation is that, in many new computer programs, web pages, applications and so on, the UPAA suffers from problems such as incorrect display and the absence of an IME. The extrinsic limitation comes from users. Many Uyghur people are not familiar with the UPAA keyboard. Additionally, some Uyghur people find it inconvenient to type with a UPAA input method or to switch to it from other input methods such as English.
Although Uyghur people use Latin letters as an alternative to the UPAA, many of them have not chosen the authentic ULA as that alternative. Before and after the announcement of the ULA, both systematic and unsystematic transliterations with Latin letters were actively used. According to the survey mentioned in (Duval and Janbaz, 2006), up to 18 different systematic Latin alphabet systems existed in 2000. These were replaced by the ULA after it was announced as the official Latin alternative to the UPAA. However, UULA is still very common in spite of anti-UULA propaganda. Possible explanations can be found from many perspectives: linguistic, social, political, and so on. These discussions are outside the scope of this paper, as our goal is to restore and prevent UULA texts with the aid of an automated system.

Survey
In 2016, we conducted a small e-survey (available at http://goo.gl/forms/5Pi2vCeUr3) about how Uyghur-speaking people use Latin alphabets when writing in Uyghur. In this survey, we included questions about the participants' favorite alphabet system and Latin-based alternatives to the UPAA. Besides that, we asked them to write 10 given words or phrases in the Latin-derived alphabet they personally use (Table 5).
Among the 170 participants, 39.8% mainly use the UPAA, 29.7% mainly use the ULA, and 30.5% use UULA. However, we also discovered that Uyghur people use different scripts in different circumstances: nearly half of the participants frequently use Latin-based characters as alternatives to the UPAA. From the 10 words or sentences that participants typed with Latin letters, we derived the UULA pattern shown in Table 4. According to the table, if a sentence includes all of these characters, there will be nearly 450,000 different alternative representations of that sentence.

Related Work
To our knowledge, this is the first study on UULA restoration. However, the problem is closely related to text normalization, which is the focus of the studies reviewed in this section. With the exponential growth of noisy text, text normalization has become a hot topic in NLP. In the literature, text normalization is viewed as being related to either spell-checking (Cook and Stevenson, 2009; Choudhury et al., 2007) or machine translation (Aw et al., 2006; Kobus et al., 2008; Ikeda et al., 2016). However, it has been pointed out that traditional spell-checking algorithms are not very effective on some text normalization problems, such as normalizing SMS messages, tweets, comments, etc. (Pennell and Liu, 2010; Clark and Araki, 2011).
According to Kukich's early survey (Kukich, 1992) on automatic word correction, there are several types of spelling correction techniques, such as minimum edit distance (Damerau, 1964), similarity key (Odell and Russell, 1918), rule-based methods (Yannakoudakis and Fawthrop, 1983), N-gram-based models (Riseman and Hanson, 1974), probabilistic techniques (Bledsoe and Browning, 1959; Cook and Stevenson, 2009; Choudhury et al., 2007) and neural net techniques (Cherkassky and Vassilas, 1989). Among them, probabilistic models (e.g. the noisy channel model) have been used successfully for text normalization (Cook and Stevenson, 2009; Choudhury et al., 2007). The noisy channel model method normalizes non-standard words with a channel model and a language model, which are estimated by analyzing and processing a large corpus of noisy and formal texts.
Statistical (Aw et al., 2006), rule-based (Beaufort et al., 2010) and neural network techniques (Ikeda et al., 2016) from machine translation have been used for text normalization. Since neural machine translation (Cho et al., 2014) showed promising results, it has also been adapted to other problems such as text normalization and language correction. Xie et al. (2016) applied character-based sequence modelling with an attention mechanism to language correction. The previous work most closely related to our study is Ikeda et al. (2016). They used a neural encoder-decoder model for normalizing noise in Japanese text introduced by the usage of three different writing systems. They also built a synthetic database with predefined rules for data augmentation. They compared their neural network model with rule-based methods, while we compare our neural network model with a probabilistic model.

Method
For UULA restoration, the aim is to recover the target sequence $Y$ from the source sequence $X$. Word-based or character-based models can be used for this. In the character-based model, $X = \langle l^x_1, l^x_2, \ldots, l^x_n \rangle$ and $Y = \langle l^y_1, l^y_2, \ldots, l^y_n \rangle$, where $l^x_1$ is the first character of $X$ and $n$ is the length of the word(s). For the word-based model, on the other hand, $X = \langle w^x_1, w^x_2, \ldots, w^x_m \rangle$ and $Y = \langle w^y_1, w^y_2, \ldots, w^y_m \rangle$, where $m$ is the number of words in $X$ or $Y$, and $w$ is a word. For word-based restoration, we adopt the noisy channel model. Meanwhile, we use an encoder-decoder based sequence-to-sequence model for character-based restoration. In fact, both models can be character-based or word-based. For the encoder-decoder model, we picked the character-based solution over the word-based one to reduce the input dimension. For the noisy channel model, however, we chose the word-based solution because of its simple implementation and robust filtering with a dictionary.

Noisy Channel Model (NCM)
The noisy channel model (Church and Gale, 1991; Mays et al., 1991) is a widely applied method for spell checking. It assumes that spelling mistakes are introduced while the input passes through a noisy communication channel. If $P$ is the probabilistic model of the noisy channel, then the correct word $w^y_i$, from the dictionary $V$, corresponding to the word $w^x_i$ can be found with the following formula:

$$w^y_i = \arg\max_{w \in V} P(w^x_i \mid w)\, P(w) \qquad (3)$$

Equation 3 shows that the target word $w^y_i$ depends on the conditional probability $P(w^x_i \mid w)$ and the prior probability $P(w)$. $P(w)$ is calculated with the language model, while $P(w^x_i \mid w)$ is calculated with the error model. The error model is normally obtained by statistical analysis of real error samples. Since our error samples are created synthetically, we build the error model from the same confusion table with which we generated the corrupt data. This confusion table is at the character level, whereas we need a word-level confusion table. To overcome this issue, we apply the Bledsoe-Browning technique (Bledsoe and Browning, 1959). It calculates the word-level confusion probability by multiplying the confusion probabilities of the letters, as in Equation 4:

$$P(w^x_i \mid w) = \prod_{j=1}^{n} p(l^x_j \mid l_j) \qquad (4)$$

where $l^x_j$ and $l_j$ are the $j$-th letters of $w^x_i$ and $w$, respectively, and $n$ is the number of letters.
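As a rough illustration of how the error model and the prior combine, the sketch below scores every dictionary word by the product of letter-level confusion probabilities and a word prior, then keeps the best one. The confusion table, the unigram prior and the function names are hypothetical toy values chosen for this example, not the paper's actual tables, and the sketch assumes that the noisy and clean words have the same number of letters.

```python
# Hypothetical confusion table: CONFUSION[clean][noisy] = p(noisy letter | clean letter).
CONFUSION = {
    "ş": {"ş": 0.5, "x": 0.5},
    "x": {"x": 1.0},
    "e": {"e": 0.5, "a": 0.5},
    "a": {"a": 1.0},
}

# Hypothetical unigram prior P(w) over a three-word dictionary.
PRIOR = {"şax": 0.6, "xex": 0.3, "xax": 0.1}


def channel_prob(noisy, clean):
    """Word-level confusion probability as a product of letter-level ones."""
    if len(noisy) != len(clean):
        return 0.0  # assumption: only equal-length words are comparable here
    prob = 1.0
    for noisy_letter, clean_letter in zip(noisy, clean):
        # Letters missing from the toy table are assumed to be copied unchanged.
        row = CONFUSION.get(clean_letter, {clean_letter: 1.0})
        prob *= row.get(noisy_letter, 0.0)
    return prob


def restore(noisy, prior):
    """Return the dictionary word maximizing channel probability times prior."""
    best = max(prior, key=lambda w: channel_prob(noisy, w) * prior[w])
    if channel_prob(noisy, best) * prior[best] == 0.0:
        return noisy  # no dictionary word explains the input; keep it as is
    return best


print(restore("xax", PRIOR))
```

In this toy setting a unigram prior stands in for the language model; an n-gram model would replace `prior[w]` with a context-dependent score.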

Neural Encoder-Decoder Model (NEDM)
From a different perspective, the text normalization task can be considered a text regeneration process that starts from the information extracted from noisy data. We can view text reconstruction as rewriting new text with the same meaning. During generation, the text processing model (encoder) extracts abstract information from the un-normalized text. The generation model (decoder) starts to generate the text once it receives this information from the text processing model. The generation model is trained by maximizing the probability of the generated text, $P(Y)$. According to the chain rule, it is decomposed into:

$$P(Y) = \prod_{i=1}^{M} p(y_i \mid y_1, y_2, \ldots, y_{i-1}) \qquad (5)$$

where $M$ is the length of the sequence, and $y_i$ is a unit in the sequence. Therefore, we need a model that learns the conditional distributions $p(y_i \mid y_1, y_2, \ldots, y_{i-1})$. The encoder-decoder model in (Cho et al., 2014) works in a similar fashion. It divides many-to-many mappings into many-to-one and one-to-many mappings. The encoder performs a many-to-one mapping, while the decoder performs a one-to-many mapping. Both the encoder and the decoder are recurrent neural networks. One of the advantages of this model is that the encoder and the decoder are jointly trained to maximize the conditional probability $P(Y \mid X)$:

$$P(Y \mid X) = \prod_{i=1}^{M} p(l^y_i \mid l^y_1, l^y_2, \ldots, l^y_{i-1}, X) \qquad (6)$$

As Figure 2 and Equation 6 show, the encoder extracts abstract information $W$ from the input $X$, and then the decoder generates the target text sequentially, using the information that comes from the encoder and from the previous time step.
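The chain-rule factorization can be made concrete with a few lines of code. The sketch below sums the log of each per-step conditional probability to obtain the log-probability of a whole target sequence given a source sequence. The `uniform_decoder` function is a hypothetical stand-in for a trained decoder (it ignores its inputs and returns a uniform distribution over a four-letter toy alphabet); it is only there to make the example runnable.

```python
import math

ALPHABET = "aexş"  # hypothetical toy alphabet


def uniform_decoder(source, prefix):
    """Stand-in for a trained decoder: p(next letter | prefix, source)."""
    return {letter: 1.0 / len(ALPHABET) for letter in ALPHABET}


def sequence_log_prob(source, target, step_distribution):
    """log P(Y|X) as the sum of log p(y_i | y_1..y_{i-1}, X) over positions."""
    log_p = 0.0
    for i, symbol in enumerate(target):
        dist = step_distribution(source, target[:i])
        log_p += math.log(dist[symbol])
    return log_p


print(sequence_log_prob("xax", "şax", uniform_decoder))
```

Training a real encoder-decoder amounts to adjusting the model behind `step_distribution` so that this quantity is as large as possible on the training pairs.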

Dataset
In the experiments, we use both synthetic and authentic data. We train and build our models with synthetic data because of limited access to real cases and the difficulty of building ground truth. Nevertheless, we conduct tests on both the synthetic data and the real data that we have collected.

Synthetic Data
The synthetic dataset used in our experiments is built by crawling raw text from news websites such as "tianshannet.com", "okyan.com" and "uycnr.cn". In total, we collected 2 GB of data for training and testing, 10 text files of different genres, each of which includes around 586 words. Note that these data are written in UPAA, while we convert them to the CTA format for convenience.
The training of the encoder-decoder model uses pairs of source and target sequences. Target sequences are collected from raw text, while source sequences are created synthetically by randomly replacing letters in the target sequence using the mapping shown in Table 4. Notice that words in synthetic UULA text may include more characters than the ground-truth target words. This is caused by replacing some single letters with double letters, for example, ş with sh, ç with ch, and so on. To ensure that corresponding words in source-target pairs have the same length, we pad n "w"s at the end of a target word whose corresponding source word includes n additional letters. The character "w" is chosen because it is not in CTA. We generate the target and source text for testing in the same way. However, for more convincing test results on synthetic data, we generated 10 different source texts for each target text. The result on each synthetic file is the mean over its 10 cases, while the final accuracy on all synthetic data is the mean of the results on the synthetic files.
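The corruption-and-padding procedure described above can be sketched as follows. The two-entry `ALTERNATIVES` table is a hypothetical excerpt used only for illustration (the full mapping is the one in Table 4), and the function name is likewise an assumption.

```python
import random

# Hypothetical excerpt of the clean-letter -> noisy-alternatives mapping.
ALTERNATIVES = {
    "ş": ["ş", "sh", "x"],
    "ç": ["ç", "ch"],
}


def corrupt_word(word, rng):
    """Return a (noisy source, padded target) pair of equal length."""
    noisy = "".join(rng.choice(ALTERNATIVES.get(ch, [ch])) for ch in word)
    extra = len(noisy) - len(word)
    # Pad the target with "w" (not a CTA letter) for every extra source letter.
    target = word + "w" * extra
    return noisy, target


rng = random.Random(0)  # fixed seed so that runs are repeatable
for _ in range(3):
    print(corrupt_word("şeçme", rng))
```

Calling `corrupt_word` ten times on every word of a target text, with different random draws, would give the ten source variants per test file mentioned above.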

Real Data
We collect 226 sentences (1372 words in total) from social media platforms such as "Wechat", "Facebook" and so on. To build the ground truth, we first use our model for restoration. Secondly, we restore the texts manually. Finally, we apply a spell-checker for further restoration. While collecting the real data and building the corresponding ground truth, we found that the real data contains more noise than UULA alone: there are various types of spelling errors, misuse of punctuation and repetitions.

Neural Encoder-Decoder Model
We built our neural encoder-decoder model with TensorFlow (Abadi et al., 2015). Both the encoder and the decoder use three-layer stacked LSTMs with 256 hidden units and 256-dimensional character embeddings. The model is trained with the Adam optimizer at a learning rate of 0.0001. We trained the model for only 2 epochs with a batch size of 128. We selected the model with the best results on the validation set described below. Training was carried out on a Tesla K40 GPU.
In this model, the length of the target and the source sequences is 30, and, instead of special tokens, a blank space is placed at the beginning and the end of a sequence. These sequences are constructed by grouping words in the raw text while keeping the sequence length at most 30. We build them as follows: first, we tokenize the text on blank spaces and newline characters, and we prepend a blank space to each token. Then, we concatenate the tokens in order, keeping the sequence length at most 30. If concatenating the next word would make the current sequence longer than 30, the rest of the sequence is filled with blank spaces and a new sequence starts from that word. In total, we generate 63,824,760 sequences. We divide them into training, validation and testing sets in the proportions 60%, 20% and 20%.
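A minimal sketch of this sequence construction is given below. The function name is hypothetical, and the handling of a single token longer than the maximum length (it is truncated here) is an assumption, since the text does not say what happens in that case.

```python
def build_sequences(text, max_len=30):
    """Group space-prefixed tokens into fixed-length, space-padded sequences."""
    sequences = []
    current = ""
    for token in text.split():  # splits on blank spaces and newlines
        piece = " " + token
        if len(piece) > max_len:
            piece = piece[:max_len]  # assumption: truncate over-long tokens
        if len(current) + len(piece) > max_len:
            # The next word does not fit: pad with blanks and start a new sequence.
            sequences.append(current.ljust(max_len))
            current = ""
        current += piece
    if current:
        sequences.append(current.ljust(max_len))
    return sequences


for seq in build_sequences("men bir koy berdim", max_len=10):
    print(repr(seq))
```

With `max_len=10` the four-word example is split into three sequences, each exactly ten characters long; with the default of 30 it fits into a single padded sequence.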

Noisy Channel Model
The channel probability, in other words the error model, of the NCM is generated according to Table 1. For example, the probability p(l 2 |l 1 ) of l 1 ='ş' turning into l 2 ='x' is 1/3, since ş has three alternatives. We generate a 3-gram language model by running the KenLM language modeling tool (Heafield, 2011) on our collected text. The noisy channel model method normalizes the text word by word, ranking all possible candidates by their probabilities and selecting the most probable one. These candidates are generated with Table 1. For example, the word "xax" will have 8 candidates: "xax, şax, xex, şex, xeş, şeş, şaş, xaş", since both "x" and "a" have two alternatives. According to our experiment, on average, 1074 candidates are proposed for each word. However, we filter these candidates with a dictionary, which includes all unique words from the raw text. With this dictionary filter, the 1074 candidates are reduced to an average of 1.6 candidates. After filtering, the remaining candidates are passed to the noisy channel model to find the one with the highest likelihood. If all candidates are filtered out, the original word is kept.
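Candidate generation and dictionary filtering can be sketched with a Cartesian product over the per-letter alternatives. The `REVERSE_MAP` table below covers only the two letters used in the "xax" example and the dictionary is a made-up toy set; both are assumptions for illustration, and the final ranking with the language model is left out.

```python
from itertools import product

# Hypothetical excerpt: noisy letter -> clean letters it may stand for.
REVERSE_MAP = {
    "x": ["x", "ş"],
    "a": ["a", "e"],
}


def generate_candidates(noisy_word):
    """All clean spellings reachable by choosing one option per letter."""
    options = [REVERSE_MAP.get(ch, [ch]) for ch in noisy_word]
    return {"".join(combo) for combo in product(*options)}


def filter_candidates(noisy_word, dictionary):
    """Keep in-dictionary candidates; fall back to the original word if none."""
    kept = generate_candidates(noisy_word) & dictionary
    return kept if kept else {noisy_word}


toy_dictionary = {"şax", "kitab"}  # hypothetical dictionary
print(sorted(generate_candidates("xax")))
print(filter_candidates("xax", toy_dictionary))
```

Because the number of candidates is the product of the number of options per letter, it grows quickly with word length, which is why the dictionary filter matters before any scoring is done.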

Results and Analysis
The performance of the two models is evaluated with two tests: the UULA text restoration test and the IME recommendation test. The former tests the accuracy of a model in restoring documents with UULA noise. The latter checks a model's accuracy in predicting the word being typed. In the IME recommendation test, we assume that the models have limited access to previous words. Therefore, we test the two models by providing them with a limited number of previous words (at most two words in IME testing). In fact, the noisy channel model always has limited access to the previous context, so its results are the same for the two tests.
Accuracy results of the tests are calculated as in Equation 7:

$$\text{Accuracy} = \frac{\#\text{ of correct words}}{\#\text{ of words}} \qquad (7)$$

where "correct words" means correctly recommended or restored words. We did not calculate the precision-recall value, since the recall is always equal to 1, and precision is equal to the accuracy. From Table 6, we can see that both the neural encoder-decoder model and the noisy channel model show high performance on the synthetic dataset. However, the noisy channel model is slightly better than the encoder-decoder model. Table 7 shows that both models are suitable for developing an IME specialized for UULA restoration. However, the 2-gram noisy channel model returns the best performance. We believe that there are three possible explanations for why the NCM outperforms the NEDM on the synthetic dataset:
1) The dictionary used in the NCM is very robust; it filters out almost all of the unqualified candidates.
2) The channel model used in the NCM is idealized, because it is computed exactly from the confusion table that generated the synthetic noise rather than approximated from data.
3) The NEDM needs more training with synthetic data pairs.
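For reference, the word-level accuracy of Equation 7 can be computed as in the short sketch below. The function name is hypothetical, and the sketch assumes that the restored text and the ground truth contain the same number of whitespace-separated words, so that they can be compared position by position.

```python
def word_accuracy(restored_text, reference_text):
    """Fraction of words in the restored text that match the ground truth."""
    restored = restored_text.split()
    reference = reference_text.split()
    if not reference or len(restored) != len(reference):
        raise ValueError("texts must be non-empty and have the same word count")
    correct = sum(r == g for r, g in zip(restored, reference))
    return correct / len(reference)


print(word_accuracy("men bir koy berdim", "men bir qoy berdim"))
```

Averaging this value over the ten corrupted variants of a file, and then over the files, would give a synthetic-data score of the kind described in the Dataset section.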
In the real cases, as Table 8 shows, the neural encoder-decoder model is slightly better than the noisy channel model. In the real dataset, some words are not included in the dictionary, so the noisy channel model cannot restore them correctly. Besides, other factors such as spelling errors, misuse of punctuation and redundant repetition pose a greater challenge to the noisy channel model than to the neural encoder-decoder model, since the former works at the word level while the latter works at the character level. Tables 9 and 10 and Figure 3 give qualitative results, in which both the NCM and the NEDM fail to restore certain noisy words. The NCM fails to restore a noisy word when the corresponding original word does not appear in the dictionary or has a negligible N-gram score. Meanwhile, the NEDM tends to map characters to popular patterns. Therefore, in a few cases, it restores noisy words to unexpected ones.
Table 9: Examples of comparison of two models and the baselines on synthetic UULA texts (Underlined means the original noisy text. Italic means the text is erroneously restored to nonstandard text. Bold means the text is wrongly restored to an unwanted (but in dictionary) text).

Conclusion and Future Work
In this work, we propose two models for normalizing Uyghur UULA texts. The noisy channel model views the problem as a spell-checking problem, while the neural encoder-decoder model views it as a machine translation problem. Both of them return highly accurate results on the restoration and recommendation tasks on the synthetic dataset. However, their accuracy on real data would benefit from further improvement. One possible strategy for improving their performance on the real dataset is to consider the other kinds of noise that appear in it. In future work, we will update our models to handle other noisy elements in the real dataset, such as spelling errors and the misuse of punctuation. We believe that it is easier to adapt the neural encoder-decoder model to these new challenges than the noisy channel model, because the former only requires fine-tuning on extra data for different kinds of noise, while the latter requires redesigning the model structure.