Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Building tools for code-mixed data is rapidly gaining popularity in the NLP research community, as such data is rising exponentially on social media. Working with code-mixed data poses several challenges, especially due to grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media text. In this article, we present a novel architecture focusing on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that, in addition to normalizing, it can also be utilized for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.


Introduction
With the rising popularity of social media, the amount of data is rising exponentially. If mined, this data can prove to be useful for various purposes. In countries where the number of bilinguals is high, users tend to switch back and forth between multiple languages, a phenomenon known as code-mixing or code-switching. An interesting case is switching between languages that have different native scripts. On such occasions, one of the two languages is typed in its phonetically transliterated form in order to use a common script. Though there are some standard transliteration rules, for example ITRANS and ISO, it is extremely difficult and unrealistic for people to follow them while typing. Indeed, identical words are transliterated differently by different people based on their own phonetic judgment, influenced by dialect, location, or sometimes even the informality or casualness of the situation. Thus, for creating systems for code-mixed data, normalization of transliterated text after language tagging is extremely important in order to identify the word and understand its semantics. This would help considerably in systems such as opinion mining, and is necessary for tasks like summarization and translation. A normalizing module would also be of immense help when building word embeddings for code-mixed data.
In this paper, we present an architecture for automatic normalization of phonetically transliterated words to their standard forms. The language pair we have worked on is Bengali-English (Bn-En), where both are typed in Roman script; thus the Bengali words are in their transliterated form. The canonical or normalized form we have considered is the Indian Languages Transliteration (ITRANS) form of the respective Bengali word. Bengali is an Indo-Aryan language of India, where 8.10% of the total population are first-language speakers, and is also the official language of Bangladesh. The native script of Bengali is the Eastern Nagari script. Our architecture utilizes fully character-based sequence-to-sequence learning in addition to Levenshtein distance to give the final normalized form, or a form as close to it as possible. An additional advantage of our system is that at an intermediate stage the back-transliterated form of the word can be fetched (i.e. word identification), which is useful in several cases because tools built for the native script, for example emotion lexicons, can then be applied. Other important contributions of our research are the new lexicons that have been prepared (discussed in Sec 3), which can be used for building various other tools for studying Bengali-English code-mixed data.

Related Work
Normalization of text has been studied extensively (Sproat et al., 1999), especially as it acts as a pre-processing step for several text processing systems. Using Conditional Random Fields (CRF), Zhu et al. (2007) performed text normalization on informal emails. Dutta et al. (2015) created a system based on the noisy channel model for text normalization which handles wordplay, contracted words and phonetic variations in a code-mixed setting. An unsupervised framework was presented by Sridhar (2015) for normalizing domain-specific and informal noisy texts using distributed representations of words. The soundex algorithm was used by Sitaram et al. (2015) and Sitaram and Black (2016) for spelling correction of transliterated words and for normalization in a speech-to-text scenario of code-mixed data, respectively. Sharma et al. (2016) built a normalization system using a noisy channel framework and the SILPA spell checker in order to build a shallow parser. Sproat and Jaitly (2016) built a system combining two models, where one is essentially a seq2seq model which checks the possible normalizations and the other is a language model which considers context information. Jaitly and Sproat (2017) used a seq2seq model with attention trained at sentence level, followed by error pruning using finite-state filters, to build a normalization system mainly targeted at text-to-speech purposes. A similar flow was adopted by Zare and Rohatgi (2017), where seq2seq was used for normalization and a window of size 20 was considered for context. Singh et al. (2018) exploited the fact that words and their variations share similar contexts in large noisy text corpora to build their normalizing model, using skip-gram and clustering techniques. To the best of our knowledge, the system architecture we propose has not been tried before, especially for code-mixed data.

Data Sets
In total, three data sets or lexicons were created. The first data set was a parallel lexicon (PL) where the first column had phonetically transliterated Bn words in Roman script taken from the code-mixed data prepared in Mandal et al. (2018b). The second column consisted of the standard Roman transliterations (ITRANS) of the respective words. To get this, we first manually back-transliterated PL col 1 to the original word in Eastern Nagari script, and then converted it into the standardized ITRANS format. The final size of the PL was 6000. The second lexicon we created was a transliteration dictionary (BN TRANS) where the first column had Bengali words in Eastern Nagari script taken from samsad, while the second column had the standard transliterations (ITRANS). The number of entries in the dictionary was 21850. For testing, we took the data used in Mandal and Das (2018), language tagged it using the system described in Mandal et al. (2018a), and then collected the Bn tagged tokens. After manual checking and discarding of misclassified tokens, the size of the list was 905. Finally, each of the words was tagged with its ITRANS using the same approach used for making PL. For PL col 1 and the test data, some initial rule-based normalization techniques were used. If the input string contained a digit, the digit was replaced by the respective phone (e.g. ek for 1, dui for 2, etc.), and if there were n consecutive identical characters where n > 2 (elongation), the run was trimmed down to 2 consecutive characters (e.g. baaaad becomes baad), as no word in its standard form has more than two consecutive identical characters.
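The two rule-based steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rule_normalize` and the table `DIGIT_TO_PHONE` are hypothetical, and the table holds only the two digit-to-phone pairs given in the text (the remaining digits would have to be supplied).

```python
import re

# Only the two pairs mentioned in the text; the full table is not given there.
DIGIT_TO_PHONE = {"1": "ek", "2": "dui"}

def rule_normalize(token: str) -> str:
    """Apply the two initial rule-based normalizations to one token."""
    # Replace each known digit by its phone; digits missing from the table are kept.
    token = "".join(DIGIT_TO_PHONE.get(ch, ch) for ch in token)
    # Trim any run of three or more identical characters down to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", token)

print(rule_normalize("baaaad"))  # elongation is trimmed
print(rule_normalize("1ta"))     # the digit is replaced by its phone
```

Digit replacement runs first so that any elongation is trimmed on the final character sequence.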

Proposed Method
Our method is a two-step modular approach comprising two degrees of normalization. The first normalization module performs an initial normalization and tries to bring the input string as close as possible to the standard transliteration. The second normalization module takes the output of the first module and tries to match it against the standard transliterations present in the dictionary (BN TRANS). The closest-matching candidate is returned as the final normalized string.

First Normalization Module
The purpose of this module is to phonetically normalize the word as close to the standard transliteration as possible, to make the work of the matching module easier. To achieve this, our idea was to train a sequence-to-sequence model where the input sequences are user-transliterated words and the target sequences are the respective ITRANS transliterations. We specifically chose this architecture as it has performed very well in complex sequence mapping tasks such as neural machine translation and summarization.

Seq2Seq Model
The sequence-to-sequence model (Sutskever et al., 2014) is a relatively new idea for sequence learning using neural networks. It has been especially popular since it achieved state-of-the-art results in machine translation. Essentially, the model takes as input a sequence $X = \{x_1, x_2, \ldots, x_n\}$ and tries to generate another sequence $Y = \{y_1, y_2, \ldots, y_m\}$, where $x_i$ and $y_i$ are the input and target symbols respectively. The architecture of the seq2seq model comprises two parts, the encoder and the decoder. As the input and target sequences were quite short (words), an attention mechanism (Vaswani et al., 2017) was not incorporated.

Encoder
The encoder takes a variable-length sequence as input and encodes it into a fixed-length vector, which is supposed to summarize its meaning, taking context into account as well. A recurrent neural network (RNN) cell is used to achieve this. The directional encoder reads the sequence from one end to the other (left to right in our case):
$$h_t = f_{enc}(E_x(x_t), h_{t-1})$$
Here, $E_x$ is the input embedding lookup table (dictionary) and $f_{enc}$ is the transfer function of the recurrent unit, e.g. vanilla RNN, LSTM or GRU. A contiguous sequence of encodings $C = \{h_1, h_2, \ldots, h_n\}$ is constructed, which is then passed on to the decoder.

Decoder
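The decoder takes the context vector C from the encoder as input and computes the hidden state at time t as
$$s_t = f_{dec}(E_y(y_{t-1}), s_{t-1}, c_t)$$
where $f_{dec}$ is the transfer function of the decoder's recurrent unit and $E_y$ is the target embedding lookup table. Subsequently, a parametric function $out_k$ returns the conditional probability of the next target symbol being k:
$$p(y_t = k \mid y_{<t}, X) = \frac{1}{Z} \exp\big(out_k(E_y(y_{t-1}), s_t, c_t)\big)$$
Z is the normalizing constant
$$Z = \sum_j \exp\big(out_j(E_y(y_{t-1}), s_t, c_t)\big)$$
Here, the concept of teacher forcing is utilized, i.e. the strategy of feeding the ground-truth target symbol from the prior time-step as input during training, rather than the model's own prediction.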
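As a small illustration of the two ideas in this subsection, the sketch below computes the normalized probabilities from a list of raw output scores, and builds the (previous symbol, symbol to predict) pairs that teacher forcing feeds to the decoder during training. The function names and the tab character used as start-of-sequence marker are assumptions made for this example, not details taken from our implementation.

```python
import math

def softmax(scores):
    """Turn raw scores out_j into probabilities exp(out_k) / Z."""
    m = max(scores)  # shifting by the max leaves the result unchanged but avoids overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # normalizing constant of the shifted scores
    return [e / z for e in exps]

def teacher_forcing_pairs(target, start="\t"):
    """Pair each decoder input (previous gold symbol) with the symbol to predict."""
    previous = [start] + list(target[:-1])
    return list(zip(previous, target))

probs = softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6))
print(teacher_forcing_pairs("kal"))
```

At inference time no gold symbols are available, so the decoder's own previous prediction takes the place of the first element of each pair.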

Training
The model is trained by minimizing the negative log-likelihood. For training, we used the fully character-based seq2seq model (Lee et al., 2016) with stacked LSTM cells. The input units were user-typed phonetic transliterations (PL col 1) while the target units were the respective standard transliterations (PL col 2). Thus, the model learns to map user transliterations to standard transliterations, effectively learning to normalize phonetic variations. The lookup table $E_x$ we used for character encoding was a dictionary where the keys were the 26 letters of the English alphabet and the values were their respective indices. Character-level encodings were then padded to the length of the longest word in the dataset, which was 14, and converted to one-hot encodings prior to being fed to the seq2seq model. We created our seq2seq model using the Keras (Chollet et al., 2015) library. The batch size was set to 64 and the number of epochs to 100. The size of the latent dimension was kept at 128. The optimizer we chose was rmsprop with the learning rate set at 0.001, the loss was categorical cross-entropy, and the activation function used was softmax. Accuracy and loss graphs during training with respect to epochs are shown in Fig 1. As we can see from Fig 1, the accuracy reached at the end of training was not very high (around 41.2%) and the curve flattened out. This is understandable, as the amount of training data was relatively low for the task and the phonetic variations were quite high. On running this module on our testing data, an accuracy of 51.04% was achieved. It should be noted that even a single character predicted wrongly by the softmax function reduces the accuracy.
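The character encoding step described above might look roughly as follows. This is a minimal sketch that assumes lowercase input restricted to the 26 letters, and that padding positions are represented by all-zero rows; the padding representation is not specified in the text, and the names `one_hot_encode`, `CHAR_TO_INDEX` and `MAX_LEN` are hypothetical.

```python
import string

MAX_LEN = 14                        # longest word in the dataset, as stated in the text
ALPHABET = string.ascii_lowercase   # the 26 letters used as lookup keys
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_encode(word, max_len=MAX_LEN):
    """Return a max_len x 26 matrix; rows beyond the word length stay all-zero."""
    if len(word) > max_len:
        raise ValueError("word is longer than the padded length")
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for position, ch in enumerate(word):
        matrix[position][CHAR_TO_INDEX[ch]] = 1
    return matrix

encoded = one_hot_encode("kal")
print(len(encoded), len(encoded[0]))  # padded length and alphabet size
```

A batch of such matrices forms the 3-dimensional input (words, positions, characters) that a character-level seq2seq model typically expects.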

Second Normalization Module
This module essentially comprises the string matching algorithm. For this, we have used Levenshtein distance (LD) (Levenshtein, 1966), which is a string metric for measuring the difference between two sequences. It does so by calculating the minimum number of insertions, deletions and substitutions required to convert one sequence into the other. Here, the output from the previous module is compared with all the standard ITRANS entries present in BN TRANS, and the string with the least Levenshtein distance is given as output, which is the final normalized form. If there are ties, the candidate with more matches when traversing from left to right is given priority. Also, observing the errors from the first normalizer, we noticed that the character pairs {a,o} and {b,v} are quite often used interchangeably (language-specific phonological features), both among phonetic transliterations themselves and when compared with ITRANS. Thus, along with the standard approach, we tried a modified version as well, where the substitution cost within each of the above-mentioned character pairs is zero, i.e. they are treated as identical characters. This was done simply by assigning a special symbol to each pair and replacing the characters in the input strings. For example, after replacement, distance(chalo, chala) becomes distance(ch$l$, ch$l$).
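The matching step can be sketched as below. Several details are assumptions made for illustration: the symbol `#` for the {b,v} pair (the text only shows `$` for {a,o}), the reading of the tie-break rule as "longest common prefix", the function names, and the toy lexicon entries, which are not real BN TRANS data.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution (free if equal)
        prev = curr
    return prev[-1]

# One shared symbol per interchangeable pair; '#' is an assumed choice.
MERGE = str.maketrans({"a": "$", "o": "$", "b": "#", "v": "#"})

def common_prefix_len(a, b):
    """Number of matching characters when traversing both strings from the left."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def closest_entry(word, lexicon, merge_pairs=True):
    """Return the entry with the least distance; ties go to the longer common prefix."""
    norm = (lambda s: s.translate(MERGE)) if merge_pairs else (lambda s: s)
    w = norm(word)
    return min(lexicon,
               key=lambda e: (levenshtein(w, norm(e)), -common_prefix_len(w, norm(e))))

print(closest_entry("chalo", ["bhalo", "chala"]))
```

Setting `merge_pairs=False` gives the standard comparison, so both variants can be run over the same lexicon.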

Evaluation
Our system was evaluated in two ways, one at word level and another at task level.

Word Level
Here, the basic idea was to compare the normalized words with the respective standard transliterations. For this, the testing data discussed in Sec 3 was used. For comparison purposes, three setups other than our proposed model (setup 4) were tested, all of which are described in Table 1. From Table 1, we can see that the jump in accuracy from setup 1 to setup 3 is quite significant (by 30.94%). This shows that, instead of a simple distance comparison with lexicon entries, a prior seq2seq normalization can have a great impact on performance. Additionally, we can see that when modified input is given to the Levenshtein distance (LD), the accuracies achieved are slightly better. On analyzing the errors, we found that the majority (92%) of them were due to the standard form not being present in BN TRANS, i.e. being out of vocabulary. These words were mostly slang, expressions, or two words joined into a single one. The other 8% were due to the first module causing substantial deviation from the normal form. For deeper analysis, we collected the ITRANS of the out-of-vocabulary errors and compared them with the first normalizations; the mean LD was calculated to be 1.89, which suggests that had they been present in BN TRANS, the normalizer would have given the correct output.

Task Level
For task level evaluation, we decided to go with sentiment analysis on Bengali-English code-mixed data, using the exact setup and data described in Mandal et al. (2018b). All the training and testing data were normalized using our system along with the lexicons that are mentioned. Finally, the same steps were followed and the different metrics were calculated. The comparison of the systems prior to normalization (noisy) and post normalization (normalized) is shown in Table 2. We can see an improvement in accuracy (by 1.5%). On further investigation, we saw that the unigram, bigram and trigram matches between the bag of n-grams and the testing data increased by 1.6%, 0.4% and 0.1% respectively. The accuracy can be improved further if back-transliteration is done and Bengali sentiment lexicons are used, but that is beyond the scope of this paper.

Discussion
Though our proposed model achieved high accuracy, there are some drawbacks. The first is the requirement of a parallel corpus (PL) for training the seq2seq model, as manual checking and back-transliteration is quite tedious. Also, the processing speed in terms of words/second is not very high, because both seq2seq inference and Levenshtein distance calculation are computationally heavy, in addition to the O(n) search over the lexicon. For string matching, simpler and faster methods can be tested, and search area reduction algorithms (e.g. specifying search domains depending on the starting character) can be tried to improve the processing speed. A lexical checker can be added as well prior to seq2seq to see whether the word is already in its standard transliterated form.
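One way the search area reduction mentioned above could be realized is to bucket the lexicon by starting character, so that the distance computation only runs over one bucket. The sketch below is an untested idea rather than part of the evaluated system; the names `build_index` and `candidates` and the toy entries are hypothetical, and the fallback to the full lexicon for an unseen starting character is an assumption.

```python
from collections import defaultdict

def build_index(lexicon):
    """Group lexicon entries by their first character."""
    index = defaultdict(list)
    for entry in lexicon:
        if entry:
            index[entry[0]].append(entry)
    return index

def candidates(word, index, lexicon):
    """Return only entries sharing the word's first character, else the full lexicon."""
    bucket = index.get(word[:1])
    return bucket if bucket else lexicon

lexicon = ["chala", "bhalo", "chai"]   # toy entries, not real BN TRANS data
index = build_index(lexicon)
print(candidates("chalo", index, lexicon))
```

The trade-off is that a word whose first character is itself a variant (for instance b typed for v) would be looked up in the wrong bucket, so the starting character may need the same pair merging as the distance computation.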

Conclusion & Future Work
In this article, we have presented a novel architecture for normalization of transliterated words in code-mixed scenarios. We employed the seq2seq model with LSTM cells for initial normalization, followed by evaluating Levenshtein distance to retrieve the standard transliteration from a lexicon. Our approach achieved an accuracy of 90.27% on the testing data, and improved the accuracy of a pre-existing sentiment analysis system by 1.5%. In future, we would like to collect more transliterated words and increase the data size in order to improve both PL and BN TRANS. Also, combining this module with a context-capturing system and expanding to other Indic languages will be among our goals.