Tweet Normalization with Syllables

In this paper, we propose a syllable-based method for tweet normalization to study the cognitive process of non-standard word creation in social media. Assuming that the syllable plays a fundamental role in forming non-standard tweet words, we choose the syllable as the basic unit and extend the conventional noisy channel model by incorporating syllables to represent word-to-word transitions at both the word and syllable levels. Syllables are used in our method not only to suggest more candidates, but also to measure similarity between words. The novelty of this work is three-fold. First, to the best of our knowledge, this is an early attempt to explore syllables in tweet normalization. Second, our proposed normalization method relies on unlabeled samples, making it much easier to adapt to non-standard words from any period of history. Third, we conduct a series of experiments which show that the proposed method has an advantage over state-of-the-art solutions for tweet normalization.


Introduction
Due to the casual nature of social media, text expressions contain a large number of non-standard words, which makes them substantially different from formal written text. It has been reported that more than 4 million distinct out-of-vocabulary (OOV) tokens are found in the Edinburgh Twitter corpus (Petrovic et al., 2010). This variation poses challenges when performing natural language processing (NLP) tasks (Sproat et al., 2001) on such texts. Tweet normalization, which aims at converting these OOV non-standard words into their in-vocabulary (IV) formal forms, is therefore viewed as a very important pre-processing task.
Researchers have studied tweet normalization at different levels. A character-level tagging system is used in (Pennell and Liu, 2010) to solve deletion-based abbreviation. It was further extended by using more characters instead of Y or N as labels. The character-level machine translation (MT) approach (Pennell and Liu, 2011) was modified in (Li and Liu, 2012a) to work on character blocks. A string edit distance method was introduced in (Contractor et al., 2010) to represent word-level similarity, and this orthographical feature has been adopted in (Han and Baldwin, 2011) and (Yang and Eisenstein, 2013).
Challenges are encountered at each of these levels of tweet normalization. In the character-level sequential labeling systems, features are required for every character and their combinations, which introduces much more noise into the later reverse table look-up process. In the character-block level MT systems, an equal number of blocks and corresponding phonetic symbols is required for alignment (Li and Liu, 2012b). This strict restriction can make training set construction very difficult and can lose useful information. Finally, word-level normalization methods cannot properly model how non-standard words are formed, and some patterns or consistencies within words can be omitted or altered.
We observe that, given a non-standard word like tmr, people tend to first segment it into syllables like t-m-r. Then they find the corresponding standard word with syllables like to-mor-row. Inspired by this cognitive observation, we propose a syllable-based tweet normalization method, in which non-standard words are first segmented into syllables. Since we cannot determine whether the writer intended tmr to be segmented as tm-r (representing tim-er) or t-m-r (representing to-mor-row), every possible segmentation is considered. We then represent the similarity of standard syllables and non-standard syllables using an exponential potential function. After every transition probability between a standard syllable and a non-standard syllable is assigned, we use the noisy channel model and a Viterbi decoder to search for the most probable standard candidate in each tweet sentence.
Our empirical study reveals that the syllable is a proper level for tweet normalization. The syllable is similar to the character block, but it represents phonetic features naturally because every word is pronounced in syllables. Our syllable-based tweet normalization method utilizes effective features of both the character level and the word level: (1) Like character-level methods, it can capture more detailed information about how non-standard words are generated; (2) Like word-level methods, it removes a large number of noisy candidates. Instead of using domain-specific resources, our method makes good use of standard words to extract linguistic features. This makes our method extendable to new normalization tasks or domains.
The rest of this paper is organized as follows: previous work in tweet normalization is reviewed and discussed in Section 2. Our approach is presented in Section 3. In Section 4 and Section 5, we provide implementation details and results. We then analyze the results in Section 6. The work is concluded in Section 7.

Related Work
Non-standard words exhibit different forms and change rapidly, but people can still figure out their original standard words. To properly model this human ability, researchers study what remains unchanged under this dynamic behavior. Human normalization of a non-standard word can proceed as follows: after realizing that the word is non-standard, people usually first figure out standard candidate words in various ways. Then they replace the non-standard word with a standard candidate in the sentence to check whether the sentence carries a meaning. If not, they switch to a different candidate until a good one is found. Most existing normalization methods follow the same procedure: candidates are first generated, and then put into the sentence to check whether a reasonable sentence can be formed. The differences lie in how the candidates are generated and weighted. Related work can be classified into three groups.

Orthographical similarity
Orthographical similarity is built upon the assumption that a non-standard word looks like its standard counterpart, leading to a high Longest Common Sequence (LCS) and a low Edit Distance (ED). This method is widely used in spell checkers, in which the LCS and ED scores are calculated to weight possible candidates. The problem is that the correct word is not always the one that looks most similar. Taking the non-standard word nite as an example, note looks more similar than the correct form night. To overcome this problem, an exception dictionary of strongly-associated word pairs is constructed in (Gouws et al., 2011). Further, these pairs are added into a unified log-linear model in (Yang and Eisenstein, 2013), and Monte Carlo sampling techniques are used to estimate the parameters.
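The nite example can be checked with a minimal sketch. It assumes plain character-level LCS length and Levenshtein distance, and uses LCS minus ED as an illustrative combined score; the function names are hypothetical and not taken from any cited system.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute, each at cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def orthographic_score(nonstandard: str, candidate: str) -> int:
    """Illustrative score: higher means the two strings look more alike."""
    return lcs_length(nonstandard, candidate) - edit_distance(nonstandard, candidate)


for candidate in ("note", "night"):
    print(candidate, orthographic_score("nite", candidate))
```

Under this score the wrong candidate note ranks above night, which is the failure mode that purely orthographical methods run into.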

Phonetic similarity
The assumption underlying phonetic similarity is that, during transition, non-standard words sound like their standard counterparts, so the pronunciation of non-standard words can be traced back to a standard dictionary. The challenge is finding an algorithm to annotate the pronunciation of non-standard words. The Double Metaphone algorithm (Philips, 2000) is used to decode pronunciation, and phonetic similarity is then represented by the edit distance between these transcripts (Han and Baldwin, 2011). IPA symbols are utilized in (Li and Liu, 2012b) to represent the sound of words, and word alignment-based machine translation is then applied to generate possible pronunciations of non-standard words. Phonemes have also been used as one kind of feature to train a CRF model.

Contextual similarity
It is accepted that after standard words are transformed into non-standard words, the meaning of a sentence remains unchanged. So the normalized standard word must carry a meaning in its sentence. Most researchers use an n-gram language model to normalize a sentence, and several studies use more contextual information. For example, training pairs have been generated by a cosine contextual similarity formula whose items are defined by the TF-IDF scheme. A bipartite graph is constructed in (Hassan and Menezes, 2013) to represent tokens (both non-standard and standard words) and their context. Random walks on the graph can then represent contextual similarity between non-standard and standard words. Very recently, word embedding (Mikolov et al., 2010; Mikolov et al., 2013) has been utilized in (Li and Liu, 2014) to represent more complex contextual relationships.
In word-to-word candidate selection, most studies use orthographical similarity and phonetic similarity separately. In the log-linear model (Yang and Eisenstein, 2013), edit distance is modeled as the major feature. In the character- and phone-based approaches (Li and Liu, 2012b), orthographical information and phonetic information are treated separately to generate candidates.
In (Han and Baldwin, 2011), candidates from lexical edit distance and phonemic edit distance are merged together, and an increase in recall of up to 16% was reported when adding candidates from the phonetic measure. But an improper processing level makes it difficult to model the two types of information simultaneously: (1) A single character can hardly reflect the orthographical features of a word.
(2) Because fine-grained, reasonable restrictions are lacking, as shown in (Han and Baldwin, 2011), the number of candidates grows several-fold when phonetic candidates are added, which brings much more noise. To combine orthographical and phonetic measures at a fine-grained level, we propose the syllable-level approach.

Framework
The framework of the proposed tweet normalization method is presented in Figure 1. The proposed method extends the basic HMM channel model (Choudhury et al., 2007; Cook and Stevenson, 2009) to the syllable level. The following four characteristics are of particular interest.
(1) Combination: When reading a sentence, fast subvocalization occurs in our mind. In the process, some non-standard words generated by phonetic substitution are correctly pronounced and then normalized. Also, because subvocalization is fast, people tend to ignore some minor flaws in spelling, intentionally or unintentionally. As this often occurs when people interact with social media language in real life, we believe the combination of phonetic and orthographical information is of great significance.
(2) Syllable level: Inspired by Chinese normalization (Xia et al., 2006) using pinyin (phonetic transcripts of Chinese), the syllable can be seen as the basic unit when processing pronunciation. Unlike mono-syllable Chinese words, English words can be multi-syllable, so extra layers of syllables must be taken into consideration in our method. Thus, in addition to the word-based noisy channel model, we extend it into a syllable-level framework.
(3) Prior knowledge: Prior knowledge is acquired from standard words, meaning that both standard syllabification and pronunciation can shed some light on non-standard words. This assumption makes it possible to obtain non-standard syllables by standard syllabification and to obtain the pronunciation of syllables from standard words and rules generated from them.
(4) General patterns: Social media language changes rapidly, while labeled data is expensive and thus limited. To solve the problem effectively, linguistic features should be emphasized instead of statistical features. We exploit the syllables, pronunciation and possible transition patterns of standard words and propose the four-layer HMM-based model (see Figure 1).
In our method, non-standard words c_i are first segmented into syllables sc_i, and we calculate their similarity to standard syllables by combining the orthographical and phonetic measures. Standard syllables sw make up a standard candidate. Once candidates are generated and weighted, we use a Viterbi decoder to perform sentence normalization. Table 1 shows some possible candidates for the non-standard word tmr.
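The segmentation step can be sketched as an exhaustive enumeration of split points. This is only an illustration of the search space: the function name is hypothetical, and the implementation described later in the paper obtains candidate segmentations from a CRF model rather than by brute force.

```python
from typing import List


def segmentations(word: str) -> List[List[str]]:
    """Return every way to split word into contiguous, non-empty pieces."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        # Recurse on the remainder and prepend the chosen head piece.
        for rest in segmentations(word[i:]):
            results.append([head] + rest)
    return results


for pieces in segmentations("tmr"):
    print("-".join(pieces))
```

A word of n letters has 2^(n-1) such segmentations, so exhaustive enumeration is only practical for short tokens.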

Method
We extend the noisy channel model to the syllable level. Here w denotes the standard word and c the non-standard word, and sw and sc represent their syllabic forms, respectively. To simplify the problem, we restrict the number of standard syllables to be equal to the number of non-standard syllables in our method.
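One way to write this extension is sketched below. It assumes the standard noisy channel decomposition, approximates the word-level channel probability by its syllable-level counterpart, and assumes that syllables are transformed independently. The index j and the syllable count k are notation introduced here for illustration.

```latex
\hat{w} = \arg\max_{w} \, p(w \mid c) = \arg\max_{w} \, p(c \mid w)\, p(w),
\qquad
p(c \mid w) \approx p(sc \mid sw) = \prod_{j=1}^{k} p(sc_j \mid sw_j)
```

The restriction to an equal number k of standard and non-standard syllables is what makes the one-to-one pairing of sc_j with sw_j well defined.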
Assuming that syllables are transformed independently of each other, the word-level transition probability factorizes over the aligned syllable pairs. For syllable similarity, we use an exponential potential function to combine orthographical distance and phonetic distance. Because pronunciation can be represented using letter-to-phone transcripts, we can treat the string similarity of these transcripts as phonetic similarity:
\[ \Phi(sc, sw) = \exp\bigl(\lambda\,(\mathrm{LCS}(sc, sw) - \mathrm{ED}(sc, sw)) + (1 - \lambda)\,(\mathrm{PLCS}(sc, sw) - \mathrm{PED}(sc, sw))\bigr) \tag{5} \]
The exponential function grows rapidly as its argument increases, so much more weight is assigned when syllables are more similar. The parameter λ is used to empirically adjust the relative contribution of letters and sounds. Longest common sequence (LCS) and edit distance (ED) are used to measure orthographical similarity, while phonetic longest common sequence (PLCS) and phonetic edit distance (PED) are used to measure phonetic similarity based on letter-to-sound transcripts. PLCS is defined in the same way as the basic LCS, but PED is slightly different.
Table 1: Possible candidates for the non-standard word tmr.
tmr      t-mr      tm-r      t-m-r
tamer    ta-mer    tim-er    to-mor-row
         ti-mor    tim-ber   tri-mes-ter
         ti-more   ton-er    tor-men-tor
         tu-mor    tem-per   ta-ma-ra
         ...       ...       ...
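The potential function can be sketched in a few lines. This is a minimal illustration under stated assumptions: PLCS and PED are computed as plain LCS and edit distance over phone sequences, which ignores the consonant-specific treatment of PED in the paper, and the function names and the hand-written phone lists are hypothetical.

```python
import math
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Longest common subsequence length over letters or phone symbols."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance over letters or phone symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def phi(sc: str, sw: str, sc_phones: Sequence[str], sw_phones: Sequence[str],
        lam: float = 0.7) -> float:
    """Exponential potential combining orthographic and phonetic similarity."""
    ortho = lcs_length(sc, sw) - edit_distance(sc, sw)
    # Assumption: plain edit distance stands in for the paper's PED.
    phon = lcs_length(sc_phones, sw_phones) - edit_distance(sc_phones, sw_phones)
    return math.exp(lam * ortho + (1.0 - lam) * phon)


# Syllable pairs for tmr -> to-mor-row, with phones as transcribed in the text.
pairs = [("t", "to", ["T"], ["T", "UW"]),
         ("m", "mor", ["M"], ["M", "AA"]),
         ("r", "row", ["R"], ["R", "OW"])]
for sc, sw, sc_ph, sw_ph in pairs:
    print(sc, sw, phi(sc, sw, sc_ph, sw_ph))
```

Setting lam to 1 uses only the orthographic term and setting it to 0 uses only the phonetic term, which matches the role of λ described above.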
When calculating phonetic similarity between syllables, we follow (Xia et al., 2006) in treating consonants and vowels separately, because a change of consonants can produce a totally different pronunciation. So if the consonants of sc_j and sw_j are exactly the same or fit the rules listed in Table 2, PED(sc_j, sw_j) equals edit

Parameter
The only parameter in the proposed method is λ in Equation (5), which represents the relative contribution of orthographical similarity and phonetic similarity. Because the amount of annotated data is limited, we enumerate the parameter over {0, 0.1, 0.2, ..., 1} in the experiments to find the optimal setting.
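The enumeration amounts to a one-dimensional grid search, sketched below. The scoring callback is a hypothetical stand-in for running the full normalization pipeline on a labeled dataset and returning its F-score.

```python
from typing import Callable, List, Tuple


def lambda_grid(steps: int = 10) -> List[float]:
    """Evenly spaced values 0, 0.1, ..., 1 (for steps=10)."""
    return [i / steps for i in range(steps + 1)]


def best_lambda(score: Callable[[float], float]) -> Tuple[float, float]:
    """Return (lambda, score) for the grid value with the highest score.

    score is a hypothetical callback, e.g. F-score of the normalizer
    when run with the given lambda.
    """
    results = [(lam, score(lam)) for lam in lambda_grid()]
    return max(results, key=lambda pair: pair[1])
```

Because the grid has only eleven points, each requiring one full evaluation pass, exhaustive enumeration is cheap compared with fitting the parameter by gradient-based training.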

Implementation
The method described in the previous section is implemented with the following details.

Preprocessing
Before performing normalization, we need to process several types of non-standard words:
• Words containing numbers: People usually substitute certain sounds with numbers, as in 4/four, 2/two and 8/eight, or use numbers to replace letters, as in 1/i and 4/a. So we replace each number with its word or character and then use the results to generate possible candidates.
• Words with repeating letters: As our method is syllable-based, letters repeated to express sentiment (like cooool, (Brody and Diakopoulos, 2011)) can cause syllabification to fail. We reduce each run of repeating letters both to two letters and to one letter and generate candidates for each form separately. The two lists are then merged to form the whole candidate list.
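Both preprocessing rules can be sketched as follows. The digit table contains only the substitutions named above, and treating a run of three or more identical letters as "repeating" is an assumption made for this sketch; the function names are hypothetical.

```python
import re
from itertools import product
from typing import List, Set

# Only the substitutions mentioned in the text; a real table would be larger.
DIGIT_SUBSTITUTIONS = {"1": ["i"], "2": ["two"], "4": ["four", "a"], "8": ["eight"]}


def expand_digits(token: str) -> Set[str]:
    """Replace each digit by every listed word or letter substitution."""
    options = [DIGIT_SUBSTITUTIONS.get(ch, [ch]) for ch in token]
    return {"".join(parts) for parts in product(*options)}


def reduce_repeats(token: str) -> List[str]:
    """Reduce runs of 3+ identical characters to two and to one character."""
    reduced_to_two = re.sub(r"(.)\1{2,}", r"\1\1", token)
    reduced_to_one = re.sub(r"(.)\1{2,}", r"\1", token)
    return [reduced_to_two, reduced_to_one]


print(sorted(expand_digits("b4")))
print(reduce_repeats("cooool"))
```

The expanded strings are not final answers; they are inputs to candidate generation, which is why both reductions of a repeated run are kept.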

Letter-to-sound conversion
Syllable in this work refers to orthographic syllables. For example, we convert the word tomorrow into to-mor-row. However, when comparing a syllable of a standard word with that of a non-standard word, the sounds (i.e., phones) of the syllables are considered. Thus letter-to-sound conversion tools are required. Several TTS systems can perform the task according to linguistic rules, even for non-standard words. The Double Metaphone algorithm used in (Han and Baldwin, 2011) is one of them, but it uses only consonants to encode a word, which gives less information than we need. In our method, we use FreeTTS (Walker et al., 2002) with the CMU lexicon to transform words into ARPAbet symbols. For example, the word tomorrow is transcribed as {T-UW M-AA R-OW} and tmr as {T M R}.

Dictionary preparation
• Dictionary #1: In-vocabulary (IV) words. Following (Yang and Eisenstein, 2013), our set of IV words is also based on the GNU aspell dictionary (v0.60.6). Unlike them, we use a collection of 100 million tweets (roughly the same size as the Edinburgh Twitter corpus), because the Edinburgh Twitter corpus is no longer available due to Twitter policies. The final IV dictionary contains 51,948 standard words.
• Dictionary #2: Syllables for the standard words. Following (Pennell and Liu, 2010), we use an online dictionary to extract syllables for each standard word. We encountered the same problem when accessing words with prefixes or suffixes, which are not syllabified in the same format as the base words on the website. To address the issue, we simply regard these prefixes and suffixes as syllables.
• Dictionary #3: Pronunciation of the syllables. Using the CMU pronouncing dictionary (Weide, 1998) and Dictionary #2, and knowing all possible ARPAbet symbols for all consonant characters, we programmatically collect every possible pronunciation of all syllables in the standard dictionary.

Automatic syllabification of non-standard words
Automatic syllabification of non-standard words is a supervised problem. A straightforward idea is to train a CRF model on manually labeled syllables of non-standard words. Unfortunately, such a corpus is not available and is very expensive to produce. We assume that both standard and non-standard forms follow the same syllable rules (i.e., the same cognitive process). Thus we propose to train the CRF model on a corpus of syllables of standard words (which is easy to obtain) and to construct an automatic annotation system based on CRF++ (Kudo, 2005). In this work, we extract the syllables of standard words from Dictionary #2 as the training set. Annotations follow (Pennell and Liu, 2010) in identifying the boundaries of syllables. In our work, CRF++ suggests several candidate segmentations for each non-standard word, rather than a single optimal one.
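Preparing the training data can be sketched as a conversion from hyphenated standard words to character-level boundary labels. The B/I label set used here (B marks the first letter of a syllable) is an assumption for illustration, not necessarily the annotation scheme of (Pennell and Liu, 2010); the output layout is the one-token-per-line, whitespace-separated column format that CRF++ reads, with a blank line between sequences.

```python
from typing import List, Tuple


def syllables_to_rows(syllabified: str) -> List[Tuple[str, str]]:
    """Turn 'to-mor-row' into (character, label) pairs.

    Label 'B' marks a syllable-initial character and 'I' any other
    character (a hypothetical label set chosen for this sketch).
    """
    rows = []
    for syllable in syllabified.split("-"):
        for position, ch in enumerate(syllable):
            rows.append((ch, "B" if position == 0 else "I"))
    return rows


def to_training_block(syllabified: str) -> str:
    """Format one word as a CRF++-style sequence followed by a blank line."""
    lines = [f"{ch}\t{label}" for ch, label in syllables_to_rows(syllabified)]
    return "\n".join(lines) + "\n\n"


print(to_training_block("to-mor-row"), end="")
```

At prediction time the same character column is produced for a non-standard token such as tmr, and each labeling returned by the model corresponds to one candidate segmentation.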
In the HMM channel model, the candidate solutions are included as part of the search space.
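The search over this space can be sketched with a first-order Viterbi decoder. The sketch assumes that each token position already has a non-empty set of standard candidates with channel log-scores (for example, summed log Φ values), and that a bigram language model is available as a callback; both the data layout and the callback are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def viterbi(candidates: List[Dict[str, float]],
            bigram_logp: Callable[[str, str], float]) -> List[str]:
    """Return the highest-scoring candidate sequence.

    candidates[i] maps each candidate word at position i to its channel
    log-score; bigram_logp(prev, word) is a language-model log-probability.
    """
    # best[w] = (score of the best path ending in w, the path itself)
    best: Dict[str, Tuple[float, List[str]]] = {
        w: (bigram_logp("<s>", w) + s, [w]) for w, s in candidates[0].items()
    }
    for position in candidates[1:]:
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for w, s in position.items():
            # Pick the predecessor that maximizes path score plus transition.
            prev_w, (prev_score, prev_path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + bigram_logp(kv[0], w))
            new_best[w] = (prev_score + bigram_logp(prev_w, w) + s,
                           prev_path + [w])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]
```

An in-vocabulary token is represented by a single-candidate position, so only the non-standard positions widen the search.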

Datasets
We use two existing labeled Twitter datasets to evaluate our tweet normalization method.
• LexNorm1.2 is a revised version of LexNorm1.1 (Yang and Eisenstein, 2013). Some inconsistencies and errors in LexNorm1.1 are corrected and some more non-standard words are properly recovered.
In both datasets, the non-standard words to be normalized and their corresponding standard words are annotated manually.

Evaluation criteria
We use precision, recall and F-score to evaluate our method. As normalization methods on these datasets focus on the labeled non-standard words (Yang and Eisenstein, 2013), recall is the proportion of words requiring normalization which are normalized correctly, and precision is the proportion of normalizations which are correct. When we run the tweet normalization methods, every error is both a false positive and a false negative, so in this task precision equals recall.
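These definitions can be made concrete with a short sketch. It assumes the gold standard and the system output are aligned lists over the labeled non-standard words, with None marking a word the system left unchanged; the function name and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def precision_recall_f(gold: List[str],
                       predicted: List[Optional[str]]) -> Tuple[float, float, float]:
    """Compute precision, recall and F-score over labeled non-standard words."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned")
    correct = sum(1 for g, p in zip(gold, predicted) if p == g)
    attempted = sum(1 for p in predicted if p is not None)
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = ["tomorrow", "night", "thanks", "strong"]
system = ["tomorrow", "night", "tanks", "song"]
print(precision_recall_f(gold, system))
```

When the system proposes a normalization for every labeled word, the two denominators coincide, which is why precision equals recall in this setting.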
• (Han and Baldwin, 2011): the combined orthography-phone system using lexical edit distance and phonemic edit distance.
In our method, we set λ = 0.7 because it performs best in our experiments (see Figure 2). The experimental results are presented in Table 3 and indicate that our method outperforms the state-of-the-art methods. Details on how the parameter is adjusted are given in Section 5.4.
Recall our argument that the combination of the three similarities is necessary when performing sentence-level normalization. Apart from contextual similarity, such as a language model or a graph-based model, the methods in (Yang and Eisenstein, 2013) and (Hassan and Menezes, 2013) do not include a phonetic measure, causing a loss of important phonetic information. Although it uses phoneme, morpheme boundary and syllable boundary as features, the character-level reversed approach brings much more noise into the later reversed look-up table, and features of the whole word are omitted.
Like (Han and Baldwin, 2011), we also use a lexical measure and a phonetic measure. The major difference between the two approaches is the processing level: word level versus syllable level. In their work, the average number of candidates increases several-fold when the phonetic measure is added. This is because, when phonemic edit distance is introduced, important pronunciations can be altered (the phonemic edit distances of night-need and night-kite are equal). The syllable level allows us to reflect consistencies during transition at a finer-grained level. Thus the phonetic similarity can be modeled more precisely.

Contributions of phone and orthography
In our method, the parameter λ in Equation 5 represents the relative contributions of orthographical and phonetic information. Lacking prior knowledge, we cannot determine an optimal λ in advance. We therefore conduct experiments varying λ over {0, 0.1, ..., 1} to find out how this adjustment affects performance. The experimental results are presented in Figure 2. As shown in Figure 2, when λ is set to 0 or 1 (meaning that either the orthographical or the phonetic measure makes no contribution to the candidate weights), our method performs much worse. In our experiments, the model performs best when λ = 0.7, showing that the orthographical measure contributes relatively more than the phonetic measure, but that the latter is indispensable. This justifies the effectiveness of combining orthographical and phonetic measures, and suggests that the human normalization process is properly modeled.

Our exceptions
Closer observation of our normalization results shows that there are several types of exceptions beyond our consonant-based rules. For example, thanks fails to be selected as a candidate for the non-standard word thx because the pronunciation of thanks contains an N but that of thx does not. The same happens when we process stong/strong because of the missing R. We believe that more consonant rules should be exploited and described more precisely.

Non-standard words involving multiple syllables
There is one type of transition that we cannot solve, such as acc/accelerate and bio/biology, because the mapping is between a single-syllable word and a multi-syllable word. We add possible standard syllables sw_0^{(i)} and sw_{k+1}^{(i)} to the head and tail of the original syllables, but this extended form fails to be assigned a high probability because the string edit distances are too large. We leave this problem for further research.

Annotation issue
Though similar, our results on LexNorm1.2 are better than those on LexNorm1.1. After scrutinizing the data, we notice that several issues in LexNorm1.1 are fixed in LexNorm1.2. So results of ours like meh/me (meaning that the non-standard word meh is corrected to me) are counted as wrong in LexNorm1.1 but as right in LexNorm1.2. Even in LexNorm1.2, some inconsistencies and errors remain. For example, our result buyed/bought is marked wrong in both datasets although it is actually correct. As another example, til is normalized to until in some cases but to till in others. This shows that the LexNorm test corpus is still imperfect. We call for systematic efforts to produce a standard dataset under a widely accepted guideline.

Conventions
Social media language often contains words that are culture-specific and widely used in daily life. Some words like congrats, tv and pic are included in several dictionaries. We also observed several transitions like atl/atlanta or wx/weather in the datasets. These kinds of conventional abbreviations pose great difficulty for our method. Normalization of these conventional non-standard words still needs further study.

Conclusion
In this paper, a syllable-based tweet normalization method is proposed for social media text normalization. Results on publicly available standard datasets justify our assumption that the syllable plays a fundamental role in social media non-standard words. The advantage of our proposed method lies in treating the syllable as the basic processing unit and in measuring similarity at the syllable level. This accords with human cognition in creating and understanding non-standard words in social media. Our method is domain independent. It is robust on non-standard words from any period of history. Furthermore, given a syllable transcription tool, our method can be easily adapted to a new language.