USZEGED: Correction Type-sensitive Normalization of English Tweets Using Efficiently Indexed n-gram Statistics

This paper describes the framework applied by team USZEGED at the "Lexical Normalisation for English Tweets" shared task. Our approach first employs a CRF-based sequence labeling framework to decide the kind of corrections the individual tokens require, then performs the necessary modifications relying on external lexicons and a massive collection of efficiently indexed n-gram statistics from English tweets. Our solution is based on the assumption that from the context of an OOV word, it is possible to reconstruct its IV equivalent, as there are users who use the standard English form of the word within the same context. Our approach achieved an F-score of 0.8052, the second best result among the unconstrained submissions, the category to which our submission belongs.


Introduction
Social media is a rich source of information which has proven useful for a variety of applications, such as event extraction (Sakaki et al., 2010; Ritter et al., 2012) or trend detection, including the tracking of epidemics (Lamb et al., 2013). Analyzing tweets, however, can pose several difficulties. From an engineering point of view, the streaming nature of tweets requires special attention to the scalability of the algorithms applied, and from an NLP point of view, the often substandard characteristics of social media utterances have to be addressed. The fact that tweets are often written on mobile devices and are informal makes misspellings, abbreviations of words and expressions, and the use of creative informal language prevalent, giving rise to a higher number of out-of-vocabulary (OOV) words than in other genres.

Related Work
The informal language of social media, including Twitter, is extremely heterogeneous, making its grammatical analysis more difficult compared to standard genres such as newswire. It has been shown previously that the performance of linguistic analyzers trained on standard text types degrades severely once they are applied to texts found in social media, especially tweets (Ritter et al., 2011; Derczynski et al., 2013).
One possible way to build taggers that perform more reliably on social media texts is to augment the training data with texts originating from social media (Derczynski et al., 2013). Such approaches, however, require considerable human effort, so a possible alternative is to normalize the social media texts first and then apply standard analyzers to the normalized texts. Recently, a number of approaches have been proposed for the lexical normalization of informal (mostly social media and SMS) texts (Liu et al., 2011; Liu et al., 2012; Han et al., 2013; Yang and Eisenstein, 2013).

Han and Baldwin (2011) rely on the identification of the words that require correction, then define a confusion set containing the candidate IV correction forms for such words. Finally, a ranking scheme taking multiple factors into consideration is applied, which selects the most likely correction for an OOV word. In their subsequent work, Han et al. (2012) propose an automated method to construct accurate normalization dictionaries.

Liu et al. (2011) propose a character-level sequence model to predict insertions, deletions and substitutions. They first collect a large set of noisy (OOV, IV) training pairs from the Web. These pairs are then aligned at the character level and provided as training data for a CRF classifier. The authors also released their 3,802-element normalization dictionary, on which our work also relies.

Yang and Eisenstein (2013) introduce an unsupervised log-linear model for the task of text normalization. Besides features that can be derived from pairs of words (e.g. edit distance), features considering the context are also employed in their model. As the number of class labels in that model is equal to the number of IV words an OOV word could possibly be corrected to (typically on the order of $10^4$-$10^5$, far beyond the typical label set size of classification tasks), the authors propose a Sequential Monte Carlo training approach for learning the appropriate feature weights.

The Task of Lexical Normalization
Formally, given an m-long sequence of words in the $i$-th tweet, $T_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,m}]$, participants of the shared task had to return a sequence of normalized in-vocabulary (IV) words, i.e. $S_i = [s_{i,1}, s_{i,2}, \ldots, s_{i,m}]$. The training set of the shared task consisted of 2,950 tweets comprising 44,385 tokens, while the test set had 1,967 tweets with a total of 29,421 tokens. According to the dataset, most of the words did not require any kind of correction: the proportion of unmodified words was 91.12% and 90.57% for the training and test set, respectively. Further details regarding the shared task can be found in (Baldwin et al., 2015).
As a consequence, we first built a sequence model to decide which tokens need to be corrected and in what way. A typical distinction of the correction types would be based on the number of tokens that a noisy token and its corrected form comprise. According to this approach, one could distinguish between one-to-one, one-to-many and many-to-one corrections on a per-token basis. Instead of applying the above types of corrections, however, we identified a more detailed categorization of the correction types and trained a linear-chain CRF utilizing CRFsuite (Okazaki, 2007). The correction types a token could be classified as were the following:

• MissingApos, standing for tokens that only differ from their corrected version in the absence of an apostrophe (e.g. youll → you'll),

• MissingWS, standing for tokens that only differ from their corrected version in the omission of whitespace characters (e.g. whataburger → what a burger),

• 1to1_{ED≤2}, standing for corrections where no whitespace characters had to be inserted and the augmented edit distance (introduced in Section 4.2) between the noisy token and its normalized form was at most 2 (e.g. tmrw → tomorrow),

• 1to1_{ED≥3}, standing for corrections where no whitespace characters had to be inserted and the augmented edit distance was at least 3 (e.g. plz → please),

• 1toM_{ABB}, standing for corrections where both whitespace and alphanumeric characters had to be inserted to obtain a token's corrected variant (e.g. lol → laugh out loud).
For the sake of completeness, we should add that a further class label (O) was employed. This, however, corresponded to the case when there was no correction required to be performed for a token. As mentioned above, more than 90% of the words in both the training and test sets belonged to this category. Table 1 shows the distribution of the correction types on both the training and test sets.
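To make the taxonomy concrete, the following sketch shows one way such correction-type labels could be derived from aligned (noisy, gold) token pairs of the training data. It is only an illustration: the helper augmented_edit_distance refers to the measure introduced in Section 4.2, and the exact rules of our labeling procedure may differ.

```python
def correction_type(noisy, gold, aug_edit_distance):
    """Assign one of the correction-type labels above to a (noisy, gold)
    token pair from the training data. aug_edit_distance is assumed to
    implement the augmented edit distance of Section 4.2."""
    if noisy == gold:
        return "O"                              # no correction needed
    if noisy == gold.replace("'", ""):
        return "MissingApos"                    # youll -> you'll
    if " " in gold:
        if noisy == gold.replace(" ", ""):
            return "MissingWS"                  # whataburger -> what a burger
        return "1toM_ABB"                       # lol -> laugh out loud
    dist = aug_edit_distance(noisy, gold)
    return "1to1_ED<=2" if dist <= 2 else "1to1_ED>=3"
```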

Proposed Approach
Our approach consists of a sequence labeling module and a normalization module that relies on lookups from an efficiently indexed n-gram corpus of English tweets. In the following, we describe these modules in detail.

Sequence Labeling for Determining Correction Types
As already mentioned in Section 3, the first component in our pipeline was a linear-chain CRF (Lafferty et al., 2001). Besides common word surface features, such as the capitalization pattern, the first letter or character suffixes, we relied on the following dictionary resources when determining the features for the individual words:

• the SCOWL dictionary, part of the aspell spell checker project, containing canonical English dictionary entries,

• the normalization dictionaries of Han et al. (2012) and Liu et al. (2012),

• the 5,307-element normalization dictionary derived from the portal noslang.com, which maps common social media abbreviations to their complete forms.
For each token, word type features were generated along with the word types of its neighboring tokens. The POS tags assigned to each token and its neighboring tokens by the Twitter POS tagger (Gimpel et al., 2011) were also utilized as features in the CRF model. The Twitter POS tag set was useful to us, as it contains a separate tag (G) for multi-word abbreviations (e.g. ily for I love you), which was expected to be highly indicative of the correction type 1toM_{ABB}.
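As an illustration, per-token features of this kind could be extracted roughly as follows, using the dict-based feature format accepted by python-crfsuite/sklearn-crfsuite. The resource arguments (scowl as a set of dictionary entries, norm_dicts as a list of token-to-form mappings) and the concrete feature names are assumptions, not our exact feature set.

```python
def token_features(tokens, pos_tags, i, scowl, norm_dicts):
    """Illustrative feature extractor for the i-th token of a tweet.
    scowl is a set of dictionary entries; norm_dicts is a list of
    normalization dictionaries mapping noisy tokens to canonical forms."""
    w = tokens[i]
    feats = {
        "lower": w.lower(),
        "first_char": w[0],
        "suffix3": w[-3:],
        "cap_pattern": "".join("X" if c.isupper() else "x" for c in w),
        "in_scowl": w.lower() in scowl,
        "pos": pos_tags[i],
    }
    for j, d in enumerate(norm_dicts):
        feats["in_norm_dict_%d" % j] = w.lower() in d
    # word-type and POS features of the neighbouring tokens
    for off in (-1, 1):
        if 0 <= i + off < len(tokens):
            feats["pos_%+d" % off] = pos_tags[i + off]
            feats["lower_%+d" % off] = tokens[i + off].lower()
    return feats
```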
In order to be able to discriminate the MissingWS class, we introduced a feature which indicates for a token $t$ originating from a tweet whether the relation $\max_{s \in split(t)} freq_{1T}(s) \geq \tau$ holds, where $\tau$ is a threshold calibrated to $10^6$ based on the training set, $freq_{1T}(s)$ is a function which returns the frequency value associated with a string $s$ according to the Google 1T 5-gram corpus, and the function $split(t)$ returns the set of all possible splits of token $t$ such that its components are all contained in the SCOWL dictionary. For instance, $split$("whataburger") returns a set of splits including "what a burger", "what a burg er" and "what ab urger". As there is a split (i.e. "what a burger") that is sufficiently frequent according to the n-gram corpus, we take it as an indication that the original token omitted some whitespace characters that need to be inserted.
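The feature can be computed along the following lines. This is only a sketch: the recursive split enumeration, the freq_1t lookup function and the max_parts limit (which merely keeps the enumeration small) are illustrative assumptions.

```python
def splits(token, dictionary, max_parts=3):
    """Enumerate all splits of token whose parts are all dictionary words;
    max_parts is an illustrative limit to keep the enumeration small."""
    token = token.lower()
    if token in dictionary:
        yield [token]
    if max_parts > 1:
        for i in range(1, len(token)):
            head = token[:i]
            if head in dictionary:
                for rest in splits(token[i:], dictionary, max_parts - 1):
                    yield [head] + rest

def missing_ws_feature(token, dictionary, freq_1t, tau=10**6):
    """True iff some multi-word split of the token is frequent enough
    according to the freq_1t lookup (Google 1T 5-gram counts)."""
    candidates = (" ".join(s) for s in splits(token, dictionary) if len(s) > 1)
    return any(freq_1t(c) >= tau for c in candidates)
```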
A CRF model with the above feature set was trained with the L-BFGS training method and L1 regularization, using CRFsuite (Okazaki, 2007). The performance this model achieved on the training and test sets is shown in Table 2 and Table 3, respectively. These tables reveal that the most difficult error type to identify was the one where a word missed some whitespace characters (row MissingWS). This class happens to be the least frequent and one of the most heterogeneous classes as well, which might explain the lower results for it.
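For concreteness, such a model can be trained with the python-crfsuite bindings roughly as below; the regularization weights shown are placeholders, not the values used in our submission.

```python
import pycrfsuite

def train_correction_type_crf(X, y, model_path="correction_type.crfsuite"):
    """Train the correction-type tagger. X is a list of tweets, each a list
    of per-token feature dicts (as sketched above); y holds the
    corresponding correction-type label sequences."""
    trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
    for xseq, yseq in zip(X, y):
        trainer.append(xseq, yseq)
    # illustrative regularization weights: L1 only, no L2 term
    trainer.set_params({"c1": 1.0, "c2": 0.0})
    trainer.train(model_path)

def predict_correction_types(tweet_features, model_path="correction_type.crfsuite"):
    """Tag one tweet (a list of per-token feature dicts) with correction types."""
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag(tweet_features)
```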

Augmented Edit Distance
When determining the set of candidate IV words that an OOV token might be rewritten to, it is common practice to place an upper bound on the edit distance between the IV candidates and the OOV word. In order to measure the edit distance between tokens originating from tweets and their corrected forms, we implemented a modification of the standard edit distance algorithm that is especially tailored to measuring the difference between OOV tokens originating from social media and IV ones. The edit distance we employed is asymmetric, as insertions of characters into OOV tokens have no cost. For instance, for the words tmrw and tomorrow, the edit distance is regarded as 0 if the former is considered to be the substandard OOV token and the latter the standard IV one. Note, however, that if the roles of the two tokens were swapped (i.e. if tmrw was treated as IV and tomorrow as OOV), their edit distance would become 4. A further relaxation of the standard edit distance is that we assign 0 cost to the following kinds of phonetically motivated transcriptions:

• z → s located at the end of words (e.g. in catz → cats),

• a → er located at the end of words (e.g. in bigga → bigger).
By making the above relaxations to the definition of the standard edit distance, we could obtain larger candidate sets for a given edit distance threshold with higher recall, as we could reduce the edit distance between OOV words and their appropriate IV equivalents in many cases. Obviously, as the candidate set grows, it might get increasingly difficult to choose the correct normalization from it. However, at this stage of our pipeline, we were more interested in having the correct IV word in the set of candidate normalizations than in reducing its size.
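A minimal sketch of such an augmented edit distance is given below. Here the zero-cost word-final transcriptions are handled as a preprocessing step rather than inside the dynamic program, which is one possible simplification; the exact implementation may differ.

```python
def augmented_edit_distance(oov, iv):
    """Asymmetric edit distance: characters that must be inserted into the
    OOV token to reach the IV form are free, while deletions and
    substitutions cost 1. Two phonetically motivated word-final rewrites
    (z -> s, a -> er) are treated as free via a preprocessing step."""
    oov, iv = oov.lower(), iv.lower()
    if oov.endswith("z") and iv.endswith("s"):
        oov, iv = oov[:-1], iv[:-1]
    elif oov.endswith("a") and iv.endswith("er"):
        oov, iv = oov[:-1], iv[:-2]
    n, m = len(oov), len(iv)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                  # deleting an OOV character costs 1
    # d[0][j] stays 0: inserting IV characters is free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if oov[i - 1] == iv[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete from the OOV token
                          d[i][j - 1],            # insert an IV character (free)
                          d[i - 1][j - 1] + sub)  # (mis)match
    return d[n][m]

# augmented_edit_distance("tmrw", "tomorrow") == 0
# augmented_edit_distance("tomorrow", "tmrw") == 4
# augmented_edit_distance("catz", "cats") == 0
```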

Making Use of Twitter n-gram Statistics
Our basic assumption was that from the context of an OOV word, it is possible to reconstruct its IV equivalent, as there are users who use the correct IV English form of the OOV word within the same context, e.g. see you tomorrow instead of see u tmrw. The Twitter n-gram frequencies we made use of were aggregated over the Twitter n-gram corpus augmented with demographic metadata described in (Herdağdelen, 2013).
For a given token $t_i$ at position $i$ in a tweet, we chose the most probable corrected form according to the formula

$\hat{s}_i = \arg\max_{c \in C(t_i, ct(t_i))} freq_T(t_{i-1}\ c\ t_{i+1})$,   (1)

where the function $C(t_i, ct(t_i))$ returns a set of IV candidates for the token $t_i$ according to $ct(t_i)$, the correction type determined for that token by the sequence model introduced in Section 4.1, and $freq_T$ denotes the frequency of a trigram in the Twitter n-gram corpus. We indexed the Twitter n-gram corpus with the highly effective LIT indexer (Ceylan and Mihalcea, 2011), which made fast queries of the form $t_{i-1}\ *\ t_{i+1}$ possible, the symbol $*$ being a wildcard standing for the candidate words. The only case when we did not choose the normalization of an OOV word according to (1) was when there was a unique suggestion for an IV word in the normalization dictionaries we listed in Section 4.1.
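A simplified sketch of this selection step follows. The functions candidates and trigram_freq stand in for the correction-type-sensitive candidate generation and the LIT-indexed n-gram lookups, respectively, and the sentence-boundary symbols are illustrative assumptions.

```python
def normalize_token(tokens, i, ctype, candidates, trigram_freq, norm_dicts):
    """Select the normalization of tokens[i] given its predicted correction
    type. candidates(token, ctype) plays the role of C(t_i, ct(t_i)),
    trigram_freq(left, c, right) that of the indexed Twitter n-gram lookups."""
    token = tokens[i]
    if ctype == "O":
        return token
    # a unique suggestion in the normalization dictionaries overrides (1)
    suggestions = {d[token] for d in norm_dicts if token in d}
    if len(suggestions) == 1:
        return suggestions.pop()
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    # equation (1): maximize the Twitter trigram frequency of (left, c, right)
    return max(candidates(token, ctype),
               key=lambda c: trigram_freq(left, c, right),
               default=token)
```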
The performance of the normalization on the training and test sets, broken down by the correction types we defined, can be found in Table 4 and Table 5, respectively. From these tables, one can see that the worst results were obtained for the correction type where whitespace characters had to be inserted into an OOV word. This is in accordance with the fact that our sequence model obtained the lowest scores on exactly this kind of correction. However, since this error category is the least frequent, the lower scores on it do not harm our overall performance much, as can be seen in Table 6 for both the training and test corpora. The results shown in Table 6 also illustrate that our approach seems to generalize well, as there is only a small gap between the performances observed on the training and test sets of the shared task.

            Training   Test
precision   0.8703     0.8606
recall      0.7673     0.7564
F1          0.8156     0.8052

Table 6: Overall performance of our system on the training and test sets

Conclusion
In this paper, we introduced our approach to the lexical normalization of English tweets, which ranked second among the unconstrained submissions of the shared task. Our framework first performs sequence labeling over the tokens of a tweet to predict which tokens need to be corrected and in what way. This step is followed by correction type-sensitive candidate set generation, from which the most likely IV normalization of an OOV word is selected by querying an efficiently indexed large n-gram dataset of English tweets.