LYSGROUP: Adapting a Spanish microtext normalization system to English.

In this article we describe the microtext normalization system we have used to participate in the Normalization of Noisy Text Task of the ACL W-NUT 2015 Workshop. Our normalization system was originally developed for text mining tasks on Spanish tweets. Our main goals during its development were flexibility, scalability and maintainability, in order to test a wide variety of approaches to the problem at hand with minimum effort. We will pay special attention to the process of adapting the components of our system to deal with English tweets which, as we will show, was achieved without major modifications of its base structure.


Introduction
The value of Twitter and other microblogging services as information sources in domains like marketing, business intelligence, journalism, etc. is obvious nowadays. Nevertheless, such amount of information can only be appropriately exploited through text mining techniques.
However, there are notable differences between "standard" language and the so-called texting used in those microtexts. In this kind of writing, it is important to reduce the number of characters used to fit the length restrictions while maintaining the readability of the message to some extent. To achieve this, most of the techniques applied rely on phonetics, thus being language-specific (López Rúa, 2007). For example: intentionally ignoring orthographic and grammar rules, as in "be like" for "am/is/are/was/were like" in the case of English or "asique" for "así que" in the case of Spanish; the usage of shortenings, contractions and abbreviations such as "c u" for "see you" in English or "ksa" for "casa" in Spanish; or the use of smileys to express emotions, for instance :) to express happiness. The resulting terms are called lexical variants (Han et al., 2013).
The problem is that, in general, text mining tools are very sensitive to those phenomena, as they are designed for dealing with standard texts. Therefore, it is necessary to normalize these texts before their processing, that is, to transform them into standard language. This way "c u nxt week", for example, would be transformed into "see you next week". This is the goal of the W-NUT 2015 Normalization Task (Baldwin et al., 2015).
The rest of this paper is organized as follows: Section 2 describes the core architecture of our system, and how it was adapted to fit this shared task, and Section 3 presents the resources used. Next, Section 4 evaluates the system and discusses the results obtained. Finally, Section 5 presents our conclusions and considers some possible future improvements for our system.

Architecture
Our tweet normalization system was developed taking as basic premises its flexibility, scalability and maintainability. As a starting point, we took a previous prototype for Spanish tweet normalization (Vilares et al., 2013) which, although fully functional, did not turn out to be as flexible and maintainable as expected. This could have become a problem for future developments, since the adaptation effort needed to integrate new techniques would have been too large, so we decided to refactor the whole system to solve this.
The general scheme of the original system mimics that of Han and Baldwin (2011) and comprises three stages:
1. Tweet preprocessing.
2. In-vocabulary word identification (IV), based on the lexicon of the system, obtaining as a result an initial set of out-of-vocabulary words (OOV).
3. OOV set processing in order to distinguish between correct words which are out of the system lexicon and proper lexical variants, obtaining for each one of the latter a normalized form. This last step can be in turn decomposed into two: the first one, which generates a set of possible normalization candidates based on the application of certain normalization techniques; and the second one, which selects one of these candidates as the normalized form (in our case, in a score-driven process).
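The three stages can be illustrated with a minimal sketch. It is written in Python for brevity (the system itself is written in Java), and the toy lexicon, the toy SMS dictionary, the scores and all function names are assumptions made for this example rather than parts of the actual system.

```python
# Toy resources; the real system relies on a full lexicon and several
# candidate generators (these values are illustrative assumptions).
LEXICON = {"see", "you", "next", "week"}
SMS_DICTIONARY = {"c": ["see"], "u": ["you"], "nxt": ["next"]}


def preprocess(text):
    """Stage 1: tokenize the tweet (lowercased whitespace split)."""
    return text.lower().split()


def find_oov(tokens):
    """Stage 2: indices of the tokens that are not in the lexicon."""
    return [i for i, token in enumerate(tokens) if token not in LEXICON]


def generate_candidates(token):
    """Stage 3a: propose (candidate, score) pairs.

    The original form is kept with a low score, so that correct words
    missing from the lexicon can survive unchanged.
    """
    candidates = [(token, 0.1)]
    candidates += [(form, 1.0) for form in SMS_DICTIONARY.get(token, [])]
    return candidates


def select_candidate(candidates):
    """Stage 3b: score-driven selection of the normalized form."""
    return max(candidates, key=lambda pair: pair[1])[0]


def normalize(text):
    tokens = preprocess(text)
    for i in find_oov(tokens):
        tokens[i] = select_candidate(generate_candidates(tokens[i]))
    return " ".join(tokens)
```

With these toy resources, `normalize("c u nxt week")` yields "see you next week", while an OOV with no candidates other than itself is left untouched.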
As for the particular normalization techniques employed throughout our system, we decided to try first a combination of two of the traditional approaches to this task (Kobus et al., 2008): the spell checking and the automatic speech recognition metaphors.

The pipeline
We decided to give our system an object-oriented approach (using Java) as opposed to the imperative approach of the original prototype (in Perl).
The new system is structured in processors, formerly known as modules in the prototype, whose goal is to apply a certain process to the input tweets so that we can obtain the normalization candidates of their terms at its output. The core component of our system is the pipeline, consisting of a classic cascade structure where we can insert an arbitrary number of processors and have their inputs and outputs automatically linked. In this way, the original input of the system becomes the input of the first processor, the output of the first processor is the input of the second one, the output of this second processor is the input of the third one, and so on, until reaching the last processor, whose output becomes the output of the system.
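The cascade described above can be sketched in a few lines. This is a Python illustration under simplifying assumptions (processors are plain callables, and the `Pipeline` class and its methods are hypothetical names), not the Java implementation of the system.

```python
class Pipeline:
    """Cascade of processors: the output of each one feeds the next."""

    def __init__(self, processors=None):
        self.processors = list(processors or [])

    def add(self, processor):
        """Append a processor and return the pipeline to allow chaining."""
        self.processors.append(processor)
        return self

    def run(self, data):
        # The system input is the input of the first processor; the
        # output of the last processor is the output of the system.
        for processor in self.processors:
            data = processor(data)
        return data


def tokenize(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]
```

For instance, `Pipeline().add(tokenize).add(lowercase).run("C U Nxt")` returns `["c", "u", "nxt"]`, and an empty pipeline returns its input unchanged.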
Regarding its design, we have followed good engineering practices and made extensive use of design patterns. Notable among them is the decorator pattern which, in our context, represents a simple pipeline, allowing us to dynamically stack an arbitrary number of processors. Its combination with the composite pattern lets us group them into stages, which enable the definition of particular processor sequences while still sharing the same basic processor interface, thus preserving the flexibility of the decorator. The resulting structure thus allows for the dynamic construction of different pipeline configurations of varying complexity and different levels of abstraction, without being restricted to the original settings.
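As an illustration of how stacking and grouping can share one interface, the following Python sketch combines a decorator-style wrapper with a composite-style stage. All class names are hypothetical and the two text operations are placeholders; the actual Java class hierarchy may differ.

```python
from abc import ABC, abstractmethod


class Processor(ABC):
    """Common interface shared by single processors and stages."""

    @abstractmethod
    def process(self, tweets):
        ...


class Source(Processor):
    """Innermost component: passes the input through unchanged."""

    def process(self, tweets):
        return list(tweets)


class ProcessorDecorator(Processor):
    """Decorator: wraps another processor and adds one step after it."""

    def __init__(self, inner):
        self.inner = inner

    def process(self, tweets):
        return self.apply(self.inner.process(tweets))

    @abstractmethod
    def apply(self, tweets):
        ...


class Lowercase(ProcessorDecorator):
    def apply(self, tweets):
        return [tweet.lower() for tweet in tweets]


class CollapseSpaces(ProcessorDecorator):
    def apply(self, tweets):
        return [" ".join(tweet.split()) for tweet in tweets]


class Stage(Processor):
    """Composite: a named group of processors used as a single one."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def process(self, tweets):
        for child in self.children:
            tweets = child.process(tweets)
        return tweets
```

Because `Stage` is itself a `Processor`, stages can be nested or mixed with single processors, e.g. `Stage("all", [Stage("pre", [Lowercase(Source())]), CollapseSpaces(Source())])`.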
The application of the template method pattern allowed us to factor out a large part of the processing common to the components, such as the sequential iteration through all the input tweets, which most of the processors perform. This made the code considerably more homogeneous, simplifying maintenance and allowing us to focus our efforts on the specific implementation of the processing methods in each case.
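A minimal Python sketch of this idea follows: the base class fixes the iteration over the tweets and subclasses only supply the per-tweet step. The class and method names are assumptions for the example, not those of the Java code.

```python
from abc import ABC, abstractmethod


class TweetProcessor(ABC):
    """Base class holding the logic shared by most processors."""

    def process(self, tweets):
        # Template method: the iteration is written once here...
        return [self.process_tweet(tweet) for tweet in tweets]

    @abstractmethod
    def process_tweet(self, tweet):
        # ...and each concrete processor only implements this hook.
        ...


class Lowercaser(TweetProcessor):
    def process_tweet(self, tweet):
        return tweet.lower()


class Tokenizer(TweetProcessor):
    def process_tweet(self, tweet):
        return tweet.split()
```

Adding a new processor then amounts to writing a single `process_tweet` method.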
Moreover, some processors make use of external tools that can be changed even at runtime, something of special interest in multilingual environments. It should also be possible to integrate the processors into other external components, so that their logic can be reused by others. All this involves decoupling the processors from the specific implementations of the external components employed, which we have achieved through the use of the inversion of control pattern.
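The decoupling can be sketched as follows in Python: the processor depends only on an abstract spell-checker interface and receives a concrete implementation from outside. The `SpellChecker` protocol, the dictionary-backed stand-in and the processor are hypothetical; in the real system the external tool would be, for example, aspell.

```python
from typing import Dict, List, Protocol


class SpellChecker(Protocol):
    """What the processor expects from any external spell checker."""

    def suggest(self, word: str) -> List[str]:
        ...


class DictSpellChecker:
    """Stand-in for an external tool, backed by a plain dictionary."""

    def __init__(self, table: Dict[str, List[str]]):
        self.table = table

    def suggest(self, word: str) -> List[str]:
        return self.table.get(word, [])


class SpellCheckProcessor:
    """Processor that never instantiates its spell checker itself."""

    def __init__(self, checker: SpellChecker):
        self.checker = checker  # injected dependency

    def candidates(self, token: str) -> List[str]:
        # Fall back to the original token when nothing is suggested.
        return self.checker.suggest(token) or [token]
```

Swapping languages at runtime then reduces to assigning a different checker, e.g. `processor.checker = DictSpellChecker({"polemik": ["polémica"]})`.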
Furthermore, communication between the components of the pipeline is done through structured text files, allowing us to gain flexibility as we can integrate and exchange with ease new processing modules regardless of their particular implementation (Vilares et al., 2013). In this case we have used XML along with an implementation of the abstract factory pattern for its construction and parsing. This also facilitates possible future migrations to other data representation languages, such as JSON.
Finally, we have created a dynamic configuration subsystem based on XML files that allows us to define and instantiate the particular structure of the pipeline on which we want to process the tweets. The advantages of such a subsystem are clear, both for system maintainability and testing:
1. It improves the multilingual support of the system by enabling the definition of configurations that use processors and resources designed for a particular language.
2. It allows for experimentation in a simple, agile and documented (the configuration file itself also serves as documentation) manner.
3. It avoids the necessity of modifying the system source code.
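To give an idea of how such a subsystem might work, the Python sketch below builds a pipeline from an XML description. The XML schema (`pipeline`, `processor`, `param`), the registry and the simplified processors are all assumptions made for this example; the actual configuration format of the system is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file contents.
CONFIG = """
<pipeline lang="en">
  <processor class="LowerCaseProcessor"/>
  <processor class="SMSDictionaryProcessor">
    <param name="entries" value="c=see,u=you"/>
  </processor>
</pipeline>
"""


class LowerCaseProcessor:
    """Simplified stand-in: lowercases every token."""

    def process(self, tokens):
        return [token.lower() for token in tokens]


class SMSDictionaryProcessor:
    """Simplified stand-in: replaces tokens found in a small table."""

    def __init__(self, entries=""):
        self.table = dict(pair.split("=") for pair in entries.split(",") if pair)

    def process(self, tokens):
        return [self.table.get(token, token) for token in tokens]


# Maps the class names used in the configuration to Python classes.
REGISTRY = {cls.__name__: cls for cls in (LowerCaseProcessor, SMSDictionaryProcessor)}


def build_pipeline(xml_text):
    """Instantiate the processors listed in the XML, in document order."""
    root = ET.fromstring(xml_text.strip())
    processors = []
    for node in root.findall("processor"):
        params = {p.get("name"): p.get("value") for p in node.findall("param")}
        processors.append(REGISTRY[node.get("class")](**params))
    return processors


def run(processors, tokens):
    for processor in processors:
        tokens = processor.process(tokens)
    return tokens
```

Changing the language or the order of the processors only requires editing the XML text, not the code that builds or runs the pipeline.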

Configuration before W-NUT 2015
The current processor configuration for Spanish tweet normalization derives from the one used by the initial prototype for its participation in the TweetNorm 2013 task (Alegría et al., 2013). The general procedure works like this: firstly, using processors to prepare the input (preprocessing); secondly, employing those whose purpose is to obtain new normalization forms (candidate generation); thirdly, using those in charge of selecting or filtering the best normalization forms (candidate filtering/selection); and lastly, employing those which prepare the final output of the system (postprocessing). This setup includes the following processors:
• FreelingProcessor, which reads the input data in the TweetNorm 2013 format and uses Freeling (Padró and Stanilovsky, 2012) to perform the tokenization, lemmatization and POS tagging (although these tags are not currently in use) of the text of the tweet.
• MentionProcessor, HashtagProcessor, URLProcessor and SmileyProcessor, which act as filters for OOVs we do not want to consider for normalization.
• PhoneticProcessor, which uses a phonetic table to map characters to their phonetic equivalent strings, such as "x" to "por".
• SMSDictionaryProcessor, which looks for normalization candidates in an SMS dictionary, for example "también" (too/also) for "tb".
• AspellProcessor, which obtains normalization candidates using the spell checker aspell (Aspell, 2011), as in "polémica" (controversy) for "polemik". It should be noted that this tool has been customised with a new phonetic table for Spanish, based on the Metaphone algorithm (Philips, 1990), and a new Spanish dictionary extracted from Wikimedia resources.
• AffixESProcessor, which identifies and normalizes affix-derived Spanish forms of base words, also supporting phonetic writing, as in the case of "chikiyo" for "chiquillo" (little boy), obtained from "chico" with the suffix "-illo" (little/small).
• NGramProcessor, which calculates the scores of those most likely normalization candidates according to the Viterbi algorithm (Manning and Schütze, 1999, Ch. 9) taking as reference the Web 1T 5-gram v1 (Brants and Franz, 2006) Spanish language model.
• CandidateProcessor, which selects the top-scoring candidate for each word.
• ResultProcessor, which dumps the tweet data obtained by the system to a file using the required format.
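The candidate selection step performed by the NGramProcessor can be illustrated with a small Viterbi search over the candidate sets of consecutive tokens. The Python sketch below uses a toy bigram table with made-up log-probabilities instead of the Web 1T 5-gram model, and the function names are assumptions for the example.

```python
def viterbi(candidates, bigram_logprob, start="<s>"):
    """Return the highest-scoring sequence, one candidate per position.

    `candidates` is a list with the candidate forms of each token and
    `bigram_logprob(prev, word)` returns a log-probability.
    """
    # best[c] = (score of the best path ending in c, that path)
    best = {start: (0.0, [])}
    for options in candidates:
        step = {}
        for cand in options:
            score, path = max(
                ((prev_score + bigram_logprob(prev, cand), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Toy bigram model (made-up values); unseen bigrams get a low score.
BIGRAMS = {
    ("<s>", "see"): -1.0, ("<s>", "sea"): -2.0,
    ("see", "you"): -0.5, ("sea", "you"): -6.0,
    ("you", "next"): -1.5, ("next", "week"): -0.3,
}


def toy_logprob(prev, word):
    return BIGRAMS.get((prev, word), -10.0)
```

Given the candidate sets `[["sea", "see"], ["you", "u"], ["next", "nxt"], ["week"]]`, the search selects "see you next week", since every other path contains at least one lower-scored bigram.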

Adaptation for W-NUT 2015
In general, the adaptation process revolved around implementing new processors and integrating new resources to account for the requirements of this new task, such as the use of English instead of Spanish or the new I/O data format, while leaving the base structure of the system untouched. This was precisely the main goal of the refactoring process at the beginning of this project. The resulting configuration includes the following new processors (see Section 3 for a description of the resources they use):
• WNUTTweetProcessor, which parses the structured input (now in JSON format instead of plain text) and obtains the system representation of the tweets.
• ArkTweetProcessor, which uses the ark-tweet-nlp POS tagger to obtain the morphosyntactic information of the input tweet tokens.
• WNUTFilterProcessor, which filters out all those terms that should not be normalized according to the task rules (mentions, hashtags, URLs, etc.) using regular expressions.
• LowerCaseProcessor, which takes all the candidate forms of a token and lowercases them.
• AspellCProcessor, a constrained version of the original AspellProcessor described in Section 2.2 (see Section 3 for further details).
• WNUTNgramProcessor, which is similar to the previous NGramProcessor but with some added modifications to fit the particularities of our new custom language model.
• WNUTResultProcessor, which dumps all tweet data generated by the system in the required output format (JSON).
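As an example of the kind of regular-expression filtering performed by a processor such as WNUTFilterProcessor, the Python sketch below marks mentions, hashtags and URLs as tokens that must not be normalized. The patterns are simplified assumptions and do not reproduce the expressions or the full set of rules actually used.

```python
import re

# Simplified patterns for tokens that should be left untouched.
PROTECTED = re.compile(
    r"""^(?:
          @\w+            # user mentions
        | \#\w+           # hashtags
        | https?://\S+    # URLs with an explicit scheme
        | www\.\S+        # URLs starting with www.
    )$""",
    re.VERBOSE | re.IGNORECASE,
)


def is_protected(token):
    """True if the whole token is a mention, a hashtag or a URL."""
    return PROTECTED.match(token) is not None


def normalizable_indices(tokens):
    """Indices of the tokens that later processors may normalize."""
    return [i for i, token in enumerate(tokens) if not is_protected(token)]
```

For the token list `["@bob", "c", "u", "#yolo", "http://t.co/x"]`, only positions 1 and 2 remain available for normalization.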
We show in Figure 1 a graphical representation of the architecture of the system both before (left side) and after (right side) the adaptation. Unfortunately, time limitations prevented us from implementing an English phonetic table for the PhoneticProcessor, which would have provided us with mappings such as "two", "too" or "to" for "2". To alleviate this, we did extend the SMS dictionary to cover some of these cases.
It should be noted that, because of those limitations, we did not address those cases where multiple contiguous tokens of the input tweet should be normalized into a single output token (i.e. the so-called "n-1 mappings"). Moreover, since that phenomenon was rare (it appeared in just 11 tweets out of 2950 of the training dataset), we considered that leaving this feature out would have little impact on the final performance of the system.

Integrated resources
The base resources we have used for this task, and on which most of the system processors rely, are the following:
• aspell (Aspell, 2011), the well-known spell-checker together with its default English dictionary.
• BerkeleyLM (Pauls and Klein, 2011), a Java library and toolset focused on language modeling.
• Redis, a NoSQL key-value datastore.
• The SMS normalization dictionaries, canonical lexicon and training dataset provided by the organizers of the task.
As a result of processing the previous resources, we have obtained the following additional ones:
• A global SMS normalization dictionary implemented as a Redis datastore, whose entries were extracted from the two normalization dictionaries and the training dataset provided by the organizers.
• A Kneser-Ney language model (Kneser and Ney, 1995) of the target domain (standard tweet text) obtained with the BerkeleyLM tools taking as input tweets of the training dataset.
• A new English dictionary for aspell built on the canonical lexicon.
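The construction of the global SMS normalization dictionary can be sketched as a merge of the available dictionaries with the token pairs observed in the training data. In this Python illustration a plain in-memory dictionary stands in for the Redis datastore, and the record layout (aligned `input` and `output` token lists) as well as the frequency-based ordering are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_sms_dictionary(dictionaries, training_records):
    """Merge variant -> standard form entries from all the sources.

    Returns, for each lexical variant, its normalized forms ordered
    from most to least frequently attested.
    """
    counts = defaultdict(Counter)
    # Entries coming from the normalization dictionaries.
    for dictionary in dictionaries:
        for variant, standard in dictionary.items():
            counts[variant.lower()][standard.lower()] += 1
    # Entries coming from token pairs of the training data that differ.
    for record in training_records:
        for source, target in zip(record["input"], record["output"]):
            if source.lower() != target.lower():
                counts[source.lower()][target.lower()] += 1
    return {
        variant: [form for form, _ in forms.most_common()]
        for variant, forms in counts.items()
    }
```

A variant such as "2" can thus accumulate several normalized forms ("to", "too"), leaving the final choice to the candidate selection stage.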
With respect to the differences existing between the configurations of the system for constrained and unconstrained runs, there is only one. In the case of the constrained run, since only off-the-shelf tools are permitted, the aspell spell checker was employed using its default dictionary but filtering its retrieved candidate corrections taking as reference the canonical lexicon; i.e. only those candidates that could be found in this lexicon were taken into account. On the other hand, in the case of the unconstrained run, aspell was used instead with the dictionary obtained from the canonical lexicon. The rest of the processors and their parameters remained the same. Moreover, although we also considered the use of the Web 1T 5-gram v1 language model in the unconstrained run, our preliminary tests showed that the results obtained were very poor in this case, as we further comment in Section 4.

Evaluation
Table 1 shows the results obtained for the training corpus. It should be noted that these correspond to a slightly overfitted system, since we inadvertently used a language model built using the whole training dataset (for candidate selection) in our 10-fold cross-validation framework. Nevertheless, this also gave us an interesting clue to the main performance bottleneck of our system, as we will discuss below.
Table 2 shows the results obtained for the test corpus. In view of these figures, which differ considerably from the previous ones, we decided to analyse them in more detail. For this purpose, [...] language model in use.
In this respect, tuning experiments were also made by extending our unconstrained configuration through the addition of the Web 1T 5-gram v1 English language model as a knowledge source. Only unigrams and bigrams could be used because of unsolved memory limitations. However, in contrast with previous experiments performed for Spanish, the resulting performance was unsatisfactory. Because of this, the use of this language model for our final submission was dismissed. According to our analysis, the cause seems to be the great differences, at both the lexical and syntactic levels, between the texts used to build this model, which could be considered as "regular" texts, and those corresponding to tweets, which agrees with the observations of Chrupała (2014). As illustrative examples of this type of expression we can take "I like them girls" and "Why you no do that?", which are lexically correct but not syntactically valid, so language models built using regular texts will not recognize them. In the case of our previous experiments on Spanish, this difference was not so clear.

Conclusions and Future work
We have presented in this work the tweet normalization system used by our group to participate in the W-NUT 2015 Normalization Task which, in turn, is an adaptation of another existing Spanish tweet normalization system.
Within the scope of this task, it became clear that most of the normalization mistakes made by our system occurred during the candidate selection stage, as it was unable to pick the correct normalization from the set of candidates obtained in previous stages. The reason is that we do not currently have enough training data to build a representative language model of the target domain (normalized text of English tweets).
Furthermore, there is another type of normalization phenomenon which, at this moment, cannot be correctly handled by our system: n-1 mappings. This is due to the initial approach we took for this system, which only considered 1-1 and 1-n mappings, but not n-1 mappings, together with our time limitations.
All that being said, as future lines of work we are considering the following improvements to our system:
• Obtaining a representative language model of the target domain by using a larger normalized tweet corpus. This corpus will consist of tweets without non-standard words, so we can still capture the morphosyntactic structure of these texts (Yang and Eisenstein, 2013).
• Using POS tags and syntactic information to improve the candidate selection process.
• Integrating a classifier in the extraction process of the final normalization candidates, taking as features aspects such as the syntactic and morphosyntactic information obtained, their probability according to the language model, whether they were selected or not by the Viterbi algorithm, their string and phonetic differences with respect to the original form, etc.
• Keeping the canonical lexicon updated using resources like Wikipedia, since the language model construction process relies heavily upon a good lexical reference in order to correctly discard non-standard words.
Moreover, we intend to study the application of tweet normalization, for both Spanish and English tweets, in opinion mining tasks (Vilares et al., 2015).