DCU-ADAPT: Learning Edit Operations for Microblog Normalisation with the Generalised Perceptron

We describe the work carried out by the DCU-ADAPT team on the Lexical Normalisation shared task at W-NUT 2015. We train a generalised perceptron to annotate noisy text with edit operations that normalise the text when executed. Features are character n-grams, recurrent neural network language model hidden layer activations, character class and eligibility for editing according to the task rules. We combine predictions from 25 models trained on subsets of the training data by selecting the most-likely normalisation according to a character language model. We compare the use of a generalised perceptron to the use of conditional random fields restricted to smaller amounts of training data due to memory constraints. Furthermore, we make a first attempt to verify Chrupała (2014)’s hypothesis that the noisy channel model would not be useful due to the limited amount of training data for the source language model, i.e. the language model on normalised text.


Introduction
The W-NUT Lexical Normalisation for English Tweets shared task is to normalise spelling and to expand contractions in English microblog messages (Baldwin et al., 2015). This includes one-to-many and many-to-one replacements as in "we're" and "l o v e". Tokens containing characters other than alphanumeric characters and the apostrophe are excluded from the task, as well as proper nouns and acronyms that would be acceptable in well-edited text. (The input, however, does not identify such tokens and unnecessarily modifying them is penalised in the evaluation.) To make evaluation easier, participants are further required to align output tokens to input tokens, e.g. when the four tokens "l", "o", "v" and "e" are amalgamated to the single token "love", three empty tokens must follow in the output. This is easy for approaches that process the input token by token but may require extra work if the input string is processed differently.
We participate in the constrained mode, which allows off-the-shelf tools but neither normalisation lexicons nor additional data. Furthermore, we do not use any lexicon of canonical English but learn our normalisation model purely from the provided training data.
Our approach follows previous work by Chrupała (2014) in that we train a sequence labeller to annotate edit operations that are intended to normalise the text when applied to the input text. However, while Chrupała uses conditional random fields for sequence labelling, we further experiment with a generalised perceptron. We also use a simple noisy channel model, with character n-gram language models trained on the normalised side of the training data, to select the final normalisation from a set of candidate normalisations. These candidates are generated from an ensemble of sequence labellers and by selectively ignoring some of the proposed edit operations.

Data Set and Cross-validation
The microblog data set of the shared task contains 2,950 tweets for training and 1,967 tweets for final testing. Each tweet is tokenised and the tokens of the normalised tweets are aligned to the input, allowing for one-to-one, many-to-one and one-to-many alignments.
For five-fold cross-validation, we sort the training data by tweet ID and split it into 5 sets of roughly the same number of tokens. (The number of tweets varies from 579 to 606.) Systems are trained on four sets and tested on the remaining set. Since the sequence labellers require a development set, we split the union of the four sets again into 5 sets to carry out nested cross-validation, training 25 models in total for each system.
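The nested split can be illustrated with a short Python sketch. The function names `contiguous_folds` and `nested_cv` are hypothetical, and the greedy cut rule is an assumption: the description above only states that the sets contain roughly the same number of tokens. The input is assumed to be a list of per-tweet token counts, already sorted by tweet ID.

```python
def contiguous_folds(token_counts, k=5):
    """Split item indices into k contiguous folds with roughly equal token totals.

    Assumption: a fold is closed as soon as the running token total reaches
    the next k-th of the overall total.
    """
    total = sum(token_counts)
    folds, current, seen = [], [], 0
    for idx, n in enumerate(token_counts):
        current.append(idx)
        seen += n
        if len(folds) < k - 1 and seen >= total * (len(folds) + 1) / k:
            folds.append(current)
            current = []
    folds.append(current)
    return folds


def nested_cv(token_counts, k=5):
    """Return k * k (train, dev, test) index triples for nested cross-validation."""
    runs = []
    for test in contiguous_folds(token_counts, k):
        test_set = set(test)
        # Union of the remaining k - 1 outer folds, in original order.
        rest = [i for i in range(len(token_counts)) if i not in test_set]
        inner = contiguous_folds([token_counts[i] for i in rest], k)
        for dev_positions in inner:
            dev = [rest[p] for p in dev_positions]
            dev_set = set(dev)
            train = [i for i in rest if i not in dev_set]
            runs.append((train, dev, test))
    return runs
```

With five outer and five inner folds this yields the 25 training configurations per system mentioned above.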

Feature Extraction
For extracting recurrent neural network language model features, we use Elman 1 (Chrupała, 2014), a modification of the RNNLM toolkit 2 (Mikolov et al., 2010; Mikolov, 2012) that outputs hidden layer activations. We use the off-the-shelf model from Chrupała (2014) 3 . The input is the sequence of characters of the tweet 4 in one-hot encoding. The network has a hidden layer with 400 neurons and it predicts the next byte. Following Chrupała (2014), we reduce the 400 activations to 10 binary features: we select the 10 most active neurons in order and apply a threshold (0.5) to the activation. The value of the i-th feature expresses which neuron was the i-th most active and whether its activation was below 0.5, e.g. the first feature states which neuron is most active and whether or not its activation is below 0.5. As there are 400 neurons and 2 possible binarised activations, there are 800 possible values. 5
Edit operations are extracted from the parallel training data by searching for the lowest edit distance and recording the edit operations with dynamic programming. We customise the edit cost function to always postpone insertions until after deleting characters so that each input character can be assigned exactly one edit operation from the set {do nothing, delete character, insert string before character}. To capture insertions at the end of the tweet, we append a NULL byte to all tweets.
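The per-character labelling scheme can be sketched in Python as follows. This is not the customised dynamic-programming alignment described above: as a stand-in, the sketch uses the alignment from the standard library's `difflib.SequenceMatcher`, which is not guaranteed to minimise edit distance. The label names `NONE`, `DEL` and `INS` and both function names are hypothetical.

```python
from difflib import SequenceMatcher

NUL = "\0"  # end-of-tweet marker so that trailing insertions have a host character


def extract_edit_ops(noisy, normalised):
    """Return one (op, payload) label per character of noisy + NUL."""
    src, tgt = noisy + NUL, normalised + NUL
    labels = [("NONE", "")] * len(src)
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            for i in range(i1, i2):
                labels[i] = ("DEL", "")
        if tag in ("insert", "replace"):
            # Postpone the insertion to the first character after the deleted
            # span; that character is kept and receives the inserted string.
            labels[i2] = ("INS", tgt[j1:j2])
    return labels


def apply_edit_ops(noisy, labels):
    """Execute the labels on noisy + NUL and strip the end marker again."""
    out = []
    for ch, (op, payload) in zip(noisy + NUL, labels):
        if op == "DEL":
            continue
        if op == "INS":
            out.append(payload)
        out.append(ch)
    return "".join(out).rstrip(NUL)
```

Because the appended end marker always aligns with itself, every insertion has a following character to attach to, and executing the extracted labels on the input reproduces the normalised string.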
The above setup, features and edit operations are identical to Chrupała (2014) to the best of our knowledge. We further add a character class feature {NULL, control, space, apostrophe, punctuation, digit, quote, bracket, lowercase letter, uppercase letter, non-ASCII, other} and a feature indicating whether the character is part of a token that is eligible for editing according to the shared task rules, i.e. whether or not the characters encountered since the last space or start of tweet are only letters, digits, apostrophes and spaces.
1 https://bitbucket.org/gchrupala/elman
2 http://rnnlm.org/
3 https://bitbucket.org/gchrupala/codeswitch/overview
4 More precisely, we process UTF-8 bytes. For the training data, this is the same as characters as the training set does not contain any multi-byte UTF-8 characters.
5 These RNN-LM hidden layer activation features have been used successfully in text segmentation and word-level language identification (Chrupała, 2013; Barman et al., 2014).
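As an illustration of the eligibility feature, the following Python sketch computes one flag per character. The function name `eligibility_flags` is hypothetical, and restricting "letters" and "digits" to ASCII is an assumption made here for simplicity.

```python
def eligibility_flags(tweet):
    """For each character, report whether all characters seen since the last
    space (or the start of the tweet) are letters, digits, apostrophes or spaces.

    Assumption: only ASCII letters and digits count as letters and digits.
    """
    flags = []
    eligible = True
    for ch in tweet:
        if ch == " ":
            eligible = True  # a space starts a new token
        elif not (ch.isascii() and ch.isalnum()) and ch != "'":
            eligible = False  # the rest of this token is no longer eligible
        flags.append(eligible)
    return flags
```

Note that the flag is computed left to right, so characters of a token that precede the first disallowed character are still marked as eligible.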

Sequence Labelling
For character-level sequence labelling, we try (a) Sequor 6 (Chrupała and Klakow, 2010), an implementation of the generalised perceptron (Collins, 2002), 7 with 10 iterations, and (b) the Wapiti 8 (Lavergne et al., 2010) implementation of conditional random fields (Lafferty et al., 2001) using L-BFGS optimisation with a history of 5 steps, elastic net regularisation ($\rho_1 = 0.333$ and $\rho_2 = 0.001$) and no hard limit on the number of iterations. We extend the feature templates of Chrupała (2014) 9 by including our additional two features. The template generates unigram, bigram and trigram character features within a +/-2 window. All remaining features are included as unigrams of the current value.
Due to the nested cross-validation (see above), Sequor is trained on 64% ($0.8^2$) of the training data, 16% ($0.8 \times 0.2$) is used as development set and 20% (1/5) for testing. For Wapiti, we use only 16% for training (and the remaining 64% as development set) in each cross-validation fold due to memory constraints. 10

Generating Candidates
We produce candidate normalisations from the edit operations proposed by the sequence model. However, if we allowed each insert and delete operation to be either realised or not, we would produce up to $2^N$ candidates, where $N$ is the number of edit operations. With $N = 140$ (the maximum length of a tweet), handling this many candidates is not feasible. Instead, we recursively split the sequence of edit operations produced by the sequence labeller into up to eight sections. To find good split points, we propose to minimise (1), where $e_L$ and $e_R$ are the number of insert or delete operations to the left and right respectively, and $s$ is the number of consecutive no-operations to the left. The first term tries to balance the number of edit operations on each side while the second term introduces a preference to not split clusters of edit operations.
6 https://bitbucket.org/gchrupala/sequor
7 The generalised perceptron has been shown to match performance of state-of-the-art methods in word segmentation, POS tagging, dependency parsing and phrase-structure parsing (Zhang and Clark, 2011).
For each section, we either use the edit operations produced by the sequence labeller or do not edit the section. As we split each sequence into no more than eight sections, we produce up to $2^8 = 256$ candidates. 11 Only one candidate, identical to the input, will be produced if there are no delete or insert operations and two candidates will be produced if there is just one delete or insert operation.
In training, we may potentially produce up to 5×256 = 1,280 candidates per tweet as the nested cross-validation gives us five sequence labellers per cross-validation run. During testing, up to 25 × 256 = 6,400 candidates may be produced. (The actual maximum number of candidates may be lower when labellers agree on the edit operations.)
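The enumeration of candidates from sections can be sketched in Python as below. The sketch assumes that the split points have already been chosen, so the split-point search is not shown. The label representation, a list of `(op, payload)` tuples with `("NONE", "")` for "do nothing", and the function name `candidate_label_sequences` are hypothetical.

```python
from itertools import product

KEEP = ("NONE", "")  # hypothetical label for "do nothing"


def candidate_label_sequences(labels, sections):
    """Enumerate candidate label sequences.

    labels:   per-character edit labels proposed by one sequence labeller
    sections: list of (start, end) index pairs covering the label sequence
    Each section either keeps its proposed edit operations or is left unedited.
    """
    # Only sections that contain at least one edit give rise to a choice.
    active = [(a, b) for a, b in sections
              if any(label != KEEP for label in labels[a:b])]
    candidates = []
    for choice in product((True, False), repeat=len(active)):
        candidate = list(labels)
        for use_edits, (a, b) in zip(choice, active):
            if not use_edits:
                candidate[a:b] = [KEEP] * (b - a)
        candidates.append(candidate)
    return candidates
```

With eight sections that each contain an edit operation, the loop produces 256 label sequences; running it for each labeller in the ensemble and pooling the results gives the candidate sets counted above.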

Applying Edit Operations
After producing candidate edit operation sequences that use subsets of the edit operations predicted by a sequence model, the edit operations are executed to produce candidate strings for the normalised tweets. As the shared task asks for tokenised output aligned to the input tokens, we apply the edit operations to each token in the following sequence:
1. Apply all edit operations at character positions that correspond to input tokens.
2. Apply insert operations recorded at the space between tokens and at the end of the tweet to the preceding token.
3. Apply delete operations at the space between tokens, moving the contents of the token to the right to the end of the token to the left, leaving behind an empty token. (Delete operations at the end-of-tweet marker are ignored.)
Due to time constraints, we do not attempt to improve the alignment of output tokens to input tokens.
11 Splitting the eight sections again would produce $2^{16}$ = 65,536 candidates.

Language Modelling
For language modelling, we train SRILM (Stolcke, 2002) on the normalised tweets of the training data. As we want to build character n-gram models and SRILM has no direct support for this, we re-format the candidate strings to make each character a token. To distinguish space characters from token separators, we represent them with double underscores.
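The re-formatting step can be sketched in Python as follows; the function names `to_char_tokens` and `from_char_tokens` are hypothetical, and the inverse function is added here only for illustration.

```python
def to_char_tokens(text):
    """Make each character a whitespace-separated token; spaces become '__'."""
    return " ".join("__" if ch == " " else ch for ch in text)


def from_char_tokens(line):
    """Invert to_char_tokens for a line produced by it."""
    return "".join(" " if tok == "__" else tok for tok in line.split(" "))


print(to_char_tokens("u r"))  # prints: u __ r
```

A literal underscore in the text remains distinguishable because it is emitted as the single-character token "_", whereas a space is the two-character token "__".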

Candidate Selection
We use the noisy channel model 12 to select the most plausible source $\hat{s}$ for the observed target $t$ from the set of candidates $S(t)$:
$$\hat{s} = \arg\max_{s \in S(t)} P(t|s) P(s) \qquad (2)$$
$P(s)$ is provided by the language model (Section 2.6). Standard models give high probability to making few or no edits. However, we trust our sequence models as Chrupała (2014) reported encouraging results. Therefore, we give high probability to using the predicted edit operations. We consider two models for $P(t|s)$, $P_1$ (Equation 3) and $P_2$ (Equation 4); both contain an "otherwise" case, to which $P_2$ assigns probability 0. Note that $P_1$ is not a proper probability model as there is never exactly one "otherwise" case but $2^i - 2$ cases, where $i$ is the number of sections considered in candidate generation, causing the total to be either 0.999 or between 1.001 and $0.999 + 0.001 \times (2^8 - 2) = 1.253$. $P_2$ effectively excludes the original input and all candidates that use only some but not all of the edit operations suggested by the sequence labellers. Since there are five sequence labellers per cross-validation fold due to nested cross-validation and 25 sequence labellers during testing, $P_2$ effectively selects between 5 or 25 candidates. 13
Table 1: Average language model perplexity over the five cross-validation runs for n-gram sizes n = 2, ..., 6 and smoothing methods WB = Witten-Bell, KN = Kneser-Ney and GT = Good-Turing. Standard deviation σ ≤ 0.23 for all configurations.
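The selection rule of the noisy channel model in Equation (2) can be sketched in Python in log space. The function name `select_source` and the two callables are hypothetical; in particular, the toy language model in the usage example is a placeholder and not the SRILM character model.

```python
def select_source(candidates, channel_logprob, lm_logprob):
    """Return the candidate s that maximises log P(t|s) + log P(s).

    channel_logprob(s): log P(t|s) for the fixed observed target t
    lm_logprob(s):      log P(s) from the source language model
    A candidate that the channel model excludes can be given float("-inf").
    """
    return max(candidates, key=lambda s: channel_logprob(s) + lm_logprob(s))


# Toy usage: a placeholder language model that simply prefers shorter strings.
candidates = ["u r", "you are"]
toy_lm = lambda s: -float(len(s))
uniform_channel = lambda s: 0.0
strict_channel = lambda s: 0.0 if s == "you are" else float("-inf")

best_uniform = select_source(candidates, uniform_channel, toy_lm)
best_strict = select_source(candidates, strict_channel, toy_lm)
```

With the uniform channel the language model alone decides, whereas the strict channel removes every candidate it scores with zero probability before the language model is consulted.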

Evaluation Measures
We evaluate our best systems using the evaluation script provided by the shared task organisers. It counts:
• The number of correctly modified tokens, i.e. tokens that need to be replaced by a new non-empty token and for which the system correctly predicts this token.
• The number of tokens needing normalisation, i.e. tokens that are modified in the gold output. However, again, tokens that are to be deleted are ignored, e.g. "l o v e" to "love" counts as one event only despite the replacement of three tokens with empty tokens.
• The number of tokens modified by the system, i.e. tokens for which a substitution with a non-empty token is proposed by the system.
Based on these numbers, precision, recall and F1-score are calculated, and we select the system and configuration to be used on the test set based on the highest average F1-score over the 5 cross-validation runs.
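For reference, the three counts map onto the measures as in the following Python sketch. It uses the standard definitions of precision, recall and F1; the function name and the handling of zero denominators are assumptions and may differ from the official evaluation script.

```python
def precision_recall_f1(correct, proposed, needed):
    """Compute precision, recall and F1 from the three token counts.

    correct:  number of correctly modified tokens
    proposed: number of tokens modified by the system
    needed:   number of tokens needing normalisation in the gold output
    """
    precision = correct / proposed if proposed else 0.0
    recall = correct / needed if needed else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Since F1 is the harmonic mean of the two measures, a system with high precision but low recall is pulled towards the lower of the two values.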

Results
We use character n-gram language models in the noisy channel model for candidate selection. To address sparsity of data that arises when test sentences contain n-grams that are rare or unseen in the training data, we try Witten-Bell, Kneser-Ney and Good-Turing smoothing. Table 1 shows average cross-validation perplexity for these three smoothing methods and n = 2, ..., 6. Over all five cross-validation folds, the language model that gives the lowest perplexity when trained on the training data and applied to the internal test set is the 6-gram model with Witten-Bell smoothing. This is consistent with the recommendation in the SRILM documentation to use Witten-Bell smoothing when the vocabulary is small, such as when building a character language model.
Table 2 shows cross-validation results for the four systems resulting from the choices between transition models $P_1$ and $P_2$ and using the Wapiti CRF or the Sequor generalised perceptron sequence labeller. The differences are not large in precision but for recall, the model $P_1$ performs poorly. The CRF also consistently has lower recall than the respective perceptron model. Interestingly, the CRF achieves the best precision. On F1-score, the best result is obtained with model $P_2$, which reduces the noisy channel model to selection between sequence modeller hypotheses, together with the Sequor sequence modeller.
Table 2: Average cross-validation results over the five cross-validation runs for transition models $P_1$ and $P_2$, W = Wapiti CRF sequence labeller (trained on only 16% of the training data), S = Sequor generalised perceptron sequence labeller (trained on 64% of the training data), P = precision, R = recall, F1 = F1 measure. Standard deviation σ ≤ 0.03 for all cells.
[...] n-best list before applying more complex models to token-level candidate selection.
On the final test set, our best system using P 2 and Sequor has precision 81.90%, recall 55.09% and F1 65.87%, placing it fifth out of six submissions in the "constrained" category.

Discussion
A possible explanation for the low recall obtained with the P 1 model is that this noise model cannot counter the effect that shorter sentences generally receive higher language model probability scores and therefore there is a tendency to reject edit operations that insert additional characters.
Furthermore, we observe that our system often assigns inserted text to the wrong evaluation units, e.g. inserting the string " laughing out" before the space before "lol" and then replacing the second "l" of "lol" with "ud". This is not wrong on the string level, but in the token-level evaluation, we make two errors: wrongly appending " laughing out" to the previous token and wrongly normalising "lol" to just "loud" instead of "laughing out loud".
Since the model P 1 did not come out best, we cannot reject Chrupała (2014)'s hypothesis that the noisy channel model would not be useful. However, our observations also do not provide much support for this hypothesis as we did not include standard models from previous work (Cook and Stevenson, 2009;Han et al., 2013) in our experiment.

Conclusions
We trained two sequence modellers to predict edit operations that normalise input text when executed and experimented with applying the noisy channel model to selecting candidate normalisation strings.
Future work should:
• Train the CRF on the full training data, either using a more memory-friendly (but possibly slower) optimisation method or using an even larger machine.
• Combine models with voting rather than language model score.
• For the noisy channel model, try standard models from previous work (Cook and Stevenson, 2009;Han et al., 2013).
• To better understand the selection preferences of the noisy channel model, compare the F1-score obtained when evaluating against the gold data to the F1-score obtained when evaluating the system output against its own input, i.e. are we biased towards doing nothing?
• Introduce a brevity penalty to counter the effect of selecting short candidate normalisations in the noisy channel model.
• Automatically revise the alignment to input tokens according to global co-occurrence statistics.
• Carry out a full error analysis of what the system does well and where it fails.