A Dataset and Classifier for Recognizing Social Media English

While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language—even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.


Introduction and Related Work
Language identification is the task of determining the major world language a document is written in. A range of supervised classification methods, often based on character n-gram features, achieve excellent performance for this problem on long, monolingual documents (Hughes et al., 2006). But short documents, such as Twitter messages, are much more challenging (Baldwin, 2012, 2014; Bergsma et al., 2012; Williams and Dagli, 2017). Compounding the challenge is domain mismatch: the types of casual language, dialectal language, and Internet-specific constructs found in social media are often not present in the standardized genres of training data for existing language identifiers. This is potentially especially problematic for language written by minority dialect speakers; for example, Blodgett et al. (2016) found that current language identification models had lower recall for tweets written in African-American English (AAE) than for those in standard English. This is not surprising given the domain mismatch: a survey of recent language identifiers shows that common sources of training data are Wikipedia, newswire (e.g. the Leipzig corpora), and government and legal documents such as EuroGov, EuroParl, or the Universal Declaration of Human Rights (Lui and Baldwin, 2012; King and Abney, 2013; Jaech et al., 2016; Kocmi and Bojar, 2017; Lui and Cook, 2013). A language identification system typically aims to classify messages as one of a few hundred major world languages, which are generally well-resourced mainstream language varieties with officially recognized status by major political entities; these language varieties typically have official ISO 639 codes assigned to them (which are returned by language identification software APIs).
Given the high linguistic diversity of messages in social media, it is tempting to imagine fine-grained dialect identification (for example, identifying messages written in AAE), but at the same time, the traditional task of identifying major world languages will continue to be useful (for example, an AAE message could be reasonably analyzed with general English language technologies). In this work we maintain the paradigm of treating English as a broad language category, but propose that the texts that match it ought to be broadened to include nonstandard, social media, and dialectal varieties of English.
If there were abundant language-annotated Twitter data, it would be straightforward to train an in-domain language identifier. But very little exists, since it is inherently time-consuming and expensive to annotate. Datasets are typically small, or semi-automatically tagged (Bergsma et al., 2012), which may bias them towards pre-existing standardized language.
A promising approach is to leverage large quantities of non-language-labeled tweets to help adapt a standard identifier to perform better on social media. If the messages are treated as unlabeled, this could be framed as an unsupervised domain adaptation problem, for which a number of approaches are available (Blitzer et al., 2006, 2007; Plank, 2009; Yang and Eisenstein, 2016).
We focus on a unique, and different, large-scale training signal: U.S. neighborhood-level demographics. There is considerable linguistic diversity within the U.S., and its geographic patterns have some rough correlation with different ethnic and race populations. Blodgett et al. analyzed them with a mixed membership model, in which messages written by authors living in areas heavy in a particular demographic group were more likely to use a unigram language model associated with that group, in order to focus on AAE. But they note their model found that non-English language tended to gravitate towards one of the latent language models, which was useful to better identify English spoken within the U.S. that a standard identifier missed.
We hypothesize that this generalizes beyond specific dialect populations within the U.S., testing whether this soft signal from the demographic model actually gives a better model of overall social media English. We evaluate as fairly and completely as possible; we first annotate a new dataset of uniformly sampled tweets for whether they are English versus non-English (§2). In §3, we apply Blodgett et al.'s model to infer U.S. demographic language proportions in new tweets, finding that when added as an ensemble to a pre-existing identifier, performance improves, including when paired with feature-based, neural network, and proprietary identifiers. Such ensembles perform better than in-domain training with the largest available annotated Twitter dataset, and also better than a self-training domain adaptation approach on the same dataset used to construct the demographic language model. Accuracy also increases for English messages from many different countries around the world.

Dataset and Annotation
We sampled 10,502 messages from January 1, 2013 to September 11, 2016 from an archive of publicly available geotagged tweets. We annotated the tweets with three mutually exclusive binary labels: English, Not English, and Ambiguous. These tweets were further annotated with descriptive labels:
• Code-switched: Tweets containing both text in English and text in another language.
• Ambiguous due to named entities: Tweets containing only named entities, such as Vegas!, and therefore whose language could not be unambiguously determined.
We excluded any usernames and URLs in a tweet from the judgment of the tweet's language, but included hashtags. Tables 1 and 2 contain the statistics for these labels in our annotated dataset. For all our experiments, we evaluate only on the subset of messages in the dataset not labeled as ambiguous or automatically generated, which we call the evaluation dataset.

Experiments

Training Datasets
We investigate the effect of in-domain and extra out-of-domain training data with two datasets. The first is a dataset released by Twitter of 120,575 tweets uniformly sampled from all Twitter data, which were first labeled by three different classifiers (Twitter's internal algorithm, Google's Compact Language Detector 2, and langid.py), then annotated by humans where classifiers disagreed. We reserve our own dataset for evaluation, but use this dataset for in-domain training. This dataset is only made available by tweet ID, and many of its messages are now missing; we were able to retrieve 74,259 tweets (61.6%). For the rest of this work, we call this the Twitter70 dataset (since it originally covered about 70 languages). In addition, following Jaech et al. (2016), we supplemented Twitter70 with out-of-domain Wikipedia data for 41 languages, sampling 10,000 sentences from each language.

Classifiers
We tested a number of classifiers, trained on a variety of domains and in some cases retrained, on our annotated dataset.
• CLD2: a Naive Bayes classifier with a pretrained model from a proprietary corpus; it offers no support for re-training.
• Twitter: the output of Twitter's proprietary language identification algorithm.
• Neural model: a hierarchical neural classifier that learns both character and word representations. It provides a training dataset with 41,250 Wikipedia sentence fragments in 33 languages (Jaech et al., 2016).
Self-training. We experimented with one simple approach to unsupervised domain adaptation: self-training with an unlabeled target domain corpus (Plank, 2009). We used langid.py to label the corpus of tweets released by Blodgett et al. (the same one used to train their demographic model), then collected those tweets classified with posterior probability greater than or equal to 0.98. We downsampled tweets classified as English to 1 million, yielding a total corpus of 2.2 million tweets. Since we did not have access to langid.py's original training data, we trained a new model on this data, then combined it as an ensemble with the original model, where a tweet was classified as English if either component classified it as English.
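The self-training procedure described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the `Labeler` callable is a hypothetical stand-in for a language identifier that returns a language code and a posterior probability, and the function and parameter names are assumptions introduced here.

```python
import random
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: maps a tweet to (language_code, posterior_probability).
# In the paper this role is played by langid.py.
Labeler = Callable[[str], Tuple[str, float]]


def build_self_training_corpus(
    tweets: Iterable[str],
    labeler: Labeler,
    min_posterior: float = 0.98,
    max_english: int = 1_000_000,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Keep confidently labeled tweets, then downsample the English ones."""
    english, other = [], []
    for text in tweets:
        lang, prob = labeler(text)
        if prob >= min_posterior:
            (english if lang == "en" else other).append((text, lang))
    if len(english) > max_english:
        english = random.Random(seed).sample(english, max_english)
    return english + other


def either_says_english(text: str, original: Labeler, retrained: Labeler) -> bool:
    """OR-ensemble: English if either component predicts English."""
    return original(text)[0] == "en" or retrained(text)[0] == "en"
```

Because the OR-ensemble can only turn non-English predictions into English ones, it can raise English recall but never lower it relative to the original model.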
Demographic prediction ensemble. Blodgett et al. describe applying a U.S. demographically-aligned language model as an ensemble classifier, using a mixed membership model trained over four demographic topics (African-American, Hispanic, Asian, and white). For this classifier, tweets are first classified by an off-the-shelf classifier; if a tweet is classified as English, the classification is accepted. Otherwise, the off-the-shelf classifier is overridden and the tweet is classified as English if the total posterior probability of the African-American, Hispanic, and white topics under the demographic model is at least 90%. Table 3 lists these ensembles as "+ Demo". Blodgett et al. found the classifier seemed to improve recall, but this work better evaluates the approach with the new annotations.
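The override rule above can be written as a short decision function. The sketch below assumes the demographic model's output is available as a mapping from topic name to posterior probability; the topic keys and the function name are hypothetical and chosen only for illustration.

```python
from typing import Mapping

# Hypothetical topic keys; the Asian topic is deliberately excluded from the sum,
# following the rule described in the text.
ENGLISH_ALIGNED_TOPICS = ("african_american", "hispanic", "white")


def demo_ensemble_is_english(
    base_says_english: bool,
    topic_posteriors: Mapping[str, float],
    threshold: float = 0.90,
) -> bool:
    """Accept the base classifier's English decisions; otherwise override
    to English when the three topics' total posterior reaches the threshold."""
    if base_says_english:
        return True
    mass = sum(topic_posteriors.get(topic, 0.0) for topic in ENGLISH_ALIGNED_TOPICS)
    return mass >= threshold
```

For example, a tweet rejected by the base classifier but assigned posteriors of 0.50, 0.30, and 0.15 to the three topics (0.95 in total) would be relabeled as English, while one with most of its mass on the Asian topic would keep its non-English label.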

Length-Normalized Analysis
From manual inspection, we observed that longer tweets are significantly more likely to be correctly classified; we investigate this length effect by grouping messages into five bins (shown in Table 6) according to the number of words in the message. We pre-processed messages by fixing HTML escape characters and removing URLs, @mentions, emojis, and the "RT" token. For each bin, we calculate the recall of langid.py and of the demographic ensemble classifier with langid.py.
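A rough sketch of this pre-processing and binning is given below. Several details are assumptions made for illustration: the emoji pattern covers only some common Unicode blocks, words are counted by whitespace splitting, and the bin boundaries in `BIN_UPPER_BOUNDS` are hypothetical placeholders for the actual bins in Table 6.

```python
import html
import re
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT\b")
# Approximate emoji coverage (assumption): common pictograph and symbol blocks.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")

# Hypothetical upper bounds (in words) for the first four bins; the fifth is open-ended.
BIN_UPPER_BOUNDS = (5, 10, 15, 20)


def preprocess(text: str) -> str:
    """Unescape HTML entities, then strip URLs, @mentions, "RT", and emojis."""
    text = html.unescape(text)
    for pattern in (URL_RE, MENTION_RE, RT_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def length_bin(text: str, bounds: Sequence[int] = BIN_UPPER_BOUNDS) -> int:
    """Index of the length bin for a message, by whitespace-separated word count."""
    n_words = len(preprocess(text).split())
    for index, upper in enumerate(bounds):
        if n_words <= upper:
            return index
    return len(bounds)


def english_recall_by_bin(
    examples: Iterable[Tuple[str, bool, bool]],
) -> Dict[int, float]:
    """examples holds (text, gold_is_english, predicted_is_english) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for text, gold, predicted in examples:
        if gold:
            bin_index = length_bin(text)
            totals[bin_index] += 1
            hits[bin_index] += int(predicted)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Running `english_recall_by_bin` once with the base classifier's predictions and once with the ensemble's predictions gives the two recall columns to compare per bin.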

Results and Discussion
We evaluated on the 8,366 tweets in our dataset that were not annotated as ambiguous or automatically generated.
Table 5: Sample of tweets which were misclassified as non-English by langid.py but correctly classified by the demographic ensemble. @-mentions are shown as @username for display in the table.
Unsurprisingly, we found that training on Twitter data improved classifiers' English recall, compared to their pre-trained models. In our experiments, we found that recall was best when training on the subset of the Twitter70 dataset containing only languages with at least 1,000 tweets present in the dataset. We also found that the demographic model's predictions further improve on the gains from training on Twitter data. Notably, precision decreased by no more than 0.4% when the demographic model was added.
We also noted that pre-processing improved recall by 1 to 5%.
Proprietary algorithms. We found that neither CLD2 nor Twitter's internal algorithm was competitive with langid.py out of the box, in line with previous findings, but combining their predictions with demographic predictions did increase recall.
langid.py. Self-training langid.py produced little change compared to the original pre-trained model (rows (5) vs. (7)), despite its use of 2.2 million new tweets from the self-training step. We observed that even tweets that langid.py classified as non-English with more than 0.98 posterior probability were, in fact, generally English. This suggests that tweets are sufficiently different from standard training data that it is difficult for self-training to be effective. In contrast, simple in-domain training was effective: retraining it with the Twitter70 dataset achieved substantially better recall with a 5.4% raw increase compared to its out-of-domain original pre-trained model (rows (5) vs. (9)).
In all cases, regardless of the data used to train the model, langid.py's recall was improved with the addition of demographic predictions; for example, the demographic predictions added to the pre-trained model brought recall close to the model trained on Twitter70 alone, indicating that in the absence of in-domain training data, the demographic model's predictions can make a model competitive with a model that does have in-domain training data (rows (8) vs. (9)). Of course, in-domain labeled data only helps more (row (10)).
Neural model. Finally, the neural model performed worse than langid.py when trained on the same Twitter70 dataset (rows (9) vs. (15)), and its performance lagged when trained on its provided dataset of Wikipedia sentence fragments. As with the other models, demographic predictions again improve performance.
Table 5 shows a sample of ten tweets misclassified as non-English by langid.py and correctly classified by the demographic ensemble as English. Several sources of potential error are evident; many non-conventional spellings, such as partyyyyy and watchinf, do not challenge an English reader but might reasonably challenge character n-gram models. Similarly, common social abbreviations such as hml and fr are challenging.

Improving English Recall Worldwide
We further analyzed our English recall results according to messages' country of origin, limiting our analysis to countries with at least 100 non-ambiguous, non-automatically generated messages in our dataset. For each country's messages, we compared the recall from the best standalone langid.py model (trained on Twitter70) and the recall from the same model combined with demographic predictions, as shown in Table 4. Surprisingly, for ten of the fifteen countries we found that using demographic predictions improved recall performance, suggesting that the additional soft signal of "Englishness" provided by the demographic model aids performance across tweets labeled as English globally. In future work, we would like to investigate linguistic properties of these non-U.S. English tweets.
Table 6: Percent of the messages in each bin classified correctly as English or non-English by each classifier; t is the message length for the bin.
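The per-country comparison of English recall can be illustrated with the sketch below. The row layout and function name are assumptions introduced here: each row is taken to hold a country code, the gold English label, and the predictions of the standalone model and of the ensemble.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical row layout: (country, gold_is_english, base_pred, ensemble_pred).
Row = Tuple[str, bool, bool, bool]


def recall_by_country(
    rows: Iterable[Row], min_messages: int = 100
) -> Dict[str, Tuple[float, float]]:
    """English recall of the base model and the ensemble for each country
    that has at least min_messages evaluation messages."""
    rows = list(rows)
    messages = Counter(country for country, *_ in rows)
    english, base_hits, ensemble_hits = Counter(), Counter(), Counter()
    for country, gold, base, ensemble in rows:
        if gold:
            english[country] += 1
            base_hits[country] += int(base)
            ensemble_hits[country] += int(ensemble)
    return {
        country: (
            base_hits[country] / english[country],
            ensemble_hits[country] / english[country],
        )
        for country in english
        if messages[country] >= min_messages
    }
```

A country counts as improved when the second value in its pair exceeds the first.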

Improving Recall for Short Tweets
Our results from the length-normalized analysis, shown in Table 6, demonstrate that recall is lower on short tweets, particularly short English tweets; unsurprisingly, recall increases as tweet length increases. More importantly, the demographic ensemble classifier greatly reduces this gap between short and long tweets; while the difference in langid.py's recall performance between the shortest and longest English tweets is 16.5%, the difference is only 5.6% for the ensemble classifier. The gap is similarly decreased for non-English tweets. We note also that precision is consistently high across all bins for both langid.py and the ensemble classifier. The experiment indicates that the demographic model's signal of "Englishness" may aid performance not only for global varieties of English, but also for short messages of any kind.