Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth

We present a new method to bootstrap filter Twitter language ID labels in our dataset for automatic language identification (LID). Our method combines geo-location, original Twitter LID labels, and Amazon Mechanical Turk to resolve missing and unreliable labels. We are the first to compare LID classification performance using the MIRA algorithm and langid.py. We show that classifiers reach high accuracy on different versions of our dataset using only Twitter data, without ground truth, and with very few training examples. We also show how Platt Scaling can be used to calibrate MIRA classifier output values into a probability distribution over candidate classes, making the output more intuitive. Our method allows for fine-grained distinctions between similar languages and dialects and allows us to rediscover the language composition of our Twitter dataset.


Introduction
Every second, the Twitter microblogging webservice relays as many as 6,000 short written messages (less than 140 characters), called tweets, from people around the world (http://www.internetlivestats.com/twitter-statistics/). The tweets are created and viewed publicly by anyone with internet access. Tweets obtained from the Twitter API are tagged with metadata such as language ID and geo-location (Graham et al, 2014).
* This material is based upon work supported by the Defense Advanced Research Projects Agency under Air Force Contract No. (FA8721-05-C-0002 and/or FA8702-15-D-0001). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.
Currently there is a mismatch between the built-in language identification support provided by the Twitter API and the needs of the natural language processing (NLP) community. While there are around 7,000 human languages spoken today, only 34 of the most common languages are currently recognized and tagged by Twitter using automatic methods for language identification (LID). In addition to Twitter's low coverage of languages, Twitter's default language tags are not always accurate (Zubiaga et al, 2015; Lui and Baldwin, 2014; Bergsma et al, 2012), making it very challenging to obtain the necessary ground truth for training a language classifier.
Twitter data is linguistically diverse and has tremendous global reach and influence. Discriminating languages and dialects automatically is a critical pre-processing step for more advanced NLP applications (Dagli et al, 2016). Heavy, worldwide use of Twitter has created a very rich landscape for developing NLP applications such as support for disaster relief (Sakaki et al., 2010; Kumar et al., 2011), sentiment analysis (Volkova et al., 2013), as well as recognizing named entities (Ritter et al., 2011) and temporal reasoning for events and habits (Williams and Katz, 2012).
In this work we show how geo-location can be used to identify the language of a tweet when appropriate language tags are seemingly incorrect or absent. Specifically, we are interested in discriminating the similar languages English, Malay and Indonesian (en, ms, id) as well as dialects of Spanish from Europe and Mexico (es-ES, es-MX) and dialects of Portuguese from Europe and Brazil (pt-PT, pt-BR). Language names are represented using the ISO-639-2 language codes, with a 2-letter country abbreviation added for dialects. The methods we present in this paper provide a fast, low-cost approach to filtering Twitter LID labels. It is very important to have data with reliable language labels because it allows us to make fine-grained distinctions between dialects and similar languages, in order to expand the linguistic scope of NLP applications. This paper is organized as follows: Section 2 describes related work, Section 3 describes the data collection and preparation, Section 4 describes classification algorithms, Section 5 shows our re-annotation experiments and results, Section 6 presents results using Platt Scaling, and finally Section 7 presents discussion and future work.

Related Work
Language identification has a rich history in natural language processing (Cavnar and Trenkle, 1994; Dunning, 1994). Recently, many different language combinations have appeared in benchmark shared tasks, most notably in the DSL (Discriminating Similar Languages) Shared Tasks that began in 2014 (Lui et al, 2014; Zampieri et al, 2015; Malmasi et al, 2016). However, none of these shared tasks provides train/test data that is composed entirely of social media while also supporting the languages and dialects that we are interested in. Additionally, English is sometimes used by Twitter users within the country geo-boundaries of Indonesia and Malaysia. Therefore we cannot rely on user profile settings as in previous work (Saloot et al., 2016), including Kevin Scannell's ongoing Indigenous Tweets Project (http://indigenoustweets.com/), which relies on self-reported minority language usage but does not guarantee homogeneity of labeled language collections. Ranaivo-Malançon (2006) was the first to work on Malay-Indonesian LID using n-gram profiling and other linguistic features. While their work capitalizes on nuanced linguistic differences between Malay and Indonesian, it does not address whether this technique can be expanded to include English or dialect pairs, and classifier accuracy is not reported. We are also interested in discriminating dialects of Spanish and Portuguese, as these are widely spoken languages with important dialect distinctions (Zampieri et al, 2016; Çöltekin and Rama, 2016).
The 2014 DSL Shared Task was the first large-scale task for distinguishing between similar languages and dialects in a language group, including Malay/Indonesian, Brazilian Portuguese/Portuguese, and Spanish/Mexican Spanish. The data for this shared task was collected from the web, cleaned, and consists of 18,000 training sentences per language group. Performance results per language group are reported for the top 8 systems, with the best performing system, NRC-CNRC (Goutte et al, 2014), achieving overall accuracy between 91% and 99% on the language groups that we are interested in. Our work is distinct from the DSL Shared Tasks for language and dialect identification because we are interested in learning a classifier using only Twitter data, without ground truth, using very few training examples.

Data Collection
We collected tweets from Twitter using the 10% firehose that we obtained from GNIP between January 2014 and October 2014. The 10% firehose is a real-time random sampling of all tweets as they are relayed through the Twitter webservice. As part of their service, GNIP provided filtering with geo-tagging enabled, so that all of the tweets in our collection were geographically tagged with longitude and latitude, allowing us to pinpoint the exact location of the tweet. Initially, we collected over 25.6 million tweets during that time period. In our collection, 24 languages were automatically identified by the Twitter API using the ISO-639-2 and ISO-639-3 language codes. The most commonly occurring languages in our dataset were English, Spanish, Indonesian, and Portuguese. We note that our dataset did not contain any tweets initially identified as being in the Malay language. Figure 1 shows the distribution of languages relative to the overall collection. The language distribution in our data does not accurately represent the languages used on Twitter for two reasons: 1) Twitter's own language ID codes are not always accurate in identifying the language of a tweet, and 2) the distribution in Figure 1 represents the 10% geo-enabled firehose from GNIP collected during a specific time period. Furthermore, without adequate language ID technology and reliable language labels, the true distribution of languages on Twitter is not known with certainty.

Classification Algorithms
In this section we describe two classification algorithms that we used in our experiments. We compared performance of the MIRA algorithm with the popular pre-trained software called langid.py.

MIRA
Advances in statistical learning theory have made it possible to expand beyond binary classification with perceptrons (Rosenblatt, 1958) to multiclass online learners such as the Margin Infused Relaxed Algorithm (MIRA) from Crammer and Singer (2003). The MIRA algorithm is formulated as a multiclass classifier which maintains one prototype weight vector for each class. MIRA performs similarly to Support Vector Machines (SVM) but does not require batch training (Crammer et al, 2006).
For multiclass classification, MIRA is formulated as shown in equation (1):

$$c^{*} = \arg\max_{c} f_c(d) \tag{1}$$

$$f_c(d) = d \cdot w_c \tag{2}$$

where $c^{*}$ is the predicted class and $w_c$ is the weight vector which defines the model for class $c$. The output of the classifier, for each class, is the dot product between a document vector $d$ and the weight vector for that class, shown in equation (2). The predicted class is therefore chosen by selecting the argmax. The values for each class from equation (2) are neither normalized nor scaled, and so they do not represent a probability distribution over candidate classes. We discuss this in greater depth in Section 6 with regard to calibrating the classifier output.
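As a rough illustration of the per-class scoring and argmax described here, the following Python sketch keeps one sparse prototype vector per class and applies one simplified online update. The update rule is an assumption: it is a single-constraint, passive-aggressive style step whose size is capped by a hypothetical `slack` parameter, and it is not taken from the LLClass implementation used in this work.

```python
# Sketch of multiclass prediction plus one simplified MIRA-style update.
# Assumption: single-constraint update with the step size capped by `slack`;
# this is an illustration, not the LLClass implementation.

def dot(u, v):
    """Sparse dot product between two {feature: value} dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(feat, 0.0) for feat, val in u.items())

def scores(weights, d):
    """One raw score per class: dot product of document and class vector."""
    return {c: dot(w, d) for c, w in weights.items()}

def predict(weights, d):
    """Predicted class is the argmax over the raw class scores."""
    s = scores(weights, d)
    return max(s, key=s.get)

def mira_update(weights, d, gold, slack=0.001):
    """Update the prototype vectors in place for one training document."""
    s = scores(weights, d)
    # Highest-scoring class other than the gold class
    rival = max((c for c in weights if c != gold), key=s.get)
    loss = max(0.0, 1.0 - (s[gold] - s[rival]))  # unit-margin violation
    norm_sq = dot(d, d)
    if loss == 0.0 or norm_sq == 0.0:
        return 0.0
    tau = min(slack, loss / (2.0 * norm_sq))
    for feat, val in d.items():
        weights[gold][feat] = weights[gold].get(feat, 0.0) + tau * val
        weights[rival][feat] = weights[rival].get(feat, 0.0) - tau * val
    return tau
```

Because the update only touches the gold class and its strongest rival, each training step costs time proportional to the number of non-zero features in the document, which is one reason online learners of this kind do not need batch training.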
To train MIRA, we swept values for the margin slack (0.0005 to 0.00675) and number of training epochs (5 to 30). The value for training epochs denoted a hard stop for training iterations and served as the stopping criterion. The feature vectors contained log-normalized frequency counts for word and character n-grams, with values for n swept separately for words (1 to 5) and characters (1 to 5), to allow various word and character-level n-gram combinations. After sweeping all possible feature combinations, we report experiment results based on the highest achieved overall accuracy. Words were defined by splitting on whitespace and we did not do any pre-processing or text normalization of the original tweets, similar to Lui and Baldwin (2014). For MIRA we used the open-source software suite called LLClass, which proved useful for other types of text categorization tasks (Shen et al, 2013).
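The feature construction described above can be sketched in a few lines of Python. The sketch assumes that "log-normalized" means log(1 + count); the exact normalization is not specified here, and the function names and feature-key layout are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """All character n-grams of the raw, unnormalized text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All word n-grams, with words defined by splitting on whitespace."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def featurize(text, max_word_n=1, max_char_n=3):
    """Return a sparse {feature: value} vector of n-gram counts."""
    counts = Counter()
    for n in range(1, max_word_n + 1):
        counts.update(("w", n, g) for g in word_ngrams(text, n))
    for n in range(1, max_char_n + 1):
        counts.update(("c", n, g) for g in char_ngrams(text, n))
    # Assumed form of "log-normalized": log(1 + raw count)
    return {feat: math.log1p(k) for feat, k in counts.items()}
```

Sweeping `max_word_n` and `max_char_n` independently from 1 to 5 would reproduce the kind of grid of word and character n-gram combinations described in the text.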

langid.py
For comparison, we used the off-the-shelf tool langid.py from Lui and Baldwin (2012). This tool employs a multinomial naive Bayes classifier and an n-gram feature set. The n-gram features are selected using information gain to maximize information with respect to language while minimizing information with respect to data source. A pre-trained model also comes off-the-shelf and covers 97 languages, including the specific languages that we use for this work. At the time of this writing the pre-trained model does not include support for dialect distinction. While we did not sweep parameters for the langid.py software, as we wanted to evaluate off-the-shelf performance, we did use its built-in "label constraint" feature, which restricts the multinomial distribution to a specified set of target labels rather than all 97 supported languages. For example, in experiments involving English/Malay/Indonesian, we restricted the language label set to these three languages.
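The effect of a label constraint can be illustrated independently of any particular tool. The sketch below takes a hypothetical dictionary of per-language log-probabilities, drops every language outside the target set, and renormalizes what remains. It shows the general idea only and is not langid.py's internal code; in langid.py itself the constraint is applied through its `set_languages` call.

```python
import math

def constrain(log_probs, allowed):
    """Restrict {language: log-probability} to `allowed` and renormalize."""
    kept = {lang: lp for lang, lp in log_probs.items() if lang in allowed}
    if not kept:
        raise ValueError("no allowed language has a score")
    top = max(kept.values())  # subtract the max for numerical stability
    total = sum(math.exp(lp - top) for lp in kept.values())
    return {lang: math.exp(lp - top) / total for lang, lp in kept.items()}
```

With a three-way constraint such as `{"en", "ms", "id"}`, probability mass that an unconstrained model would assign to unrelated languages is redistributed over the three candidates.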

Re-Annotation Experiments
In this section we present our method to bootstrap filter our Twitter dataset, re-annotate the data, and arrive at ground truth labels. Our data processing technique is fast, easy, cheap, and independent of the classification algorithm. We also present classification results for each dataset using the MIRA and langid.py classifiers. All classification results are reported as the overall average accuracy with an 80/20 train/test split. Each experiment is based on N total tweets per target language, and classes were stratified irrespective of tweet length.

Exp 1: Twitter Labels
First for Experiment 1, we used Twitter API labels as ground truth for language classification. Unfortunately, our dataset did not contain Twitter LID labels for Malay, or the Portuguese and Spanish dialects.

Table 1: Exp 1 results using Twitter API language labels as ground truth (overall accuracy, %)

Languages        N/class   MIRA   langid.py
en, id           500       98.0   90.1
pt, es, en, id   500       93.5   85.95

The performance shown for the English/Indonesian pair in Table 1 is competitive with the DSL Shared Task performance for this language pair (Zampieri et al, 2016). We also used Twitter labels to evaluate multiclass classification for pt, es, en, id and note that the MIRA classifier outperforms langid.py for this set.

Exp 2: Geo-Boundary Filtering
In Experiment 2, we filtered our Twitter dataset by establishing geo-bounding boxes to geographically define countries where the language of interest is suspected to be most prominent. For example, we used the country Malaysia as a representative geo-source for Malaysian tweets. We used a free website to set up the latitudinal and longitudinal geo-bounding boxes around the countries, and there are alternative websites that provide similar geo-boundaries. Each bounding box corner was defined by a latitude/longitude coordinate pair corresponding to SW, NW, SE, NE. Multiple bounding boxes were used to approximate the shape of each country, and we made every effort to include major metropolitan cities within the bounds. In some cases, our bounding boxes were slightly overspecified or slightly underspecified depending on the geometric shape of the country, as shown for Portugal in Figure 2.
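A minimal Python sketch of this kind of geo-boundary filter is shown below. The box coordinates are hypothetical rough values chosen for illustration, not the boxes used in our experiments, and the sketch assumes that no box crosses the 180° meridian.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned bounding box given by its SW and NE corners."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east

# Hypothetical, rough boxes for illustration only; several boxes
# together approximate the shape of one country.
COUNTRY_BOXES = {
    "PT": [Box(36.9, -9.6, 39.5, -7.0), Box(39.5, -9.0, 42.2, -6.2)],
}

def country_of(lat, lon, boxes=COUNTRY_BOXES):
    """Return the first country whose boxes contain the point, else None."""
    for country, rects in boxes.items():
        if any(r.contains(lat, lon) for r in rects):
            return country
    return None
```

Using several small rectangles per country trades a little bookkeeping for a tighter fit than a single large box, which would otherwise pull in tweets from neighboring countries.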
We recognize that Twitter users in each of the geo-bounded countries are able to tweet in any language. Our data filtering method was based on the assumption that the majority of tweets from a country would be composed in that country's most common language. We calculated how frequently different Twitter API language labels occurred within the bounds of the target country to define a target label purity with respect to the expected majority language. This is the conditional probability of the target Twitter LID label occurring in the target country, shown in equation (3):

$$p(\text{label} \mid \text{country}) = \frac{\text{count}_{\text{label}}}{\text{count}_{\text{country}}} \tag{3}$$

Figure 2: Example of geo-bounding box to identify tweets that originated from Portugal

The majority of tweets originating from Malaysia were tagged as id and en. We observed a similar scarcity of Malay tweets in Twitter's publicly released language identification datasets. In fact, Malay tweets make up less than 0.001% of Twitter's uniformly sampled dataset despite API support for Malay language identification. Our estimates of label purity, in addition to Twitter's dataset coverage of Malay, emphasize the persisting need for automatic language disambiguation. We compared classifier performance using geo-boundary as a stand-in for ground truth labels, and our results are shown in Table 3.
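The label purity estimate is a simple ratio of counts, and the following Python sketch makes the computation explicit. The tweet records and their `country` and `lang` field names are hypothetical; in practice the country would come from the geo-boundary filter and the language from the Twitter API label.

```python
from collections import Counter

def label_purity(tweets, country, label):
    """Share of tweets located in `country` that carry Twitter LID `label`."""
    in_country = [t for t in tweets if t["country"] == country]
    if not in_country:
        return 0.0
    counts = Counter(t["lang"] for t in in_country)
    return counts[label] / len(in_country)
```

A purity well below 1 for the expected majority language, as we observed for Malaysia, signals that the API labels inside that boundary should not be trusted on their own.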

Exp 3: Geo Filtering + Twitter Labels
To generate ground truth in Experiment 3, we took the intersection of labels from geo-bounds and Twitter LID labels.

Table 4: Exp 3 results using combined geo-boundary definitions and Twitter LID labels

Exp 4: Mechanical Turk-Verified Labels
Finally, in Experiment 4 we further refined the ground truth labels obtained from earlier experiments. We verified the target language of tweets using Amazon Mechanical Turk Human Intelligence Tasks (HITs), using the same train/test data from Experiment 3 (before classification). Each HIT contained one tweet. We assigned 3 workers per HIT at the rate of $0.02 USD per HIT, and the total cost for MTurk annotation in this work was $360.00 USD. In an effort to ensure that workers were qualified for the task, we allowed only workers who had an MTurk approval rating above 95%; however, we did not administer a language performance test in this work. To complete a HIT, workers selected one answer to a multiple-choice question, described below, and we did not inform workers that the text was from Twitter.
Instructions: Please indicate which language the text is in. Some text snippets are full sentences while others are partial sentences or phrases. If the text contains more than one language, indicate that in your response. Note that you can ignore URLs, punctuation, and emoticons to decide the language. In order to be paid you must answer each question correctly.
The authors would like to note that this final statement of the instructions to workers was to motivate them to complete the task meaningfully. All workers who completed tasks in the allotted time frame were paid automatically.
Workers were asked to select one of the following three statements, where language X is the language label used for train/test in Experiment 3.
A1. The text is entirely composed in language X
A2. The text is composed in language X and at least one other language
A3. None of the text is composed in language X
The annotation results of our MTurk experiment are shown in Table 5. Columns A1, A2, and A3 show the frequency with which at least 2 of 3 human annotators agreed on the language condition. We began with 1000 tweets per language for annotation. If fewer than 2 annotators agreed on a condition, the HIT for that tweet was not counted in this analysis. This method of filtering both reduced the amount of data and increased our confidence in the labels as ground truth. Our analysis with MTurk shows that the majority of train/test tweets in Experiment 3 were composed entirely in the target language X, with some instances of code-mixing of two or more languages. We used the tweets verified by Mechanical Turk to learn another set of classifiers for Experiment 4, shown in Table 6. The number of tweets per language class is reduced in this dataset, because we used only tweets verified as being 100% in the target language (column A1 from Table 5).
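The 2-of-3 agreement rule can be sketched as follows in Python. The mapping from tweet IDs to worker answers is a hypothetical data layout; only the agreement threshold and the A1 filter come from the procedure described above.

```python
from collections import Counter

def agreed_answer(answers, min_votes=2):
    """Return the answer chosen by at least `min_votes` workers, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes >= min_votes else None

def verified_tweets(annotations):
    """Keep tweet IDs whose agreed answer is A1 (entirely in language X)."""
    return [tid for tid, answers in annotations.items()
            if agreed_answer(answers) == "A1"]
```

Tweets on which all three workers disagree yield `None` and are dropped, which is what shrinks the dataset while raising confidence in the remaining labels.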

MIRA Classifier Calibration
Classifier output scores for MIRA and similar algorithms, like SVM, do not correspond to probabilities: the scalar values for each class do not represent a probability distribution over the classes. For example, the raw score cannot tell the researcher or end user whether a tweet is 80% likely to be English or 50% likely to be English. The ability to transform raw classifier scores into probabilities is very important if the technology is to be used as a consumable for text analytics or as part of an advanced NLP pipeline. In this section, we show how we calibrated scores using output from the MIRA classifier for 3 different experiments from Section 5. We used a technique called Platt Scaling, which learns a logistic regression from the raw score output of the MIRA classifier. The Platt Scaling technique provides us with a probability distribution over classes and is easy to train and test. For our reliability plots and calibration, we used classifier output scores of test sets from the experiments described in Section 5. For brevity, we describe classifier scaling using results for one language pair: Indonesian and Malay.

Score Reliability Plots
Reliability plots show how well a classifier's output is calibrated when the true probability distribution for classes is not known (Niculescu-Mizil and Caruana, 2005; Zadrozny and Elkan, 2002; DeGroot and Fienberg, 1983). For this visualization, the classifier output scores, also called predicted values, are normalized between 0 and 1 and then binned into 10 bins. The values plotted are the binned scores s versus the conditional probability of correct class prediction given the score, P(c | s(x) = s). A classifier that is well-calibrated will have values that fall close to the diagonal line x = y.
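The points of a reliability plot can be computed with a short Python sketch such as the one below. It assumes the scores have already been normalized to the range 0 to 1 and that a parallel list records whether each prediction was correct; both input names are hypothetical.

```python
def reliability_bins(scores, correct, n_bins=10):
    """Return (mean score, fraction correct, count) for each non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    sizes = [0] * n_bins
    for s, ok in zip(scores, correct):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        sums[b] += s
        hits[b] += int(ok)
        sizes[b] += 1
    return [(sums[b] / sizes[b], hits[b] / sizes[b], sizes[b])
            for b in range(n_bins) if sizes[b]]
```

Plotting the first element of each tuple against the second gives the reliability curve; for a well-calibrated classifier the points lie near the diagonal.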
We normalized the raw classifier output values so that the scores fell between 0 and 1, using exponent-normalization as in equation (4), for a given tweet:

$$\exp_c = \frac{e^{s_c}}{\sum_{c'} e^{s_{c'}}} \tag{4}$$

where $\exp_c$ is the normalized score for class $c$, and $s_c$ is the raw classifier output score for class $c$. Dividing by the sum over classes ensures that the normalized class scores for a given tweet sum to 1.
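Exponent-normalization of this kind can be sketched in Python as follows. Subtracting the maximum score before exponentiating is an added numerical-stability step, not something stated in the text; it cancels in the ratio and leaves the result unchanged.

```python
import math

def exp_normalize(raw):
    """Map {class: raw score} to exponentiated scores that sum to 1."""
    top = max(raw.values())  # shift by the max; it cancels in the ratio
    exps = {c: math.exp(s - top) for c, s in raw.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}
```

The output looks like a probability distribution, but it is only a rescaling of the raw margins and carries no calibration guarantee.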
We created reliability plots for the id, ms prediction task from Experiments 2, 3, and 4. Figures 3-11 show the histogram distribution of normalized classifier scores with the corresponding reliability plot. Recall that each experiment was based on a different kind of ground truth. All of the reliability plots before Platt-scaling exhibit a sigmoidal distribution. The prevalence of this sigmoidal distribution is similar to findings from Niculescu-Mizil and Caruana (2005), who noted this shape for learning algorithms based on maximum margin methods, such as SVM. MIRA and SVM both use maximum margin principles and are known to perform similarly, with the additional benefit that MIRA does not require batch training because it is online (Crammer et al., 2006).

Platt Scaling
Platt scaling uses logistic regression to learn a mapping between classifier output scores and probability estimates (Platt, 1999). The output of Platt scaling is a probability distribution over candidate classes, rather than raw scores from the classifier, which are often non-intuitive and difficult to interpret (Zadrozny and Elkan, 2002). Platt scaling is traditionally used in binary problems, and is adapted to multiclass problems by developing the original classifier as an ensemble of one-vs-all classifiers, then fitting logistic regression for each binary model (Niculescu-Mizil and Caruana, 2005; Zadrozny and Elkan, 2002). We trained and tested logistic regression on a binary class problem with MIRA output using the logistic regression implementation in the Python scikit-learn library, which is designed to handle binary, one-vs-rest, and multinomial logistic regression (Pedregosa et al, 2011).
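To make the idea concrete, the following Python sketch fits a one-dimensional logistic regression on raw scores with plain gradient descent. It is a minimal stand-in written with the standard library only, not the scikit-learn solver we used, and it assumes a hypothetical single score per tweet (for example the margin between the ms and id scores) with label 1 for the positive class.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p(y=1 | s) = sigmoid(a * s + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_probability(score, a, b):
    """Calibrated probability of the positive class for one raw score."""
    return sigmoid(a * score + b)
```

Because the fitted mapping is monotonic in the raw score, a binary decision taken at probability 0.5 corresponds to a fixed threshold on the raw score, so calibration of this kind does not reorder the classifier's predictions.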
To build and evaluate logistic regression, we used the test data from our previous experiments, as in Section 6.1, and divided that data into train and test sets with an 80/20 split.
With each of the datasets, Platt-scaling tends to affect calibration probabilities for Indonesian tweets more than for Malay tweets. This is observed as Indonesian data points are closer to the diagonal line. At the same time, the Platt-scaling plots also reveal that predicted values, especially for Malay, are pushed closer to 0 and 1. For example, logistic regression will always correctly predict ms for Malay when the probability of Malay is > 0.5, but not for Indonesian. This could indicate a need for further data purification.
We examined the accuracy of logistic regression, where the predicted class is taken to be the argmax class probability. In Figure 12, the overall classification accuracy on each dataset is similar for MIRA with and without Platt-scaling. We think this is an important finding because it shows that LID classifier output can be converted into probability distributions without loss of accuracy.
What do scores look like for a given tweet? Table 7 shows example scores. The raw output scores from MIRA, while clearly separating binary classes, are not easily interpreted as a measure of certainty or probability. While the exponent-normalized scores do sum to 1, and appear to situate probability mass towards the predicted class, they are not true probabilities. The probabilities that are output during Platt-scaling are true probabilities and this method preserves the original MIRA classifier accuracy; thus it is a valid and meaningful technique, especially when language ID is a consumable pre-processing technology for NLP pipelines.

Discussion and Future Work
In this work, we showed that geo-bounding combined with "best-guess" language labels can be used to annotate language labels on easily confused language pairs and dialects, when ground truth is unreliable. In each experiment, we showed how our data purification method resulted in increasing accuracy and classifier performance for both classifiers, MIRA and langid.py. Further, our method to purify language labels is easy to implement for tweets that are geo-tagged with latitude and longitude. Once a model has been learned from geo-tagged tweets, the model can also be used for tweets that are not geo-tagged.
We uncovered hidden Malay tweets in our dataset with high accuracy. We also showed that MIRA is useful for LID, with performance accuracy near state-of-the-art on very few training examples without pre-processing or text cleaning.
While previous work has shown that Malay/Indonesian can be learned using 18,000 training sentences with accuracy as high as 99.6% (Goutte et al., 2014), our result of 90.5% accuracy, trained on only 1600 tweets, is competitive. We believe performance will further increase as more training examples are added with high-confidence ground truth labels. Using geo-bounding, we were also able to separate dialects of Spanish and Portuguese to achieve finer-grained distinctions at the dialect level, which the Twitter API does not currently provide.
The highest weighted MIRA n-gram features correspond to high-frequency characters in each target language, suggesting that MIRA is learning features of languages and not Twitter artifacts (URLs, hashtags, @mentions, emoticons, etc).
In future work, we want to explore other easily confused language pairs, such as Ukrainian and Russian. Also, since MIRA is well-formulated for multiclass classification, we are interested in seeing how well it performs on a large multi-language dataset that includes several easily confused language pairs. Sometimes a single tweet will be written in more than one language, for example with code-switching or code-mixing (Barman et al, 2014). We are especially interested in adapting the MIRA classifier for code-switching and language segmentation problems. In the case of code-switching, it may be possible to use raw scores from classifier output or the results of Platt-scaling to construct a model that predicts language mixture in a single utterance.