The Howard University System Submission for the Shared Task in Language Identification in Spanish-English Codeswitching

This paper describes the Howard University system for the language identification shared task of the Second Workshop on Computational Approaches to Code Switching. Our system builds on prior work on Swahili-English token-level language identification. It primarily uses character n-gram, prefix and suffix features, and letter-case and special-character features, along with previously existing tools. These are then combined with generated label probabilities for the immediate context of each token to produce the final system.


Introduction & Prior Approaches
The internet and social media have led to the emergence of new registers of written language (Tagliamonte and Denis, 2008). One effect of this has been the rise of written codeswitching as a common occurrence (Cárdenas-Claros and Isharyanti, 2009). The First Workshop on Computational Approaches to Codeswitching brought increased attention to this phenomenon. This paper is our submission for the shared task in token-level language identification in codeswitched data for the second such workshop. Our submission is for the Spanish-English language pair.
Our approach was informed particularly by the submissions to the previous shared task in language identification in codeswitched data. Most, if not all, of the previous approaches to word-level language identification utilized character n-grams as one of the primary features, including Nguyen and Doğruöz (2013) and all but one of the systems submitted to the previous shared task (Solorio et al., 2014; Volk and Clematide, 2014).

Data Description
Several thousand tweets were collected from Twitter and labeled by human annotators. Each token was labeled as English, Spanish, ambiguous (words like no that are valid in both languages and cannot be disambiguated by context), mixed (tokens with elements from both languages), foreign (words from other languages), a named entity, "unknown" (tokens like "asdfhg"), or "other". The "other" category includes numbers (unless they represent a non-numerical word, like <2> used for "to"), punctuation, Twitter @-mentions, URLs, emojis and emoticons. These tweets were divided into train, development and test sets and released¹ to the participants in the shared task. Basic statistics about the train, development and test sets can be seen in Table 2. As can be seen, the proportions of English and Spanish in the test set differ significantly from those in the other two sets. Systems were also evaluated at the tweet level. For this purpose, each tweet is considered either monolingual or codeswitched. A codeswitched tweet must contain tokens from at least two of the following categories: English, Spanish, mixed and/or foreign. All other tweets are considered monolingual.
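The tweet-level rule above can be sketched as follows; the label strings "en", "es", "mixed" and "foreign" are stand-ins for whatever label encoding the released data uses:

```python
# Tweet-level evaluation rule: a tweet is codeswitched if its tokens
# cover at least two of the {English, Spanish, mixed, foreign} labels.
CODESWITCH_LABELS = {"en", "es", "mixed", "foreign"}

def tweet_level_label(token_labels):
    present = CODESWITCH_LABELS & set(token_labels)
    return "codeswitched" if len(present) >= 2 else "monolingual"
```

Note that under this rule a tweet containing only "other" and "ambiguous" tokens, or tokens from a single language, counts as monolingual.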

Methodology
In another paper, also submitted to this conference, we experimented with a number of features for token-level language identification on mixed Swahili-English data (Piergallini et al., 2016). For this shared task, we modified our approach in a few ways due to the parameters of the task, and also explored the use of a few new features. These are described below:

1) Word

2) Character n-grams (1- to 4-grams)

3) Word prefixes and suffixes (length 1 to 4)

For features 1-3, we filtered out words, n-grams, prefixes and suffixes that occurred fewer than 25 times when training our model. N-grams, prefixes and suffixes of length three and four were also converted to lower case to reduce sparsity.
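As an illustration, features 2 and 3 might be extracted roughly as follows. This is a simplified sketch of the extraction and frequency filtering described above, not our exact implementation:

```python
from collections import Counter

def char_ngrams(word, n_min=1, n_max=4):
    """Character n-grams of a word; 3- and 4-grams are lower-cased
    to reduce sparsity, as described in the text."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            g = word[i:i + n]
            grams.append(g.lower() if n >= 3 else g)
    return grams

def affixes(word, max_len=4):
    """Prefixes and suffixes of length 1 to 4, lower-cased at lengths 3-4."""
    feats = []
    for k in range(1, min(max_len, len(word)) + 1):
        pre, suf = word[:k], word[-k:]
        if k >= 3:
            pre, suf = pre.lower(), suf.lower()
        feats.append(("pre", pre))
        feats.append(("suf", suf))
    return feats

def frequent_features(tokens, min_count=25):
    """Keep only n-gram features seen at least min_count times in training."""
    counts = Counter(g for w in tokens for g in char_ngrams(w))
    return {g for g, c in counts.items() if c >= min_count}
```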

4) English-Spanish dictionary
The dictionary feature checks the token against the English and Spanish dictionaries used in the GNU Aspell package² and marks it according to whether it appears in the English dictionary, the Spanish dictionary, both, or neither.

5) English POS tag

¹ Data was released by providing tweet ID numbers. Participants scraped the text of the tweets themselves. Since Twitter users may delete or restrict access to their tweets, not all participants may have had the exact same subset of the full data.
² Available here: https://github.com/WojciechMula/aspellpython

6) Spanish POS tag

The part-of-speech tags were generated by the Stanford POS tagger (Toutanova et al., 2003), accessed through NLTK. The Spanish tags were truncated to three characters to reduce sparsity.
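A minimal sketch of the dictionary feature (4), using small Python sets as stand-ins for the GNU Aspell English and Spanish word lists used in the actual system:

```python
# Illustrative word lists only; the real system queries GNU Aspell.
ENGLISH = {"the", "house", "no", "taco"}
SPANISH = {"la", "casa", "no", "taco"}

def dictionary_feature(token):
    """Return which dictionary (or dictionaries) the token appears in."""
    w = token.lower()
    in_en, in_es = w in ENGLISH, w in SPANISH
    if in_en and in_es:
        return "both"
    if in_en:
        return "en_only"
    if in_es:
        return "es_only"
    return "neither"
```

Words like no and taco fall into the "both" bucket, which is one reason the ambiguous label exists in the annotation scheme.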

7) Named entity tag
Tweets were labeled with the named entity recognition system described in Ritter et al. (2011). This system was developed for use on Twitter data.

8) Brown cluster and cluster prefixes
Brown clustering groups word types into a binary tree structure based on word context (Brown et al., 1992). Clusters tend to correlate with syntactic and semantic categories. They also correlate with language, since words of one language tend to co-occur with other words of the same language. To generate these clusters, we lower-cased all words and replaced all Twitter user names with "@username". We used 400 clusters based on the size of the data and the desire for some distinctions beyond basic word classes. Words that occur infrequently tend to be clustered noisily, so words that occurred fewer than 10 times were not given a cluster. To take advantage of the binary tree structure, we included features based on prefixes of the cluster path. For example, in our clusters, nodes beginning with <0> were mostly Spanish words, while nodes beginning with <11> were mostly English words.
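The cluster-prefix features might be computed as below; the word-to-path map and the example paths are purely illustrative, not taken from our 400-cluster model:

```python
# Illustrative map from word to Brown-cluster binary path.
CLUSTERS = {"casa": "0101", "house": "1101"}

def cluster_features(word):
    """Emit the cluster path and all of its prefixes for a word,
    exploiting the binary tree structure; uncovered (infrequent)
    words get no cluster features."""
    path = CLUSTERS.get(word.lower())
    if path is None:
        return []
    return [("cluster_prefix", path[:k]) for k in range(1, len(path) + 1)]
```

Short prefixes approximate coarse splits of the vocabulary (here, language-like splits such as <0> vs. <11>), while longer prefixes approach individual word classes.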
The remaining features are binary flags:

9) Is there a Latin alphabetic character?
10) Is there a Spanish-specific letter?
Spanish-specific characters are limited to the accented vowels, <ü> and <ñ>. These are strong indicators that a word is Spanish, but they do not all occur equally frequently, so collapsing them into a single feature reduces sparsity; for example, <ó> occurs approximately 40 times more frequently than <ü>. This is the most language-specific feature we use, since these characters occur extremely infrequently in English text compared to Spanish text. A language-independent conceptualization of this feature is whether the word contains a member of the relative complement of the set of English letters in the set of Spanish letters. Such a feature would not be useful in the other direction, since all 26 letters of the English alphabet are used in Spanish, particularly in online usage (<w> and especially <k> are not limited to loanwords in internet Spanish writing).
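Feature 10 reduces to a simple set-membership test; the character set below follows the description above:

```python
# Characters used in Spanish but not English: accented vowels, <ü>, <ñ>.
SPANISH_ONLY = set("áéíóúüñÁÉÍÓÚÜÑ")

def has_spanish_letter(token):
    """Binary flag: does the token contain a Spanish-specific letter?"""
    return any(ch in SPANISH_ONLY for ch in token)
```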
11) Is there a number character?

12) Is the token a numerical expression?
Feature 12 is true for tokens consisting entirely of digits, mathematical symbols, and characters used in expressions of time ("12:00"), currency (<$>), and the like.
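One plausible implementation of features 11 and 12; the exact character inventory for numerical expressions is an assumption here:

```python
import re

def has_digit(token):
    """Feature 11: does the token contain a number character?"""
    return any(c.isdigit() for c in token)

# Feature 12: tokens built entirely from digits, basic math symbols, and
# time/currency punctuation. The precise character set is an assumption.
NUMERIC_RE = re.compile(r'^[0-9+\-*/=%.,:$€£]+$')

def is_numeric_expression(token):
    return bool(NUMERIC_RE.match(token)) and has_digit(token)
```

Requiring at least one digit keeps pure punctuation tokens like "..." from firing feature 12.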

13) Is there an emoji Unicode character?
Since all tokens composed of emojis are labeled as "other", this feature does not rely on a particular emoji occurring in our training data to accurately classify tokens in the test data.
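A sketch of the emoji flag using Unicode code-point ranges; the ranges below cover the common emoji blocks but are an approximation, not necessarily the exact check used in the system:

```python
# Common emoji-bearing Unicode blocks (approximate coverage).
EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),  # misc symbols and pictographs
    (0x1F600, 0x1F64F),  # emoticons
    (0x1F680, 0x1F6FF),  # transport and map symbols
    (0x2600, 0x27BF),    # misc symbols and dingbats
]

def has_emoji(token):
    """Feature 13: does the token contain an emoji code point?"""
    return any(lo <= ord(ch) <= hi
               for ch in token
               for lo, hi in EMOJI_RANGES)
```

Because the check is range-based rather than lexicon-based, it generalizes to emojis never seen in training, which is the point made above.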
14) Does the token begin a tweet/sentence?

15) Is the first letter capitalized?

16) Are all of the other letters upper case?
17) Are all of the other letters lower case?
The last four features concern capitalization. They were added particularly to account for named entities, abbreviations, acronyms, etc., which are typically capitalized or written in all upper-case letters.
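The four case features (14-17) can be sketched for a single token as follows, with `token_index == 0` standing in for tweet/sentence-initial position:

```python
def case_features(token, token_index):
    """Binary capitalization features for one token."""
    letters = [c for c in token if c.isalpha()]
    first = letters[0] if letters else ""
    rest = letters[1:]
    return {
        "begins_sentence": token_index == 0,            # feature 14
        "first_upper": first.isupper(),                  # feature 15
        "rest_all_upper": bool(rest) and all(c.isupper() for c in rest),  # 16
        "rest_all_lower": bool(rest) and all(c.islower() for c in rest),  # 17
    }
```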
Since words at the beginning of sentences are frequently capitalized, eliminating what is usually a distinction between proper and common nouns, feature 14 should reduce the weight toward labeling such a word as a named entity. Finally, we used logistic regression with L2 regularization to generate label probabilities for tokens using various combinations of the first 14 features. The label probabilities of the previous and following tokens were then added to the feature vector for each token; tokens at the beginning or end of a tweet were given all-zero probabilities for the absent context. This was found to significantly improve performance in our work on Swahili-English codeswitching (Piergallini et al., 2016) and is simpler than a CRF. A second logistic regression model was then trained and applied to the final feature set.
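The context-probability step can be illustrated in isolation: given the first model's per-token label probabilities for one tweet, each token's vector is extended with its neighbors' probabilities, with zeros at the boundaries. This sketch omits the two logistic regression models themselves:

```python
def add_context_probabilities(tweet_probs):
    """Extend each token's label-probability vector with the previous and
    following tokens' probabilities; boundary tokens get all-zero context."""
    k = len(tweet_probs[0])
    zeros = [0.0] * k
    augmented = []
    for i, probs in enumerate(tweet_probs):
        prev = tweet_probs[i - 1] if i > 0 else zeros
        nxt = tweet_probs[i + 1] if i + 1 < len(tweet_probs) else zeros
        augmented.append(list(probs) + list(prev) + list(nxt))
    return augmented

# Toy example: three tokens, two labels (say, English vs. Spanish).
out = add_context_probabilities([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
```

The second model then consumes these augmented vectors, which lets it condition on the soft labels of the immediate context without the machinery of a sequence model.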

Results & Discussion
The results of various feature combinations on the development set are summarized in Table 3. Four of the labels are excluded from the table. None of our models ever predicted a token to be ambiguous, mixed or foreign, because these categories were all very rare in both the training and development data. Conversely, the "other" category was very easily predicted even by the baseline model, achieving F1 scores of about 99.8% in all configurations.
There is not much variation in accuracy across feature combinations. What can be seen is that adding the label probabilities of the previous and following words consistently adds about 2% to overall accuracy and improves performance on the English and Spanish categories. Part-of-speech tags and Brown clusters do not appear especially helpful. It is possible that POS tags could be more useful with a coarser tag set, or that the Brown clusters could be more useful with different pre-processing. The named entity recognizer does significantly improve performance on the named entity category, but it did not improve overall accuracy much.
For our predictions on the test data, we used features 2-7 and 9-14 with label probabilities on the word context. Results for our submitted predictions are summarized in Table 3. According to the released results, our system never correctly labeled a token as ambiguous or mixed, and it never labeled a token as foreign at all. There are two versions of the results: one on the original test data, and one excluding tweets that contained URLs. We overlooked URLs in designing our model since they never occurred in the training or development data, although our model would likely have labeled them correctly had they occurred in training. Nevertheless, we achieve an overall accuracy in line with other systems without correcting for this. When tweets containing URLs are excluded, we achieve the highest performance on several measures; those measures which were highest among submitted systems are noted in bold.
To improve our model, adding a feature or procedure for properly handling URLs would be the obvious first change. However, this does not account for all of the errors in our predictions.

Table 3: Performance of the final system on the test data

Notably, our system does poorly with ambiguous, mixed and foreign words. This is largely due to there being very few instances of these categories, but we also suspect that handling them would require special approaches to account for their particular features. For example, a mixed-language word would be expected to contain some n-grams found in both English and Spanish, but logistic regression cannot easily capture this type of pattern; a feature designed to represent the interaction between the English-like and Spanish-like features of a mixed word would be required. It is also possible that some tokens were mislabeled: in our examination, it seemed that the ambiguous and mixed categories were not consistently distinguished.

It is also evident that our system does much worse on named entities than on the other large categories. It could be that the tool we used did not have a comprehensive list of named entities (it missed "Orange Is the New Black", for example), and it was also trained only on English. Our case features might likewise be more powerful in combination than as separate binary features: there is an interaction between whether the first letter or all letters are upper or lower case and whether the word begins a sentence, and the algorithm we used cannot easily capture it. Exploiting that interaction could slightly improve performance on named entities. We would also note that English and Spanish do not consider the same types of words to be proper nouns, which may be the cause of some of the inconsistencies we noticed in the annotations.