Detecting Code-Switching between Turkish-English Language Pair

Code-switching (the alternating use of different languages within a single conversation) is a rapidly growing phenomenon in social media and colloquial usage, and it poses a range of challenges for natural language processing. This paper introduces the first study on the detection of Turkish-English code-switching, together with a small test data set collected from social media to pave the way for further studies. The proposed system, which uses character-level n-grams and conditional random fields (CRFs), obtains a 95.6% micro-averaged F1-score on the introduced test data set.


Introduction
Code-switching is a common linguistic phenomenon generally attributed to bilingual communities but also frequently observed among white-collar employees. In some regions of the world it is also associated with higher education (e.g., due to foreign language usage in higher education). Although the social motivation for code-switching is still under investigation and reactions to it differ (Hughes et al., 2006; Myers-Scotton, 1995), the challenges caused by its increasing use in social media are not negligible for natural language processing studies focusing on this domain.
Social media usage has increased tremendously, bringing with it several problems. Analysis of and information retrieval from social media sources are difficult because of the non-canonical language used (Han and Baldwin, 2011; Melero et al., 2015). The noisy character of social media texts often requires text normalization to prepare them for data analysis. Eryigit and Torunoglu-Selamet (2017) is the first study to introduce a social media text normalization approach for Turkish. In that study, as in Han and Baldwin (2011), the candidate word (solution) generation stage comes after an initial ill-formed word detection stage, which uses a Turkish morphological analyzer as the language validator. Although this approach works quite well for Turkish posts, it would clearly encounter difficulties with code-switching: the language validator would flag every foreign word as ill-formed, and the normalizer would try to propose a candidate correction for each of them. A similar situation can be observed in the behavior of spelling checkers in text editors. These also flag intentionally written foreign words as out of vocabulary and insist on proposing a candidate correction, which makes it harder for users to spot actual spelling errors.
In recent years, code-switching between Turkish and English has also become very frequent, specifically in the daily conversations and social media posts of white-collar workers and young people. Ex. (1) gives an example of a kind that is not unusual to see.
English version: For the update processes of our servers, we keep on searching for an expert in this domain.
To the best of our knowledge, this is the first study on the automatic detection of code-switching between Turkish and English. We introduce a small test data set composed of 391 social media posts, each consisting of code-switched sentences, together with a word-by-word manual annotation stating whether each word is Turkish or English. The paper presents our first results on this data set, which are quite promising, with a 95.6% micro-averaged F1-score. Our proposed system uses conditional random fields with character n-gram and word look-up features derived from monolingual corpora.

Related Work
Code-switching is a spoken and written phenomenon. Hence, its investigation by linguists started long before the Internet era, dating back to the 1950s (Solorio et al., 2014). However, research on code-switching in Natural Language Processing started more recently, with the work of Joshi (1982), which introduces a "formal model for intrasentential code-switching".
Analysis of code-switched data requires an annotated, multilingual corpus. Although collecting code-switched social media data is not an easy task, there have been valuable contributions. A Turkish-Dutch corpus (Nguyen and Dogruöz, 2013), a Bengali-English-Hindi corpus (Barman et al., 2014), Modern Standard Arabic-Dialectal Arabic, Mandarin-English, Nepali-English, and Spanish-English corpora for the First and Second Workshops on Computational Approaches to Code-switching (Solorio et al., 2014; Molina et al., 2016), a Turkish-German corpus (Çetinoglu, 2016), a Swahili-English corpus (Piergallini et al., 2016) and an Arabic-Moroccan Darija corpus (Samih and Maier, 2016) were introduced. Social media sources are preferred because social media users are not aware that their data are being analyzed and thus generate text in a more natural manner (Çetinoglu et al., 2016). To our knowledge, a Turkish-English code-switching social media corpus has not yet been introduced.
Word-level language identification of code-switched data has proved to be a popular research area with the ascent of social media. Das and Gambäck (2013) applied language detection to Facebook messages in mixed English-Bengali and English-Hindi. Chittaranjan et al. (2014) carried out language detection for code-switching by feeding character n-grams to a CRF, in addition to lexical, contextual and other special character features, and reached 95% labeling accuracy. Nguyen and Dogruöz (2013) identified Dutch-Turkish words using character n-grams and dictionary lookup as CRF features along with contextual features, reaching 98% accuracy and introducing new methods to measure corpus complexity. These studies mostly depend on monolingual training data (Solorio et al., 2014). In contrast, Lignos and Marcus (2013) used code-switched data for language modeling, working with a Spanish data set containing 11% English code-switched data. However, the use of code-switched data for training is problematic, since such data sets are generally small and may be insufficient for training (Maharjan et al., 2015).
Shared tasks on code-switching (Solorio et al., 2014; Molina et al., 2016) contributed greatly to the research area. The First Workshop on Computational Approaches to Code-switching (FWCAC) showed that, when the typological similarity between the two languages is high (Modern Standard Arabic-Dialectal Arabic (MSA-DA), for instance) and they share a large number of lexical items, the language identification task becomes considerably more difficult (Solorio et al., 2014). It is easier to identify languages when the two are not closely related (Nepali-English, for instance).

Language Identification Models
This section presents our word-level identification models, tested on the Turkish-English language pair.
In order to observe how corpora from different sources (formal or social media) affect language modeling and word-level language identification on social media texts, each of the two Turkish corpora, the Turkish National Corpus (TNC) and a Twitter corpus (TTC), is paired with the English training data (ETD), and the model perplexities are calculated against the code-switched corpus (CSC). For each token in the test set, the language label is decided by comparing the perplexities of the English and Turkish models.
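The perplexity comparison described above can be illustrated with a minimal sketch. The n-gram order (3), the add-one smoothing, the word-boundary padding symbols, the tie-breaking toward Turkish and the toy word lists are all assumptions made for illustration; they stand in for the real TNC/TTC and ETD corpora and are not the configuration used in the experiments.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.alphabet = set()

    def _pad(self, word):
        # "^" marks the word start and "$" the word end.
        return "^" * (self.n - 1) + word.lower() + "$"

    def train(self, words):
        for word in words:
            padded = self._pad(word)
            self.alphabet.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngram_counts[context + padded[i]] += 1
                self.context_counts[context] += 1

    def perplexity(self, word):
        padded = self._pad(word)
        # One extra slot in the smoothing denominator for unseen characters.
        vocab_size = len(self.alphabet) + 1
        log_prob = 0.0
        steps = 0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            prob = (self.ngram_counts[context + padded[i]] + 1) / (
                self.context_counts[context] + vocab_size
            )
            log_prob += math.log(prob)
            steps += 1
        return math.exp(-log_prob / steps)


def label_token(token, tr_lm, en_lm):
    # The language whose model is less "surprised" by the token wins;
    # ties go to Turkish (an assumption).
    return "T" if tr_lm.perplexity(token) <= en_lm.perplexity(token) else "E"


# Toy stand-ins for the monolingual training corpora (hypothetical data).
tr_lm = CharNgramLM()
tr_lm.train(["güncelleme", "için", "arıyoruz", "sunucu", "işlem", "uzman"])
en_lm = CharNgramLM()
en_lm.train(["update", "server", "expert", "process", "domain", "searching"])

print(label_token("güncel", tr_lm, en_lm))
print(label_token("server", tr_lm, en_lm))
```

With realistic corpora, the same comparison would be run once with a TNC-trained Turkish model and once with a TTC-trained one, which is how the effect of formal versus social media training data can be observed.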

Conditional Random Fields (CRF)
Conditional Random Fields (CRFs) perform effectively on sequence labeling problems in many NLP tasks, such as Part-of-Speech (POS) tagging, information extraction and named entity recognition (Lafferty et al., 2001). The CRF method was employed by Chittaranjan et al. (2014) for word-level language detection, using character n-gram probabilities, among others, as CRF features and reaching 80%-95% accuracy on different language pairs. In this research we also experiment with CRFs for word-level language identification, where language tagging is treated as a sequence labeling problem in which each word is labeled with either an English or a Turkish language tag. Our first CRF model, "CRF†", uses lexicon lookup (LEX) features, character n-gram language model (LM) features and combinations of these for the current and neighboring tokens (provided as feature templates to the CRF tool we use (Kudo, 2005)). The LEX features are two boolean features stating the presence of the current token in the English dictionary (ETD) or the Turkish dictionary (TNC or TTC). The LM feature is a single two-valued (T or E) feature stating the label assigned by our previous (Ch.n-gram) model introduced in §3.1.
Turkish is an agglutinative language. Turkish proper nouns are capitalized and an apostrophe is inserted between the noun and any following inflectional suffix. It is frequently observed that people who code-switch apply the same approach when using foreign words in their writing. Ex. (2) provides such an example usage:
(2) action'lar (code-switched version)
    eylemler (Turkish version)
    actions (English version)
In such circumstances, where intra-word code-switching occurs, it is hard for our character-level and lexicon look-up models to assign a correct tag, and the apostrophe sign may be a good clue for detecting these kinds of usages. In order to reflect this knowledge in our machine learning model, we added new features (APOS) to our last model, "CRFφ" (in addition to the previous ones). The APOS features are as follows: a boolean feature stating whether or not the token contains an apostrophe (') sign; a feature stating the language tag (E or T) assigned by the Ch.n-gram model to the word sub-part appearing before the apostrophe sign (this feature is assigned an 'O' (other) tag if the previous boolean feature is 0); and a final feature which is similar to the previous one but states whether this sub-part appears in one of the language dictionaries (E/T/O).
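As a rough illustration of the LEX, LM and APOS features described above, the sketch below builds the per-token feature values. The function and feature names, the toy lexica and the stub `toy_lm_label` (standing in for the Ch.n-gram model) are hypothetical. Two further assumptions are made where the text leaves the behavior open: only the ASCII apostrophe is checked, and a sub-part found in both dictionaries is reported as Turkish.

```python
def token_features(token, tr_lexicon, en_lexicon, lm_label):
    """Return LEX, LM and APOS feature values for a single token."""
    lowered = token.lower()
    feats = {
        # LEX: presence of the token in each monolingual dictionary.
        "lex_tr": int(lowered in tr_lexicon),
        "lex_en": int(lowered in en_lexicon),
        # LM: label (T or E) proposed by the character n-gram model.
        "lm": lm_label(lowered),
    }
    # APOS: apostrophe flag plus two features on the part before it.
    has_apostrophe = "'" in lowered
    feats["apos"] = int(has_apostrophe)
    if has_apostrophe:
        sub_part = lowered.split("'", 1)[0]
        feats["apos_lm"] = lm_label(sub_part)
        if sub_part in tr_lexicon:
            feats["apos_lex"] = "T"  # assumed precedence when in both lexica
        elif sub_part in en_lexicon:
            feats["apos_lex"] = "E"
        else:
            feats["apos_lex"] = "O"
    else:
        feats["apos_lm"] = "O"
        feats["apos_lex"] = "O"
    return feats


# Hypothetical stand-ins for the dictionaries and the Ch.n-gram model.
tr_lexicon = {"eylemler", "için"}
en_lexicon = {"action", "update"}


def toy_lm_label(text):
    return "E" if text in {"action", "update"} else "T"


print(token_features("action'lar", tr_lexicon, en_lexicon, toy_lm_label))
print(token_features("eylemler", tr_lexicon, en_lexicon, toy_lm_label))
```

In a CRF toolkit driven by feature templates, values like these would be written as columns next to each token, and the templates would then combine them across the current and neighboring tokens.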

Data
Our character-level n-gram models were trained on monolingual Turkish and English corpora retrieved from different sources. We also collected and annotated a Turkish-English code-switched test data set and used it both for testing our n-gram models and for training (via cross-validation) our sequence labeling model.
The monolingual English training data (ETD) was acquired from the Leipzig Corpora Collection (Goldhahn et al., 2012); it contains English text from news resources, written in formal language, with 10M English tokens. For the Turkish training data, two different corpora were used. The first corpus was artificially created using the word frequency list of the Turkish National Corpus (TNC) Demo Version (Aksan et al., 2012). TNC mostly consists of formally written Turkish words. The second Turkish corpus (TTC, 6M tokens) was extracted using the Twitter API, aiming to obtain a representation of the non-canonical user-generated context.
For the code-switched test corpus (CSC), 391 posts, all of which contain Turkish-English code-switching, were collected from Twitter and the Ekşi Sözlük website. The data was cross-annotated by two human annotators. A baseline assigning the label "Turkish" to all tokens in this data set would obtain a 72.6% accuracy score and a 42.1% macro-averaged F1-measure. Corpus statistics and characteristics are provided in Table 1.
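The two baseline figures are consistent with each other, as the following short check suggests. It assumes that a class which is never predicted (here English) receives an F1 of 0, and that the macro average is the unweighted mean of the two per-class F1 scores; the variable names are illustrative.

```python
# Share of Turkish tokens in the corpus, taken from the reported accuracy.
turkish_share = 0.726

# Labeling every token "Turkish": all Turkish tokens are recovered
# (recall 1.0), and precision equals the share of Turkish tokens.
precision_tr = turkish_share
recall_tr = 1.0
f1_tr = 2 * precision_tr * recall_tr / (precision_tr + recall_tr)

# English is never predicted, so its F1 is taken to be 0 (assumption).
f1_en = 0.0

macro_f1 = (f1_tr + f1_en) / 2
print(round(macro_f1 * 100, 1))  # 42.1
```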

Experiments and Discussions
We evaluate our token-level language identification models using precision, recall and F1 measures calculated separately for both language classes (Turkish and English). We also provide micro- and macro-averaged F1 measures for each model. The first row of the results table gives the baseline model, which assigns the label "Turkish" to all tokens, and the second row gives the results of a rule-based lexicon lookup (base LL), which assigns the language label of each word by searching for it in TNC and ETD, used as Turkish and English dictionaries. If a word occurs in both or neither of these dictionaries, it is tagged as Turkish by default. We observe from the results that the character-level n-gram models trained on a formal data set (TNC) fall behind our second baseline (with 88.7% macro avg. F1), whereas the one trained on social media data (TTC) performs better (91.1%). It can also be observed that the performance of the character n-gram language models turned out to be considerably high, aided by the fact that Turkish and English are morphologically distant languages and contain differing alphabetical characters, such as "ş, ğ, ü, ö, ç, ı" in Turkish and "q, w, x" in English.
The performance of the CRF models is calculated via 10-fold cross-validation over the code-switched corpus (CSC). One may see from the table that all of our CRF models outperform our baselines and character n-gram models. The best performance (95.6% micro and 94.5% macro avg. F1) is obtained with CRFφ trained with LEX + LM + APOS features. Contrary to the above findings with character-level n-gram models, we see that CRFφ performs better when TNC is used for character-level n-gram training and look-up. The use of TTC (monolingual Turkish data collected from social media), by contrast, yielded better results for the Ch.n-gram models and similar results for CRF†. This suggests that our hypothesis regarding the use of apostrophes in Turkish code-switching is well founded, and that validating the word sub-part before the apostrophe sign against a formally written corpus (TNC) yields better modeling.
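The lexicon lookup baseline and the averaged F1 measures can be sketched as follows. The function names and toy data are hypothetical, and the micro average is computed here from true positive, false positive and false negative counts pooled over both classes, which is one common definition and is assumed rather than taken from the text.

```python
def lookup_label(token, tr_lexicon, en_lexicon):
    """Rule-based lookup: English only if found solely in the English
    dictionary; words in both or neither default to Turkish."""
    lowered = token.lower()
    if lowered in en_lexicon and lowered not in tr_lexicon:
        return "E"
    return "T"


def f1_report(gold, pred, labels=("T", "E")):
    """Return per-class (precision, recall, F1), micro F1 and macro F1."""
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[lab] = (prec, rec, f1)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(labels)
    micro_p = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    micro_r = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    micro_f1 = (
        2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    )
    return per_class, micro_f1, macro_f1


# Toy example (hypothetical tokens, labels and dictionaries).
tokens = ["bu", "update", "için", "expert", "meeting"]
gold = ["T", "E", "T", "E", "E"]
tr_lexicon = {"bu", "için"}
en_lexicon = {"update", "expert"}

pred = [lookup_label(t, tr_lexicon, en_lexicon) for t in tokens]
per_class, micro_f1, macro_f1 = f1_report(gold, pred)
print(pred, round(micro_f1, 3), round(macro_f1, 3))
```

In this toy run "meeting" is in neither dictionary, so the default rule tags it as Turkish; this is the kind of error the default rule introduces for English words missing from the English dictionary.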

Conclusion
In this paper, we presented the first results on code-switching detection for the Turkish-English language pair. Motivated by social media analysis, we introduced the first data set, which consists of 391 posts with 5430 tokens (having ∼30% English words) collected from social media. Our first trials with conditional random fields revealed promising results, with 95.6% micro-averaged and 94.5% macro-averaged F1 scores. We see that there is still room for improvement in future studies, in particular to increase the relatively low F1 score (91.9%) on English. As future work, we aim to increase the corpus size and to test different sequence models such as LSTMs.