The Non-native Speaker Aspect: Indian English in Social Media

As the largest institutionalized second-language variety of English, Indian English has received sustained attention from linguists for decades. However, to the best of our knowledge, no prior study has contrasted web expressions of Indian English in noisy social media with English generated by a social media user base that is predominantly native speakers. In this paper, we address this gap in the literature by conducting a comprehensive analysis covering multiple structural and semantic aspects. In addition, we propose a novel application of language models to perform automatic linguistic quality assessment.


Introduction
Analyzing important issues through the lens of social media is a thriving field in computational social science (CSS) research. From policy debates (Demszky et al., 2019) to modern conflicts (Palakodety et al., 2020a), web-scale analyses of social media content present an opportunity to aggregate and analyze opinions at a massive scale. English being one of the widely spoken pluricentric languages (Leitner, 1992), a considerable fraction of current CSS research primarily analyzes content authored in English. Several recent lines of CSS research on Indian sub-continental issues (Palakodety et al., 2020a; Tyagi et al., 2020; Palakodety et al., 2020c) dealt with Indian English (Mehrotra, 1998), a regional variant of English spoken in India and among the Indian diaspora.
As the largest institutionalized second-language variety of English, Indian English has received sustained attention from linguists (Kachru, 1965; Shastri, 1996; Gramley and Pätzold, 2004; Sedlatschek, 2009) delineating multiple aspects in which Indian English is distinct from US or British English. However, these studies are largely confined to well-formed English written in formal settings (e.g., newspaper English; Dubey, 1989; Sedlatschek, 2009). The efforts so far in characterizing web expressions of Indian English are somewhat scattered, with isolated focus areas (e.g., code switching (Gella et al., 2014; Rudra et al., 2019; Khanuja et al., 2020; KhudaBukhsh et al., 2020a), use of swear words (Agarwal et al., 2017), and word usage (Kulkarni et al., 2016)) and little attention given to analyzing the range of spelling, grammar, and structural characteristics observed in web-scale Indian English corpora. Due to the deep penetration of cellphone technology into Indian society and the availability of inexpensive data (HuffPost, 2017), a user base with a wide range of English proficiency has access to social media. Hence, understanding to what extent spelling and grammar issues affect Indian English found on the social web, and how that compares and contrasts with typical noisy social media content generated by predominantly native English speakers, is an important yet underexplored research question.

* Ashiqur R. KhudaBukhsh is the corresponding author.
In this paper, we address the above research question via two substantial contemporaneous corpora constructed from comments on YouTube videos from major news networks in the US and India. To the best of our knowledge, no prior study has contrasted an Indian English social media corpus with the variety of English observed on social media platforms frequented by native English speakers. We further use two existing corpora of news articles from India and the US to demonstrate that while college-educated, well-formed English across these two language centers does not differ by much, social media Indian English differs from social media US English in certain respects, and hence may pose a greater challenge for conducting meaningful analysis. Apart from using standard tools to assess linguistic quality, we present a novel finding that recent advances in language models can be leveraged to perform automated linguistic quality assessment of human-generated text.

Data Sets
We consider two social media (denoted by the superscript sm) data sets and two news article (denoted by the superscript na) data sets. We denote Indian English and US English as en-in and en-us, respectively. In order to keep our vocabulary statistics comparable, we sub-sample from our en-us social media data set and ensure that both social media corpora have a nearly equal number of tokens. A detailed description of the preprocessing steps is presented in the Appendix.

Why YouTube? Both of our social media corpora are comments on YouTube videos posted within an identical date range (30th January, 2020 to 7th May, 2020). As of January 2020, YouTube is the second-most popular social media platform in the world, drawing 2 billion active users (Statista, 2020b). It is the most popular social media platform in India with 265 million monthly active users, accounting for 80% of the population with internet access (HindustanTimes, 2019; YourStory, 2018).
• D^sm_en-in: We consider a subset of a data set first introduced in KhudaBukhsh et al. (2020b). The original data set consists of 4,511,355 comments by 1,359,638 users on 71,969 YouTube videos from fourteen Indian news outlets posted between 30th January, 2020 and 7th May, 2020. Next, language is detected using L_polyglot, a polyglot-embedding-based language identifier first proposed in Palakodety et al. (2020a) and successfully used in other multilingual contexts (Palakodety et al., 2020c). This yields 1,352,698 English comments (23,124,682 tokens, 2,107,233 sentences). In order to minimize the effects of code switching (Gumperz, 1982; Myers-Scotton, 1993), only sentences with low CMI (code-mixing index) (Das and Gambäck, 2014) are considered. We estimate CMI using the same method presented in KhudaBukhsh et al. (2020a) and set a threshold of 0.1. Upon removal of code-switched sentences, our final data set, D^sm_en-in, consists of 1,923,292 sentences (20,591,213 tokens).
• D^sm_en-us: We consider a subset of a data set first introduced in KhudaBukhsh et al. (2020c). We first obtain 10,245,348 comments posted by 1,690,589 users¹ on 8,593 YouTube videos from three popular US news channels (Fox News, CNN, and MSNBC) (Statista, 2020a) in the same time period. We subsampled the data to make the number of tokens comparable to that of D^sm_en-in. This resulted in D^sm_en-us having 1,573,355 sentences (20,591,220 tokens).
• D^na_en-in consists of 398,960 sentences (9,016,255 tokens) from news articles that appeared in highly circulated Indian news outlets (e.g., The Quint, Hindustan Times, Deccan Herald) (Dai, 2017).
• D^na_en-us consists of 94,463 sentences (2,042,024 tokens) from news articles that appeared in highly circulated US news outlets (e.g., HuffPost, Washington Post, New York Times).
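The CMI filter used to build D^sm_en-in can be sketched as follows. This is a minimal illustration of the code-mixing index (Das and Gambäck, 2014) in fraction form, which matches the 0.1 threshold above; the per-token language tags and the toy filter are assumptions for illustration, standing in for the L_polyglot-based pipeline, which is not reproduced here.

```python
from collections import Counter

def code_mixing_index(token_langs):
    """Code-mixing index (Das and Gambäck, 2014), in [0, 1].

    token_langs: per-token language tags; tokens tagged None are
    treated as language-independent (punctuation, numbers, ...).
    """
    tagged = [lang for lang in token_langs if lang is not None]
    if not tagged:  # no language-bearing tokens: CMI is defined as 0
        return 0.0
    dominant = Counter(tagged).most_common(1)[0][1]
    return 1.0 - dominant / len(tagged)

def keep_sentence(token_langs, threshold=0.1):
    """Mimic the paper's filter: keep sentences with CMI <= 0.1."""
    return code_mixing_index(token_langs) <= threshold

# A monolingual sentence has CMI 0; a heavy en/hi mix scores high.
mono = ["en"] * 8
mixed = ["en", "hi", "en", "hi", None, "en", "hi", "hi"]
print(code_mixing_index(mono))   # 0.0
print(code_mixing_index(mixed))  # ~0.43 (4 hi of 7 tagged tokens)
```

A sentence with a single borrowed token (e.g., one Hindi word in eleven) still passes the 0.1 threshold, so the filter removes heavily code-switched sentences rather than every sentence containing a loanword.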

Vocabulary and Grammar
We conduct a detailed study comparing and contrasting D^sm_en-in and D^sm_en-us. In what follows, we summarize our observations (see Appendix for details).
• Vocabulary: In the context of social media, US English exhibits a richer overlap with a standard English dictionary as compared to Indian English.
Let V_dict denote the English vocabulary obtained from a standard English dictionary (Kelly, 2016).² Let V^sm_en-in and V^sm_en-us denote the vocabularies of D^sm_en-in and D^sm_en-us, respectively. We compute the following overlaps: |V^sm_en-us ∩ V_dict| = 43,826 and |V^sm_en-in ∩ V_dict| = 38,260. Also, with a list of 6,000 important words for the US SAT exam,³ we find that V^sm_en-us has a considerably larger overlap (4,349 words) than V^sm_en-in (3,956 words).
• Spelling deviations: Indian English exhibits larger spelling deviations as compared to US English. Phonetic spelling errors (i.e., spelling a word as it sounds) are common in Indian English. This observation aligns with KhudaBukhsh et al. (2020a).
• Loanwords: Borrowed words, also known as loanwords, are lexical items borrowed from a donor language (Holden, 1976; Calabrese and Wetzels, 2009; Van Coetsem, 2016). For example, the English words avatar and yoga are borrowed from Hindi. We observe that loanwords (e.g., sadhus, begum, burqa, imams, and gully) borrowed from Hindi feature heavily in Indian English.

¹The Jaccard similarity between the two social media user bases of D^sm_en-in and D^sm_en-us is 0.01, indicating minimal overlap between the two user bases.
²We take the union of en-us and en-gb.
³https://satvocabulary.us/INDEX.ASP?CATEGORY=6000LIST
• Article and pronoun usage: Indian English uses considerably fewer articles and pronouns as compared to US English. Pronoun and article omissions in ESL (English as a Second Language) are well-studied phenomena (Ferguson, 1975). Our observation also aligns with a previous field study (Agnihotri et al., 1984) that reported that even college-educated Indians make substantial errors in article usage.
• Preposition usage: Indian English uses considerably fewer prepositions as compared to US English (11.48% in en-us and 10.84% in en-in).
• Verb usage: Indian English uses fewer verbs than US English. Of the different verb forms (see Figure 1), Indian English uses the root form relatively more than US English, indicating a (possibly) poorer command of subject-verb agreement and tense (later verified in Section 3.2).
• Sentence length: We observe shorter sentences in Indian English as compared to US English (average en-in sentence length: 10.71 ± 12.37; average en-us sentence length: 13.09 ± 20.17). We acknowledge that device variability may influence this observation.
• Sentence validity evaluated by a parser: A standard parser evaluates fewer Indian English sentences as valid as compared to US English (see Table 1). However, no such discrepancy was observed in news article English from the two language centers.
• Constituency parser depth: For a given sentence length, Indian English exhibits a smaller average constituency parse tree depth (Joshi et al., 2018), indicating (possible) structural issues. Intuitively, the length of a sentence is likely to be positively correlated with its structural complexity; a long sentence is likely to have more complex (and nested) sub-structures than a shorter one. A parser's ability to correctly identify such sub-structures depends on the sentence's syntactic correctness. To tease apart the relationship between sentence length and constituency parse depth, in Figure 2 we present the average tree depth for a given sentence length. Between the well-formed English corpora, the difference is almost imperceptible. However, as sentence length grows, the gap between the tree depths obtained in social media en-in and the rest widens, indicating possible structural issues. A few example long sentences with small parse-tree depth are presented in the Appendix.
• Generalizability across other native English variants: Our results are consistent when compared against a British English (en-gb) social media corpus.

Cloze Test
Recent advances in Language Models (LMs) such as BERT (Devlin et al., 2019) have led to substantial improvements in several downstream NLP tasks. While obtaining task-specific performance gains has been a key focus area (see, e.g., Liu and Lapata, 2019; Lee et al., 2020), several recent studies have attempted to better understand what exactly these models learn about language that results in these performance gains. LMs' ability to solve the long-distance agreement problem and their general syntactic abilities have been explored previously (Gulordava et al., 2018; Marvin and Linzen, 2018; Goldberg, 2019).
BERT's masked word prediction has a direct parallel in the human psycholinguistics literature (Smith and Levy, 2011). A cloze task is essentially a fill-in-the-blank task: a sentence (or a sentence stem) is presented with a missing word. For instance, in the following cloze task: In the [MASK], it snows a lot, winter is a likely completion for the missing word. In fact, when given this cloze task, BERT outputs the following five seasons ranked by decreasing probability: winter, summer, fall, spring, and autumn. Word prediction as a test of an LM's language understanding has been explored in Paperno et al. (2016) and Ettinger (2020), and recent studies have leveraged it in novel applications such as relation extraction (Petroni et al., 2019) and political insight mining (Palakodety et al., 2020b). Bolstered by these findings and another recent result that uses BERT to evaluate the quality of translations (Zhang* et al., 2020), we propose an approach to estimate language quality using BERT. Our hypothesis is that if a sentence is syntactically consistent and semantically coherent, BERT will be able to predict a masked word in that sentence with higher accuracy than in a syntactically inconsistent or semantically incoherent sentence.
We first motivate our method with two examples. Consider the following classic syntactically correct yet semantically incoherent sentence (Chomsky, 1957): Colorless green ideas sleep furiously. BERT's top five predictions for the cloze task Colorless green [MASK] sleep furiously are the following: (1) eyes, (2) ., (3) ,, (4) they, and (5) I. In fact, none of the sentence's words, when masked, features in BERT's top 100 predictions. However, for another iconic sentence (King, 1968) presented in the following cloze form: I have a [MASK] that my four little children will one day live in a nation where they will not be judged by the color of their skin, but by the content of their character, BERT's top five predictions (feeling, hope, belief, vision, and dream) correctly include dream. Notice that, in our semantically coherent example sentence, all of the predicted words have Part-Of-Speech (POS) agreement with the masked word, while the semantically incoherent sentence produced a wide variety of completion choices that include punctuation, pronouns, and a noun.
We randomly sample 10K sentences from each corpus. For each sentence, we mask a randomly chosen word w such that w ∈ V_dict and construct an input cloze statement. Following standard practice (Petroni et al., 2019), we report p@1, p@5, and p@10 performance. p@K is defined as top-K accuracy, i.e., whether an accurate completion of the masked word is present among the top K retrieved words ranked by probability. Table 2 summarizes BERT's performance in predicting masked words. We were surprised to notice that on well-formed sentences, BERT achieved higher than 80% accuracy, indicating that well-formed sentences leave enough cues for an LM to predict a masked word with high accuracy. We further observe that prediction accuracy is possibly correlated with linguistic quality; the performance on the well-formed text corpora is substantially better than that on the social media text corpora. Finally, between the two social media text corpora, the performance on Indian English is substantially worse, possibly indicating a larger prevalence of grammar, spelling, or semantic disfluencies. A few randomly sampled examples are listed in Table 4. We next compute the fraction of instances where the POS tags of the masked word and the predicted word agree. As shown in Table 3, the POS agreements on the well-formed text corpora are substantially higher than those on the social media corpora.
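The p@K computation described above can be sketched as follows; the ranked prediction lists are assumed to come from a fill-mask model such as BERT, which is not invoked here.

```python
def precision_at_k(ranked_predictions, gold, k):
    """p@K for one cloze instance: 1 if the masked word appears
    among the top-K predictions (ranked by probability), else 0."""
    return int(gold in ranked_predictions[:k])

def corpus_p_at_k(instances, k):
    """Average p@K over (ranked_predictions, gold) pairs."""
    return sum(precision_at_k(r, g, k) for r, g in instances) / len(instances)

# Toy example using the cloze from the text: "In the [MASK], it snows a lot."
ranked = ["winter", "summer", "fall", "spring", "autumn"]
print(precision_at_k(ranked, "winter", 1))  # 1
print(precision_at_k(ranked, "fall", 1))    # 0
print(precision_at_k(ranked, "fall", 5))    # 1
```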
Once again, we observe that of the two social media corpora, the POS agreement on D^sm_en-in is lower than that on D^sm_en-us. Our results indicate that masked word prediction accuracy can be an effective measure for evaluating linguistic quality. Additional results with a British English corpus are presented in the Appendix.

Conclusions
In this paper, we present a comprehensive comparative analysis between Indian English and US English in social media. Our analyses reveal that compared to native English, social media Indian English exhibits certain differences that may add to the challenges of navigating noisy social media texts generated in the Indian sub-continent, and thus presents an opportunity for the NLP community to address these challenges. Recent lines of computational social science (CSS) research focusing on Indian sub-continental issues have emphasized the challenges faced while processing Indian social media data. However, no prior work has contrasted social media Indian English with social media native English. We believe our work will help the research community identify focus areas to facilitate CSS research in this domain. We present a novel approach to perform automated linguistic quality assessment using BERT, a well-known high-performance language model. To the best of our knowledge, our work is the first to test BERT's masked word prediction accuracy on human-generated texts obtained from noisy social media. World variants of English, in both spoken and written form, have been widely studied for several decades; however, limited literature exists on characterizing their social media expressions.

References

SV Shastri. 1996. Using computer corpora in the description of language with special reference to complementation in Indian English. South Asian English: Structure, Use, and Users, 2(4):70–81.

Nathaniel Smith and Roger Levy. 2011. Cloze but no cigar: The complex relationship between cloze, corpus, and subjective probabilities in language processing.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Data Sets
Preprocessing: We apply the following standard preprocessing steps on the raw comments.
• We convert all comments to lowercase and remove all emojis and junk characters.
• We replace multiple occurrences of punctuation with a single occurrence. For example, they got trapped!!!!!!! is converted into they got trapped!.
• We use an off-the-shelf sentence tokenizer from NLTK (Bird and Klein, 2009) to break up the comments into sentences.
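The preprocessing steps above can be sketched as follows; the paper uses NLTK's sentence tokenizer, which we approximate here with a simple regex split to keep the sketch self-contained (an assumption, not the authors' exact pipeline).

```python
import re

def preprocess_comment(comment):
    """Lowercase, drop non-ASCII characters (emojis and junk), and
    collapse repeated punctuation into a single occurrence."""
    text = comment.lower()
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    return text

def split_sentences(text):
    """Crude stand-in for nltk.sent_tokenize: split after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(preprocess_comment("They got trapped!!!!!!!"))  # they got trapped!
```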

Vocabulary and Grammar
Observation: In the context of social media, US English exhibits a richer overlap with a standard English dictionary as compared to Indian English.
Analysis: Let V_dict denote the English vocabulary obtained from a standard English dictionary (Kelly, 2016; we take the union of en-us and en-gb). Let V^sm_en-in and V^sm_en-us denote the vocabularies of D^sm_en-in and D^sm_en-us, respectively. We compute the following overlaps: |V^sm_en-us ∩ V_dict| = 43,826 and |V^sm_en-in ∩ V_dict| = 38,260. Also, with a list of 6,000 important words for the US SAT exam (https://satvocabulary.us/INDEX.ASP?CATEGORY=6000LIST), V^sm_en-us has a considerably larger overlap (4,349 words) than V^sm_en-in (3,956 words).
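The overlap computation is a plain set intersection; a minimal sketch, with toy vocabularies standing in for V_dict and the actual corpus vocabularies:

```python
def vocab_overlap(corpus_vocab, dictionary_vocab):
    """|V_corpus ∩ V_dict|: how many corpus types are dictionary words."""
    return len(corpus_vocab & dictionary_vocab)

# Toy stand-ins for the real vocabularies.
v_dict = {"government", "police", "winter", "president", "shame"}
v_en_in = {"govt", "police", "shame", "corona", "sir"}
v_en_us = {"government", "president", "police", "trump"}

print(vocab_overlap(v_en_us, v_dict))  # 3
print(vocab_overlap(v_en_in, v_dict))  # 2
```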
Observation: In the context of social media, Indian English exhibits larger deviation from standard spellings as compared to US English.
Analysis: We compute the extent of spelling deviations in the following way. For each out-of-vocabulary (OOV) word that appears 5 or more times in a given corpus, we use a standard spell-checker (https://norvig.com/spell-correct.html) to map it to a dictionary word that also appears 5 or more times in the corpus. We observe that, overall, 9,653 en-in words had at least one spelling variation (or error), while 5,436 en-us words had at least one spelling variation (or error). The average numbers of variations (or errors) per word are 2.15 and 1.42 for en-in and en-us, respectively, indicating that Indian English exhibits larger deviation from standard spellings. Qualitatively, we notice that words with a large number of vowels are particularly prone to spelling variations (or errors); for instance, the word violence has the following misspellings in en-in: voilence, voilance, and violance. In en-us, violence did not have any high-frequency (occurring 5 or more times in the corpus) misspelling. We further observe that phonetic spelling errors are highly common in en-in. For instance, the word liar is often misspelled as lier and the word people is often misspelled as peaple.
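The variant-counting procedure can be sketched with a small Levenshtein helper. This is a simplification: we assume the Norvig spell-checker can be approximated by mapping each OOV word to its nearest dictionary word within edit distance 2, which is not the authors' exact method.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def spelling_variants(oov_words, dictionary, max_dist=2):
    """Map each OOV word to its closest dictionary word (ties broken
    arbitrarily) if within max_dist edits; group variants per word."""
    variants = {}
    for w in oov_words:
        best = min(dictionary, key=lambda d: edit_distance(w, d))
        if edit_distance(w, best) <= max_dist:
            variants.setdefault(best, []).append(w)
    return variants

print(edit_distance("voilence", "violence"))  # 2 (two substitutions)
print(spelling_variants(["voilence", "violance", "lier"],
                        {"violence", "liar", "people"}))
```

With the misspellings from the text, this groups voilence and violance under violence and lier under liar, mirroring the per-word variant counts reported above.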
Observation: Loanwords borrowed from Hindi feature heavily in Indian English.
Analysis: Table 6 lists highly frequent words that belong to one social media corpus but are absent in the other. We observe that loanwords (Holden, 1976; Calabrese and Wetzels, 2009; Van Coetsem, 2016) (e.g., sadhus, begum, burqa, imams, and gully) feature in Indian English. A few nouns are actually used in different proper-noun contexts. For example, raga, originally a Hindi loanword that denotes a musical construct, is actually used to refer to Rahul Gandhi, a famous Indian politician. Similarly, newt (a salamander species) and tapper refer to American politician Newt Gingrich and American journalist Jake Tapper, respectively. We note that terms specific to US politics (e.g., gerrymandering, caucuses, senates) and to specific Indian political discourse (e.g., demonetization, secularists) appear solely in the relevant corpus. Words specific to Indian sports culture (e.g., cricketers) only appear in Indian English, while US healthcare-specific words (e.g., deductibles) never appear in Indian English.
Observation: Indian English uses considerably fewer articles and pronouns as compared to US English.
Analysis: We next compute the respective unigram distributions P_en-in and P_en-us. For each token t ∈ V_dict ∩ V^sm_en-us ∩ V^sm_en-in, we compute the scores P_en-in(t) − P_en-us(t) and P_en-us(t) − P_en-in(t) and obtain the top tokens ranked by these scores (indicating increased usage in the respective corpus). Table 7 captures a few examples with the highest difference in unigram distribution. Overall, we notice that considerably fewer articles are used in en-in. Pronoun and article omissions in ESL (English as a Second Language) are well-studied phenomena (Ferguson, 1975). This also aligns with a previous field study (Agnihotri et al., 1984) that reported that even college-educated Indians make substantial errors in article usage.

Table 6: Highly frequent words present in one social media corpus but absent in the other.
Solely present in V^sm_en-in: sadhus, pelting, raga, begum, bole, indigo, demonetization, defaulters, bade, burqa, secularists, demonetisation, rioter, labourer, madrasas, rickshaw, gully, introspect, cricketers, defaulter, imams
Solely present in V^sm_en-us: tapper, impeachable, newt, caucuses, electable, subpoenas, jurors, mittens, clapper, brokered, reassigned, munchkin, gaffe, buybacks, senates, gerrymandering, impeachments, felonies, blowhard, centrists, deductibles
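A minimal sketch of the unigram-difference scoring, with toy corpora standing in for the real data; the restriction to V_dict and both corpus vocabularies is passed in as a shared-vocabulary set.

```python
from collections import Counter

def unigram_dist(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def top_diff_tokens(tokens_a, tokens_b, shared_vocab, k=3):
    """Tokens ranked by P_a(t) - P_b(t), restricted to shared_vocab:
    high scores indicate increased usage in corpus A."""
    pa, pb = unigram_dist(tokens_a), unigram_dist(tokens_b)
    scores = {t: pa.get(t, 0.0) - pb.get(t, 0.0) for t in shared_vocab}
    return sorted(scores, key=scores.get, reverse=True)[:k]

en_us = "the president said the economy is strong".split()
en_in = "govt should help police sir please sir".split()
shared = set(en_us) | set(en_in)
print(top_diff_tokens(en_in, en_us, shared, k=2))  # 'sir' ranks first
```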

Table 7: Tokens with the highest difference in unigram distribution across the two social media corpora.
Top tokens in P_en-us(t) − P_en-in(t): the, a, trump, that, he, to, president, and, I, you, his, it, get, democrats, just, out, up, was, would, about
Top tokens in P_en-in(t) − P_en-us(t): should, sir, u, police, good, are, in, govt, corona, very, please, is, these, them, congress, government, by, shame, only, pm

Observation: In the context of social media, Indian English uses considerably fewer prepositions as compared to US English.
Analysis: We consider a list of highly frequent prepositions and find that Indian English uses fewer prepositions than US English (11.48% in en-us and 10.84% in en-in). We manually inspect the usage of the preposition in in 100 randomly sampled sentences; 97 of these instances are evaluated as correct by our annotators.
Observation: In the context of social media, Indian English uses fewer verbs than US English.
Analysis: In Figure 3, we summarize the relative occurrence of different verb forms. Of the different verb forms, Indian English uses the root form relatively more than US English, indicating a (possibly) poorer command of subject-verb agreement and tense.

Observation: A standard parser evaluates fewer Indian English sentences as valid as compared to US English.
Analysis: We consider the same randomly sampled 10K sentences from each data set and run a well-known constituency parser (Joshi et al., 2018). We first measure the fraction of sentences that are labeled as valid by the parser. Table 8 shows that more than 96% of the sentences in both news article corpora are determined valid by the parser. Understandably, the fraction of valid sentences in the social media corpora is smaller, with D^sm_en-in having fewer valid sentences than D^sm_en-us.

Observation: For a given sentence length, Indian English exhibits a smaller average constituency parse tree depth (Joshi et al., 2018), indicating (possible) structural issues.
Analysis: Intuitively, the length of a sentence is likely to be positively correlated with its structural complexity; a long sentence is likely to have more complex (and nested) sub-structures than a shorter one. A parser's ability to correctly identify such sub-structures depends on the sentence's syntactic correctness. To tease apart the relationship between sentence length and constituency parse depth, in Figure 4 we present the average tree depth for a given sentence length. Between the well-formed English corpora, the difference is almost imperceptible. However, as sentence length grows, the gap between the tree depths obtained in social media en-in and the rest widens, indicating possible structural issues. A few example long sentences with small parse-tree depth are presented in Table 9.

Table 9: Random sample of long sentences in D^sm_en-in with low parse tree depth.
• y another pm help fund what is the need of that coz there is already a pm relief fund and its has a committee with opposition party memeber too .
• thanks to god that we have priminster like a modi ji he is our great mister i salute my priminister may god bless him always he has be long life
• i request news reporter to use mask, plz do this, bcoz you are facing more dangerous situation only for public sake, plz sir i request to inform our reporter,they help ussssss
• if sibal and singhvi becomes enjoy similar positions den its ok for dem kapilbsibal is ant national a gunda good for u sir dey r sour grapes and crook of sonia gandhi

Figure 4: Constituency parser depth. A well-known parser is run on 10K sentences from each corpus. Average parse tree depth is presented for a given sentence length.
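Given bracketed parser output, tree depth is simply the maximum bracket nesting; the sketch below computes it from a Penn-Treebank-style parse string (the Joshi et al. (2018) parser itself is not reproduced here).

```python
def parse_tree_depth(bracketed):
    """Depth of a bracketed constituency parse = maximum nesting
    level of parentheses, e.g. '(S (NP (DT the)))' has depth 3."""
    depth = best = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth -= 1
    return best

shallow = "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"
print(parse_tree_depth(shallow))  # 3
```

For two sentences of equal token length, the one the parser can analyze into nested sub-structures will yield a deeper tree, which is why depth-for-length serves as a proxy for syntactic well-formedness above.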
Observation: Our results are consistent when compared against a British English (en-gb) social media corpus.

Analysis:
We construct an additional contemporaneous corpus of 4,034,513 comments from 57,019 videos from two highly popular British news outlets (BBC and Channel 4). In Table 10