Demographic Dialectal Variation in Social Media: A Case Study of African-American English

Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers. We also provide an ensemble classifier for language identification which eliminates this disparity and release a new corpus of tweets containing AAE-like language.


Introduction
Owing to variation within a standard language, regional and social dialects exist within languages across the world. These varieties or dialects differ from the standard variety in syntax (sentence structure), phonology (sound structure), and the inventory of words and phrases (lexicon). Dialect communities often align with geographic and sociological factors, as language variation emerges within distinct social networks, or is affirmed as a marker of social identity.
As many of these dialects have traditionally existed primarily in oral contexts, they have historically been underrepresented in written sources. Consequently, NLP tools have been developed from text which aligns with mainstream languages. With the rise of social media, however, dialectal language is playing an increasingly prominent role in online conversational text, for which traditional NLP tools may be insufficient. This impacts many applications: for example, dialect speakers' opinions may be mischaracterized under social media sentiment analysis or omitted altogether (Hovy and Spruit, 2016). Since this data is now available, we seek to analyze current NLP challenges and extract dialectal language from online data.
Specifically, we investigate dialectal language in publicly available Twitter data, focusing on African-American English (AAE), a dialect of Standard American English (SAE) spoken by millions of people across the United States. AAE is a linguistic variety with defined syntactic-semantic, phonological, and lexical features, which have been the subject of a rich body of sociolinguistic literature. In addition to the linguistic characterization, reference to its speakers and their geographical location or speech communities is important, especially in light of the historical development of the dialect. Not all African-Americans speak AAE, and not all speakers of AAE are African-American; nevertheless, speakers of this variety have close ties with specific communities of African-Americans (Green, 2002). Due to its widespread use, established history in the sociolinguistic literature, and demographic associations, AAE provides an ideal starting point for the development of a statistical model that uncovers dialectal language. In fact, its presence in social media is attracting increasing interest for natural language processing (Jørgensen et al., 2016) and sociolinguistic (Stewart, 2014; Eisenstein, 2015; Jones, 2015) research.
In this work we:
• Develop a method to identify demographically-aligned text and language from geo-located messages (§2), based on distant supervision of geographic census demographics through a statistical model that assumes a soft correlation between demographics and language.
• Validate our approach by verifying that text aligned with African-American demographics follows well-known phonological and syntactic properties of AAE, and document the previously unattested ways in which such text diverges from SAE (§3).
• Demonstrate racial disparity in the efficacy of NLP tools for language identification and dependency parsing: they perform poorly on this text compared to text associated with white speakers (§4, §5).
• Improve language identification for U.S. online conversational text with a simple ensemble classifier using our demographically-based distant supervision method, aiming to eliminate racial disparity in accuracy rates (§4.2).
• Provide a corpus of 830,000 tweets aligned with African-American demographics.

Identifying AAE from Demographics
The presence of AAE in social media and the generation of resources of AAE-like text for NLP tasks have attracted recent interest in sociolinguistic and natural language processing research. Jones (2015) studies nonstandard AAE orthography on Twitter, while Jørgensen et al. (2015) investigate to what extent such text supports well-known sociolinguistic hypotheses about AAE. Both, however, find AAE-like language on Twitter through keyword searches, which may not yield broad corpora reflective of general AAE use. More recently, Jørgensen et al. (2016) generated a large unlabeled corpus of text from hip-hop lyrics, subtitles from The Wire and The Boondocks, and tweets from a region of the southeast U.S. While this corpus does indeed capture a wide variety of language, we aim to discover AAE-like language by utilizing finer-grained, neighborhood-level demographics from across the country.
Our approach to identifying AAE-like text is to first harvest a set of messages from Twitter, cross-referenced against U.S. Census demographics (§2.1), then to analyze words against demographics with two alternative methods: a seedlist approach (§2.2) and a mixed-membership probabilistic model (§2.3).

Twitter and Census data
In order to create a corpus of demographically-associated dialectal language, we turn to Twitter, whose public messages contain large amounts of casual conversation and dialectal speech (Eisenstein, 2015). It is well-established that Twitter can be used to study both geographic dialectal varieties and minority languages. Some methods exist to associate messages with authors' races; one possibility is to use birth record statistics to identify African-American-associated names, which has been used in (non-social media) social science studies (Sweeney, 2013; Bertrand and Mullainathan, 2003). However, metadata about authors is fairly limited on Twitter and most other social media services, and many supplied names are obviously not real.
Instead, we turn to geo-location and induce a distantly supervised mapping between authors and the demographics of the neighborhoods they live in (O'Connor et al., 2010a; Eisenstein et al., 2011b; Stewart, 2014). We draw on a set of geo-located Twitter messages, most of which are sent on mobile phones, by authors in the U.S. in 2013. (These are selected from a general archive of the "Gardenhose/Decahose" sample stream of public Twitter messages (Morstatter et al., 2013).) Geo-located users are a particular sample of the userbase (Pavalanathan and Eisenstein, 2015), but we expect it is reasonable to compare users of different races within this group.
We look up the U.S. Census blockgroup geographic area that the message was sent in; blockgroups are one of the smallest geographic areas defined by the Census, typically containing a population of 600-3000 people. We use race and ethnicity information for each blockgroup from the Census' 2013 American Community Survey, defining four covariates: percentages of the population that are non-Hispanic whites, non-Hispanic blacks, Hispanics (of any race), and Asian (see appendix for additional details). Finally, for each user $u$, we average the demographic values of all their messages in our dataset into a length-four vector $\pi^{(\mathrm{census})}_u$. Under strong assumptions, this could be interpreted as the probability of which race the user is; we prefer to think of it as a rough proxy for likely demographics of the author and the neighborhood they live in.
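The per-user averaging step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the input layout (one blockgroup vector per message, expressed as fractions), and the covariate order are assumptions made for the example.

```python
from collections import defaultdict

# Assumed covariate order; the paper only states that there are four covariates.
RACES = ("white", "black", "hispanic", "asian")

def average_user_demographics(messages):
    """Average blockgroup demographics over each user's messages.

    messages: iterable of (user_id, vector) pairs, where vector holds the four
    Census fractions for the blockgroup the message was sent from.
    Returns a dict mapping user_id to a length-four list of averages.
    """
    sums = defaultdict(lambda: [0.0] * len(RACES))
    counts = defaultdict(int)
    for user_id, vector in messages:
        for i, value in enumerate(vector):
            sums[user_id][i] += value
        counts[user_id] += 1
    return {u: [s / counts[u] for s in sums[u]] for u in sums}
```

A user who tweets from several blockgroups ends up with a blend of those neighborhoods' demographics, which is why the vector is better read as a rough proxy than as a probability of the user's race.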
Messages were filtered in order to focus on casual conversational text; we exclude tweets whose authors had 1000 or more followers, or that (a) contained 3 or more hashtags, (b) contained the strings "http", "follow", or "mention" (messages designed to generate followers), or (c) were retweeted (either containing the string "rt" or marked by Twitter's metadata as re-tweeted).
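The filtering rules above can be sketched as a single predicate. Several details are assumptions, since the text does not specify them: matching is case-insensitive, hashtags are counted as whitespace-delimited tokens starting with "#", and "rt" is matched as a whole token rather than as a substring (a substring match would also discard words such as "heart").

```python
def keep_message(text, follower_count, retweeted_flag):
    """Return True if a tweet passes the conversational-text filters.

    retweeted_flag stands in for Twitter's retweet metadata (hypothetical name).
    """
    lowered = text.lower()
    tokens = lowered.split()
    if follower_count >= 1000:
        return False
    # (a) three or more hashtags
    if sum(tok.startswith("#") for tok in tokens) >= 3:
        return False
    # (b) strings typical of links or follower-seeking messages
    if any(s in lowered for s in ("http", "follow", "mention")):
        return False
    # (c) retweets, by metadata or by an "rt" token (token match is an assumption)
    if retweeted_flag or "rt" in tokens:
        return False
    return True
```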
Our initial Gardenhose/Decahose stream archive had 16 billion messages in 2013; 90 million were geo-located with coordinates that matched a U.S. Census blockgroup. 59.2 million tweets from 2.8 million users remained after pre-processing; each user is associated with a set of messages and averaged demographics $\pi^{(\mathrm{census})}_u$.

Direct Word-Demographic Analysis
Given a set of messages and demographics associated with their authors, a number of methods could be used to infer statistical associations between language and demographics.
Direct word-demographic analysis methods use the $\pi^{(\mathrm{census})}_u$ quantities to calculate statistics at the word level in a single pass. An intuitive approach is to calculate the average demographics per word. For a token in the corpus indexed by $t$ (across the whole corpus), let $u(t)$ be the author of the message containing that token, and $w_t$ be the word token. The average demographics of word type $w$ is:
$$\pi_{w,k} = \frac{\sum_t 1\{w_t = w\} \, \pi^{(\mathrm{census})}_{u(t),k}}{\sum_t 1\{w_t = w\}}$$
We find that terms with the highest $\pi_{w,AA}$ values (denoting high average African-American demographics of their authors' locations) are very non-standard, while Stewart (2014) and Eisenstein (2013) find large $\pi_{w,AA}$ associated with certain AAE linguistic features.
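As a rough illustration of this single-pass statistic, the sketch below averages the author-level demographic vector over every token of each word type. The function name and input layout are assumptions for the example.

```python
from collections import defaultdict

def word_demographics(tokens, user_demo):
    """Average author demographics per word type.

    tokens: iterable of (word, user_id) pairs, one per token in the corpus.
    user_demo: dict mapping user_id to that user's averaged demographic vector.
    Returns (averages, counts): word -> averaged vector, and word -> token count.
    """
    sums = {}
    counts = defaultdict(int)
    for word, user in tokens:
        vector = user_demo[user]
        if word not in sums:
            sums[word] = [0.0] * len(vector)
        for k, value in enumerate(vector):
            sums[word][k] += value
        counts[word] += 1
    averages = {w: [s / counts[w] for s in sums[w]] for w in sums}
    return averages, dict(counts)
```

Because each token contributes its author's vector once, prolific authors weigh more heavily in a word's average than occasional ones.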
One way to use the $\pi_{w,k}$ values to construct a corpus is through a seedlist approach. In early experiments, we constructed a corpus of 41,774 users (2.3 million messages) by first selecting the $n = 100$ highest-$\pi_{w,AA}$ terms occurring at least $m = 3000$ times across the data set, then collecting all tweets from frequent authors who have at least 10 tweets and frequently use these terms, defined as the case when at least $p = 20\%$ of their messages contain at least one of the seedlist terms. Unfortunately, the $n$, $m$, $p$ thresholds are ad hoc.
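The seedlist procedure can be sketched as below, with the thresholds from the text as defaults. The function name and data layout (per-word AA averages, per-word counts, and tokenized messages grouped by user) are assumptions for the example.

```python
def seedlist_users(pi_w_aa, word_counts, user_messages,
                   n=100, m=3000, p=0.20, min_tweets=10):
    """Select seed terms and the users who use them frequently.

    pi_w_aa: word -> average AA demographic value of the word's authors.
    word_counts: word -> number of occurrences in the data set.
    user_messages: user_id -> list of messages, each a list of tokens.
    """
    # Keep sufficiently frequent words, then take the n with the highest AA average.
    eligible = [w for w, c in word_counts.items() if c >= m]
    seeds = set(sorted(eligible, key=lambda w: pi_w_aa[w], reverse=True)[:n])
    selected = []
    for user, messages in user_messages.items():
        if len(messages) < min_tweets:
            continue
        hits = sum(1 for tokens in messages if seeds.intersection(tokens))
        if hits / len(messages) >= p:
            selected.append(user)
    return seeds, selected
```

The three tunable thresholds are visible directly in the signature, which makes the ad hoc nature of the approach easy to see.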

Mixed-Membership Demographic-Language Model
The direct word-demographics analysis gives useful validation that the demographic information may yield dialectal corpora, and the seedlist approach can assemble a set of users with heavy dialectal usage. However, the approach requires a number of ad hoc thresholds, cannot capture authors who only occasionally use demographically-aligned language, and cannot differentiate language use at the message level. To address these concerns, we develop a mixed-membership model for demographics and language use in social media. The model directly associates each of the four demographic variables with a topic, i.e., a unigram language model over the vocabulary. The model assumes an author's mixture over the topics tends to be similar to their Census-associated demographic weights, and that every message has its own topic distribution. This allows for a single author to use different types of language in different messages, accommodating multidialectal authors. The message-level topic probabilities $\theta_m$ are drawn from an asymmetric Dirichlet centered on $\pi^{(\mathrm{census})}_u$, whose scalar concentration parameter $\alpha$ controls whether authors' language is very similar to the demographic prior, or can have some deviation. A token $t$'s latent topic $z_t$ is drawn from $\theta_m$, and the word itself is drawn from $\phi_{z_t}$, the language model for the topic (Figure 1).
Thus the model learns demographically-aligned language models for each demographic category. The model is much more tightly constrained than a topic model (for example, if $\alpha \to \infty$, $\theta$ becomes fixed and the likelihood is concave as a function of $\phi$), but it still has more joint learning than a direct calculation approach, since the inference of a message's topic memberships $\theta_m$ is affected not just by the Census priors, but also by the language used. A tweet written by an author in a highly AA neighborhood may be inferred to be non-AAE-aligned if it uses non-AAE-associated terms; as inference proceeds, this information is used to learn sharper language models.
We fit the model with collapsed Gibbs sampling (Griffiths and Steyvers, 2004), with repeated sample updates for each token $t$ in the corpus. In these updates, $N_{wk}$ is the number of tokens where word $w$ occurs under topic $z = k$, $N_{mk}$ is the number of tokens in the current message with topic $k$, etc.; all counts exclude the current $t$ position. We observed convergence of the log-likelihood within 100 to 200 iterations, and ran for 300 total. We average together count tables from the last 50 Gibbs samples for analysis of posterior topic memberships at the word, message, and user level; for example, the posterior probability that a particular user $u$ uses topic $k$, $P(z = k \mid u)$, can be calculated as the fraction of tokens with topic $k$ within messages authored by $u$.
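For intuition, a collapsed Gibbs sweep for a model of this shape might look like the sketch below. It is a standard collapsed sampler written under stated assumptions, not a transcription of the authors' update: in particular, the symmetric Dirichlet prior `beta` on each topic's word distribution is hypothetical (the text here does not state the prior on $\phi$), and all function and variable names are invented for the example.

```python
import random

def init_state(docs, K, V, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(K) for _ in words] for words in docs]
    N_wk = [[0] * K for _ in range(V)]   # word-by-topic counts
    N_k = [0] * K                        # tokens per topic
    N_mk = [[0] * K for _ in docs]       # message-by-topic counts
    for m, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[m][i]
            N_wk[w][k] += 1
            N_k[k] += 1
            N_mk[m][k] += 1
    return z, N_wk, N_k, N_mk

def gibbs_sweep(docs, prior, z, N_wk, N_k, N_mk, alpha, beta, rng):
    """Resample the topic of every token once.

    docs[m] is a list of word ids; prior[m] is the Census vector of message m's author.
    """
    K, V = len(N_k), len(N_wk)
    for m, words in enumerate(docs):
        for i, w in enumerate(words):
            k_old = z[m][i]
            # Counts must exclude the token currently being resampled.
            N_wk[w][k_old] -= 1
            N_k[k_old] -= 1
            N_mk[m][k_old] -= 1
            weights = [
                (N_wk[w][k] + beta) / (N_k[k] + V * beta)   # word likelihood (assumed prior)
                * (N_mk[m][k] + alpha * prior[m][k])        # message mixture, centered on Census prior
                for k in range(K)
            ]
            k_new = rng.choices(range(K), weights=weights)[0]
            z[m][i] = k_new
            N_wk[w][k_new] += 1
            N_k[k_new] += 1
            N_mk[m][k_new] += 1
```

A larger `alpha` makes the second factor dominated by the Census prior, which matches the description of $\alpha$ as controlling how closely a message's topic mixture follows the author's demographics.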
We considered $\alpha$ to be a fixed control parameter; setting it higher increases the correlations between $P(z = k \mid u)$ and $\pi^{(\mathrm{census})}_{u,k}$. We view the selection of $\alpha$ as an inherently difficult problem, since the correlation between race and AAE usage is already complicated and imperfect at the author level, and census demographics allow only for rough associations. We set $\alpha = 10$, which yields a posterior user-level correlation between $P(z = AA \mid u)$ and $\pi^{(\mathrm{census})}_{u,AA}$ of approximately 0.8.
This model has broadly similar goals to non-latent, log-linear generative models of text that condition on document-level covariates (Monroe et al., 2008; Eisenstein et al., 2011a; Taddy, 2013). The formulation here has the advantage of fast inference with large vocabularies (since the partition function never has to be computed), and gives probabilistic admixture semantics at arbitrary levels of the data. This model is also related to topic models where the selection of $\theta$ conditions on covariates (Mimno and McCallum, 2008; Ramage et al., 2011; Roberts et al., 2013), though it is much simpler without full latent topic learning.
In early experiments, we used only two classes (AA and not AA), and found Spanish terms being included in the AA topic. Thus we turned to four race categories in order to better draw out non-AAE language. This removed Spanish terms from the AA topic; interestingly, they did not go to the Hispanic topic, but instead to Asian, along with other foreign languages. In fact, the correlation between users' Census-derived proportions of Asian populations, versus this posterior topic's proportions, is only 0.29, while the other three topics correlate to their respective Census priors in the range 0.83 to 0.87. This indicates the "Asian" topic actually functions as a background topic (at least in part).
Better modeling of demographics and non-English language interactions is interesting potential future work. By fitting the model to data, we can directly analyze unigram probabilities within the model parameters φ, but for other analyses, such as analyzing larger syntactic constructions and testing NLP tools, we require an explicit corpus of messages.
To generate a user-based AA-aligned corpus, we collected all tweets from users whose posterior probability of using AA-associated terms under the model was at least 80%, and generated a corresponding white-aligned corpus as well. In order to remove the effects of non-English languages, and given uncertainty about what the model learned in the Hispanic- and Asian-aligned demographic topics, we focused only on AA- and white-aligned language by imposing the additional constraint that each user's combined posterior proportion of Hispanic or Asian language was less than 5%. Our two resulting user corpora contain 830,000 and 7.3 million tweets, whose message IDs we are making available for further research (in conformance with the Twitter API's Terms of Service). In the rest of the work, we refer to these as the AA- and white-aligned corpora, respectively.
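The user-level selection rule can be sketched as follows. The function name and dictionary keys are assumptions for the example, as is the use of the same 80% threshold for the white-aligned corpus (the text only calls it "corresponding").

```python
def assign_corpus(user_posterior, threshold=0.80, max_other=0.05):
    """Assign a user to the AA-aligned corpus, the white-aligned corpus, or neither.

    user_posterior: dict with keys 'AA', 'white', 'hispanic', 'asian' giving the
    user's posterior topic proportions.
    """
    # Exclude users with too much Hispanic- or Asian-topic language.
    if user_posterior["hispanic"] + user_posterior["asian"] >= max_other:
        return None
    if user_posterior["AA"] >= threshold:
        return "AA-aligned"
    if user_posterior["white"] >= threshold:
        return "white-aligned"
    return None
```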

Linguistic Validation
Because validation by manual inspection of our AA-aligned text is impractical, we turn to the well-studied phonological and syntactic phenomena that traditionally distinguish AAE from SAE. We validate our model by reproducing these phenomena, and document a variety of other ways in which our AA-aligned text diverges from SAE.

Lexical-Level Variation
We begin by examining how much AA- and white-aligned lexical items diverge from a standard dictionary. We used SCOWL's largest wordlist with level 1 variants as our dictionary, totaling 627,685 words. We calculated, for each word $w$ in the model's vocabulary, the ratio $r_k(w) = p(w \mid z = k) / p(w \mid z \neq k)$, where the $p(\cdot \mid \cdot)$ probabilities are posterior inferences, derived from averaged Gibbs samples of the sufficient statistic count tables $N_{wk}$. We selected heavily AA- and white-aligned words as those where $r_{AA}(w) \geq 2$ and $r_{white}(w) \geq 2$, respectively. We find that while 58.2% of heavily white-aligned words were not in our dictionary, fully 79.1% of heavily AA-aligned words were not. While a high number of out-of-dictionary lexical items is expected for Twitter data, this disparity suggests that the AA-aligned lexicon diverges from SAE more strongly than the white-aligned lexicon.
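A topic-alignment ratio of this kind can be computed directly from a word-by-topic count table. The sketch below assumes the ratio compares a word's probability under topic $k$ with its probability under all other topics pooled together, and it applies no smoothing; both choices, along with the function name, are assumptions for the example.

```python
def topic_ratio(N_wk, word, k):
    """Ratio of p(word | topic k) to p(word | any other topic).

    N_wk: dict mapping each word to a list of per-topic counts.
    Assumes topic k and the pooled other topics each have at least one token.
    """
    K = len(next(iter(N_wk.values())))
    totals = [sum(counts[j] for counts in N_wk.values()) for j in range(K)]
    p_in = N_wk[word][k] / totals[k]
    out_count = sum(N_wk[word][j] for j in range(K) if j != k)
    out_total = sum(totals[j] for j in range(K) if j != k)
    p_out = out_count / out_total
    # A word never seen outside topic k has an unbounded ratio.
    return float("inf") if p_out == 0 else p_in / p_out
```

With this definition, a threshold of 2 keeps words that are at least twice as probable under the topic of interest as under the remaining topics.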

Internet-Specific Orthographic Variation
We performed an "open vocabulary" unigram analysis by ranking all words in the vocabulary by $r_{AA}(w)$ and browsing them along with samples of their usage. Among the words with high $r_{AA}$, we observe a number of Internet-specific orthographic variations, which we separate into three types: abbreviations (e.g. llh, kmsl), shortenings (e.g. dwn, dnt), and spelling variations which do not correlate to the word's pronunciation (e.g. axx, bxtch). These variations do not reflect features attested in the literature; rather, they appear to be purely orthographic variations highly specific to AAE-speaking communities online. They may highlight previously unknown linguistic phenomena; for example, we observe that thoe (SAE though) frequently appears in the role of a discourse marker instead of its standard SAE usage (e.g. Girl Madison outfit THOE). This new use of though as a discourse marker, which is difficult to observe using the SAE spelling amidst many instances of the SAE usage, is readily identifiable in examples containing the thoe variant. Thus, nonstandard spellings provide valuable windows into a variety of linguistic phenomena.
In the next section, we turn to variations which do appear to arise from known phonological processes.

Phonological Variation
Many phonological features are closely associated with AAE (Green, 2002). While there is not a perfect correlation between orthographic variations and people's pronunciations, Eisenstein (2013) shows that some genuine phonological phenomena, including a number of AAE features, are accurately reflected in orthographic variation on social media. We therefore validate our model by verifying that spellings reflecting known AAE phonological features align with the AA topic, using 31 variant spellings discussed in previous work (Jørgensen et al., 2015; Jones, 2015). These variations display a range of attested AAE phonological features, such as derhotacization (e.g. brotha), deletion of initial g and d (e.g. iont), and realization of voiced th as d (e.g. dey) (Rickford, 1999). Table 1 shows the top five of these words by their $r_{AA}(w)$ ratio. For 30 of the 31 words, $r \geq 1$, and for 13 words, $r \geq 100$, suggesting that our model strongly identifies words displaying AAE phonological features with the AA topic. The sole exception is the word brotha, which appears to have been adopted into general usage as its own lexical item.

Syntactic Variation
We further validate our model by verifying that it reproduces well-known AAE syntactic constructions, investigating three well-attested AAE aspectual or preverbal markers: habitual be, future gone, and completive done (Green, 2002). Table 2 shows examples of each construction.
To search for the constructions, we tagged the corpora using the ARK Twitter POS tagger (Gimpel et al., 2011; Owoputi et al., 2013), which Jørgensen et al. (2015) show has similar accuracy rates on both AAE and non-AAE tweets, unlike other POS taggers. We searched for each construction by searching for sequences of unigrams and POS tags characterizing the construction; e.g. for habitual be we searched for the sequences O-be-V and O-b-V. Nonstandard spellings for the unigrams in the patterns were identified from the ranked analysis of §3.2.
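A search over tagged tokens for these sequences can be sketched as below. The pattern encoding and function names are assumptions for the example; the tags O (pronoun) and V (verb) and the spelling variants follow the patterns named in this section, and the input is assumed to be the tagger's output as (token, tag) pairs.

```python
# Each slot is either ("tag", T) or ("word", set_of_spellings).
PATTERNS = {
    "habitual be": [("tag", "O"), ("word", {"be", "b"}), ("tag", "V")],
    "future gone": [("word", {"gone", "gne", "gon"}), ("tag", "V")],
    "completive done": [("word", {"done", "dne"}), ("tag", "V")],
}

def slot_matches(token, tag, slot):
    """Check one (token, tag) pair against one pattern slot."""
    kind, value = slot
    if kind == "tag":
        return tag == value
    return token.lower() in value

def contains_construction(tagged, pattern):
    """Return True if the pattern occurs as a contiguous sequence in the message.

    tagged: list of (token, tag) pairs for one message.
    """
    n = len(pattern)
    for start in range(len(tagged) - n + 1):
        window = tagged[start:start + n]
        if all(slot_matches(tok, tag, slot)
               for (tok, tag), slot in zip(window, pattern)):
            return True
    return False
```

Matching on surface sequences like this will also pick up some non-target uses, so counts from such a search are best read as approximate.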

Table 2:
Construction | Example | Ratio
O-be/b-V | I be tripping bruh | 11.94
gone/gne/gon-V | Then she gon be single Af | 14.26
done/dne-V | I done laughed so hard that I'm weak | 8.68

We then examine how a message's likelihood of containing each construction varies with its posterior probability of AA. We split all messages into deciles based on the messages' posterior probability of AA. From each decile, we sampled 200,000 messages and calculated the proportion of messages containing the three syntactic constructions.
For all three constructions, we observed the clear pattern that as messages' posterior probabilities of AA increase, so does their likelihood of containing the construction. Interestingly, for all three constructions, frequency of usage peaks at approximately the [0.7, 0.8) decile. One possible reason for the decline in higher deciles might be the tendency of high-AA messages to be shorter; while the mean number of tokens per message across all deciles in our samples is 9.4, the means for the last two deciles are 8.6 and 7.1, respectively.

Evaluation of Existing Classifiers
Language identification, the task of classifying the major world language in which a message is written, is a crucial first step in almost any web or social media text processing pipeline. For example, in order to analyze the opinions of U.S. Twitter users, one might throw away all non-English messages before running an English sentiment analyzer.
For Arabic dialect classification, work has developed corpora in both traditional and Romanized script (Cotterell et al., 2014; Malmasi et al., 2015) and tools that use n-gram and morphological analysis to identify code-switching between dialects and with English (Elfardy et al., 2014).
We take the perspective that since AAE is a dialect of American English, it ought to be classified as English for the task of major world language identification. Lui and Baldwin (2012) develop langid.py, one of the most popular open source language identification tools, training it on over 97 languages from texts including Wikipedia, and evaluating on both traditional corpora and Twitter messages. We hypothesize that if a language identification tool is trained on standard English data, it may exhibit disparate performance on AA- versus white-aligned tweets. Since language identifiers are typically based on character n-gram features, they may get confused by the types of lexical/orthographic divergences seen in §3. To evaluate this hypothesis, we compare the behavior of existing language identifiers on our subcorpora.
We test langid.py as well as the output of Twitter's in-house identifier, whose predictions are included in a tweet's metadata (from 2013, the time of data collection); the latter may give a language code or a missing value (unk or an empty/null value). We record the proportion of non-English predictions by these systems; Twitter-1 does not consider missing values to be a non-English prediction, and Twitter-2 does.
We noticed that emojis had seemingly unintended consequences on langid.py's classifications, so we removed all emojis by stripping characters in the relevant Unicode ranges. We also removed @-mentions.
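A cleaning step of this kind can be sketched with two regular expressions. The specific Unicode ranges below are an assumption chosen for illustration (the text does not list the ranges used), and the function name is hypothetical.

```python
import re

# Assumed ranges: pictographic blocks, miscellaneous symbols and dingbats,
# plus the variation selector and zero-width joiner used in emoji sequences.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
MENTION_RE = re.compile(r"@\w+")

def clean_for_langid(text):
    """Strip @-mentions and emoji characters, then normalize whitespace."""
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.split())
```

For example, `clean_for_langid("@bob lol \U0001F602 ok")` leaves only the two word tokens for the language identifier to score.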
User-level analysis. We begin by comparing the classifiers' behavior on the AA- and white-aligned corpora. Of the AA-aligned tweets, 13.2% were classified by langid.py as non-English; in contrast, 7.6% of white-aligned tweets were classified as such. We observed similar disparities for Twitter-1 and Twitter-2, illustrated in Table 3.
It turns out these "non-English" tweets are, for the most part, actually English. We sampled and annotated 50 tweets from the tweets classified as non-English by each run. Of these 300 tweets, only 3 could be unambiguously identified as written in a language other than English.
Message-level analysis. We examine how a message's likelihood of being classified as non-English varies with its posterior probability of AA. As in §3.4, we split all messages into deciles based on the messages' posterior probability of AA, and predicted language identifications on 200,000 sampled messages from each decile.
For all three systems, the proportion of messages classified as non-English increases steadily as the messages' posterior probabilities of AA increase. As before, we sampled and annotated from the tweets classified as non-English, sampling 50 tweets from each decile for each of the three systems. Of the 1500 sampled tweets, only 13 (∼0.87%) could be unambiguously identified as being in a language other than English.

Adapting Language Identification for AAE
Natural language processing tools can be improved to better support dialects; for example, Jørgensen et al. (2016) use domain adaptation methods to improve POS tagging on AAE corpora. In this section, we contribute a fix to language identification to correctly identify AAE and other social media messages as English.

Ensemble Classifier
We observed that messages where our model infers a high probability of AAE, white-aligned, or "Hispanic"-aligned language almost always are written in English; therefore we construct a simple ensemble classifier by combining it with langid.py.
For a new message $w$, we predict its demographic-language proportions $\hat{\theta}$ via posterior inference with our trained model, given a symmetric $\alpha$ prior over demographic-topic proportions (see appendix for details). The ensemble classifier, given a message, is as follows:
• Calculate langid.py's prediction $\hat{y}$.
• If $\hat{y}$ is English, accept it as English.
• If $\hat{y}$ is non-English, and at least one of the message's tokens is in the demographic model's vocabulary: infer $\hat{\theta}$ and return English only if the combined AA, Hispanic, and white posterior probabilities are at least 0.9. Otherwise return the non-English $\hat{y}$ decision.
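The decision rule above can be sketched as a small function. To keep the example self-contained, the base language identifier and the posterior inference step are passed in as callables; their names, the `"en"` language code, and the dictionary keys are assumptions for the example rather than the authors' interface.

```python
def ensemble_language_id(tokens, langid_predict, infer_theta, vocabulary,
                         threshold=0.9):
    """Combine a base language identifier with demographic-topic inference.

    tokens: list of tokens in the message.
    langid_predict: callable taking the message text and returning a language code.
    infer_theta: callable taking tokens and returning a dict of posterior
        proportions with keys 'AA', 'hispanic', 'white', 'asian'.
    vocabulary: set of words known to the demographic model.
    """
    y_hat = langid_predict(" ".join(tokens))
    if y_hat == "en":
        return "en"
    # Without any in-vocabulary token the model has no evidence; keep the base decision.
    if not any(tok in vocabulary for tok in tokens):
        return y_hat
    theta = infer_theta(tokens)
    english_mass = theta["AA"] + theta["hispanic"] + theta["white"]
    return "en" if english_mass >= threshold else y_hat
```

Because the rule only ever overrides non-English predictions, it can raise recall for English but cannot turn a correct English prediction into an error.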
Another way to view this method is that we are effectively training a system on an extended Twitter-specific English language corpus softly labeled by our system's posterior inference; in this respect, it is related to efforts to collect new language-specific [...]
[Table 4: imputed recall. For the General set these are an approximation; see text.]

Evaluation
Our analysis from §4.1 indicates that this method would correct false negatives for AAE messages in the training set for the model. We further confirm this by testing the classifier on a sample of 2.2 million geolocated tweets sent in the U.S. in 2014, which are not in the training set.
In addition to performance on the entire sample, we examine our classifier's performance on messages whose posterior probability of using AA- or white-associated terms was greater than 0.8 within the sample, which in this section we will call high AA and high white messages, respectively. Our classifier's precision is high across the board, at 100% across manually annotated samples of 200 messages from each sample. Since we are concerned about the system's overall recall, we impute recall (Table 4) by assuming that all high AA and high white messages are indeed English. Recall for langid.py alone is calculated as $n/N$, where $n$ is the number of messages predicted to be English by langid.py, and $N$ is the total number of messages in the set. (This is the complement of Table 3, except evaluated on the test set.) We estimate the ensemble's recall as $(n + m)/N$, where $m = n_{\mathrm{flip}} \cdot P(\mathrm{English} \mid \mathrm{flip})$ is the expected number of correctly changed classifications (from non-English to English) by the ensemble, and the second factor is the precision (estimated as 1.0). We observe that the baseline system has a considerable difference in recall between the groups, which the ensemble removes.
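The recall imputation is simple arithmetic, sketched below. The function and argument names are assumptions for the example, and the numbers in the usage note are made up purely to show the calculation; they are not results from the paper.

```python
def imputed_recall(n_english, n_flip, n_total, flip_precision=1.0):
    """Impute recall, assuming every message in the set is truly English.

    n_english: messages the baseline identifier labels English.
    n_flip: messages the ensemble changes from non-English to English.
    n_total: all messages in the set.
    flip_precision: estimated probability that a flipped message is English.
    Returns (baseline_recall, ensemble_recall).
    """
    baseline = n_english / n_total
    expected_correct_flips = n_flip * flip_precision
    ensemble = (n_english + expected_correct_flips) / n_total
    return baseline, ensemble
```

With illustrative counts of 80 English predictions and 15 flips out of 100 messages, `imputed_recall(80, 15, 100)` gives a baseline of 0.80 and an ensemble estimate of 0.95.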
We also apply the same calculation to the general set of all 2.2 million messages; the baseline classifies 88% as English. This is a less accurate approximation of recall since we have observed a substantial presence of non-English messages. The ensemble classifies an additional 5.4% of the messages as English; since these are all (or nearly all) correct, this reflects at least a 5.4% gain to recall.

Dependency Parser Evaluation
Given the lexical and syntactic variation of AAE compared to SAE, we hypothesize that syntactic analysis tools also have differential accuracy. Jørgensen et al. (2015) demonstrate this for part-of-speech tagging, finding that SAE-trained taggers had disparate accuracy on AAE versus non-AAE tweets.
We assess a publicly available syntactic dependency parser on our AAE and white-aligned corpora. Syntactic parsing for tweets has received some research attention; Foster et al. (2011) create a corpus of constituent trees for English tweets, and Kong et al. (2014)'s Tweeboparser is trained on a Twitter corpus annotated with a customized unlabeled dependency formalism; since its data was uniformly sampled from tweets, we expect it may have low disparity between demographic groups.
We focus on widely used syntactic representations, testing the SyntaxNet neural network-based dependency parser (Andor et al., 2016), which reports state-of-the-art results, including for web corpora. We evaluate it against a new manual annotation of 200 messages, 100 randomly sampled from each of the AA- and white-aligned corpora described in §2.3.
SyntaxNet outputs grammatical relations conforming to the Stanford Dependencies (SD) system (de Marneffe and Manning, 2008), which we used to annotate messages using Brat, comparing to predicted parses for reference. Message order was randomized and demographic inferences were hidden from the annotator. To increase statistical power relative to annotation effort, we developed a partial annotation approach to only annotate edges for the root word of the first major sentence in a message. Generally, we found that SD worked well as a descriptive formalism for tweets' syntax; we describe handling of AAE and Internet-specific non-standard issues in the appendix. We evaluate labeled recall of the annotated edges for each message set:
Parser | AA | Wh. | Difference
SyntaxNet | 64.0 (2.5) | 80.4 (2.2) | 16.3 (3.4)
CoreNLP | 50.0 (2.7) | 71.0 (2.5) | 21.0 (3.7)
Bootstrapped standard errors (from 10,000 message resamplings) are in parentheses; differences are statistically significant ($p < 10^{-6}$ in both cases).
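A message-level bootstrap for the standard error of labeled recall can be sketched as follows. This is an illustration under assumptions: recall is pooled over edges (correct edges divided by gold edges across the resampled messages), and the function names and input layout are invented for the example.

```python
import random

def labeled_recall(messages):
    """Pooled labeled recall over (correct_edges, gold_edges) pairs."""
    correct = sum(c for c, g in messages)
    gold = sum(g for c, g in messages)
    return correct / gold

def bootstrap_se(messages, n_resamples=10000, seed=0):
    """Standard error of labeled recall from resampling whole messages.

    messages: list of (correct_edges, gold_edges) pairs, one per message,
    each with at least one gold edge.
    """
    rng = random.Random(seed)
    n = len(messages)
    stats = []
    for _ in range(n_resamples):
        # Resample messages with replacement, keeping each message's edges together.
        sample = [messages[rng.randrange(n)] for _ in range(n)]
        stats.append(labeled_recall(sample))
    mean = sum(stats) / len(stats)
    variance = sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)
    return variance ** 0.5
```

Resampling whole messages rather than individual edges respects the fact that edges within one message are not independent of each other.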
The white-aligned accuracy rate of 80.4% is broadly in line with previous work (compare to the parser's unlabeled accuracy of 89% on English Web Treebank full annotations), but parse quality is much worse on AAE tweets at 64.0%. We also test the Stanford CoreNLP neural network dependency parser (Chen and Manning, 2014) using the English SD model that outputs this formalism; its disparity is worse. Soni et al. (2014) used a similar parser on Twitter text; our analysis suggests this approach may suffer from errors caused by the parser.

Discussion and Conclusion
We have presented a distantly supervised probabilistic model that employs demographic correlations of a dialect and its speaker communities to uncover dialectal language on Twitter. Our model can also close the gap between NLP tools' performance on dialectal and standard text.
This represents a case study in dialect identification, characterization, and ultimately language technology adaptation for the dialect. In the case of AAE, dialect identification is greatly assisted since AAE speakers are strongly associated with a demographic group for which highly accurate governmental records (the U.S. Census) exist, which we leverage to help identify speaker communities. The notion of non-standard dialectal language implies that the dialect is underrepresented or underrecognized in some way, and thus should be inherently difficult to collect data on; and of course, many other language communities and groups are not necessarily officially recognized. An interesting direction for future research would be to combine distant supervision with unsupervised linguistic models to automatically uncover such underrecognized dialectal language.

[...] where $N_{-t,k} = \sum_{t' \neq t} q_{t'}(k)$ is the soft topic count from other tokens in the message. The final posterior mean of $\theta$ is estimated as $\hat{\theta}_k = (1/T) \sum_t q_t(k)$. We find, similar to Asuncion et al. (2009), that CVB0 has the advantage of simplicity and rapid convergence; $\hat{\theta}$ converges to within absolute 0.001 of a fixed point within five iterations on test cases.
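A CVB0-style inference loop for a single new message, with the topic-word distributions held fixed, might look like the sketch below. The per-token update used here (word probability under the topic times the soft count from other tokens plus a prior weight) is the standard CVB0 form and is an assumption, since the update equation itself is not shown in this text; the function name, the `alpha_k` prior vector, and the `phi` layout are likewise hypothetical.

```python
def cvb0_infer(word_ids, phi, alpha_k, max_iters=50, tol=1e-3):
    """Estimate a message's topic proportions with fixed topic-word probabilities.

    word_ids: list of word ids in the message (must be non-empty).
    phi: phi[w][k] is the probability of word w under topic k (assumed > 0).
    alpha_k: list of per-topic prior weights (equal entries give a symmetric prior).
    """
    if not word_ids:
        raise ValueError("message must contain at least one token")
    K, T = len(alpha_k), len(word_ids)
    q = [[1.0 / K] * K for _ in range(T)]   # soft topic assignment per token
    totals = [sum(q[t][k] for t in range(T)) for k in range(K)]
    theta = [1.0 / K] * K
    for _ in range(max_iters):
        for t, w in enumerate(word_ids):
            # Soft counts from the other tokens: subtract this token's own share.
            weights = [phi[w][k] * (totals[k] - q[t][k] + alpha_k[k])
                       for k in range(K)]
            norm = sum(weights)
            new_q = [x / norm for x in weights]
            for k in range(K):
                totals[k] += new_q[k] - q[t][k]
            q[t] = new_q
        new_theta = [totals[k] / T for k in range(K)]
        converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta
```

The returned vector is the average of the per-token soft assignments, matching the posterior mean estimate described above.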

D Syntactic dependency annotations (§5)
The SyntaxNet model outputs grammatical relations based on Stanford Dependencies version 3.3.0; thus we sought to annotate messages with this formalism, as described in a 2013 revision to de Marneffe and Manning (2008). For each message, we parsed it and displayed the output in the Brat annotation software alongside an unannotated copy of the message, to which we added dependency edges. This allowed us to see the proposed analysis to improve annotation speed and conformance with the grammatical standard. For difficult cases, we parsed shortened, Standard English toy sentences to confirm what relations were intended to be used to capture specific syntactic constructs. Sometimes this clearly contradicted the annotation standards (probably due to mismatch between the annotations it was trained on versus the version of the dependencies manual we viewed); we typically deferred to the parser's interpretation in such cases.
In order to save annotation effort for this evaluation, we took a partial annotation approach: for each message, we identified the root word of the first major sentence in the message (typically the main verb) and annotated its immediate dependent edges. Thus for every tweet, the gold standard included one or more labeled edges, all rooted in a single token. As opposed to completely annotating all words in a message, this allowed us to cover a broader set of messages, increasing statistical power.
Punctuation edges (punct) were not annotated. We found discourse edges (discourse) to be difficult annotation decisions, since in many cases the dependent was debatably in a different utterance. We tended to defer to the parser's predictions when in doubt. The partial labeling approach does not penalize the parser if the annotator gives too few edges, but these issues would have to be tackled to create a full treebank in future work.

E Annotation materials
We supply our annotations with the online materials, as well as working notes about difficult cases. Annotations are formatted in Brat's plaintext format.