Inducing a lexicon of sociolinguistic variables from code-mixed text

Sociolinguistics is often concerned with how variants of a linguistic item (e.g., nothing vs. nothin’) are used by different groups or in different situations. We introduce the task of inducing lexical variables from code-mixed text: that is, identifying equivalence pairs such as (football, fitba) along with their linguistic code (football→British, fitba→Scottish). We adapt a framework for identifying gender-biased word pairs to this new task, and present results on three different pairs of English dialects, using tweets as the code-mixed text. Our system achieves precision of over 70% for two of these three datasets, and produces useful results even without extensive parameter tuning. Our success in adapting this framework from gender to language variety suggests that it could be used to discover other types of analogous pairs as well.


Introduction
Large social media corpora are increasingly used to study variation in informal written language (Schnoebelen, 2012; Bamman et al., 2014; Nguyen et al., 2015; Huang et al., 2016). An outstanding methodological challenge in this area is the bottom-up discovery of sociolinguistic variables: linguistic items with identifiable variants that are correlated with social or contextual traits such as class, register, or dialect. For example, the choice of the term rabbit versus bunny might correlate with audience or style, while fitba is a characteristically Scottish variant of the more general British football.
To date, most large-scale social media studies have examined the usage of individual variant forms (Eisenstein, 2015; Pavalanathan and Eisenstein, 2015). Studying how a variable alternates between its variants controls better for 'Topic Bias' (Jørgensen et al., 2015), but identifying the relevant variables and variants may not be straightforward.
For example, Shoemark et al. (2017b) used a data-driven method to identify distinctively Scottish terms, and then manually paired them with Standard English equivalents, a labour-intensive process that requires good familiarity with both language varieties. Our aim is to facilitate the process of curating sociolinguistic variables by providing researchers with a ranked list of candidate variant pairs, which they only have to accept or reject.
This task, which we term lexical variable induction, can be viewed as a type of bilingual lexicon induction (Haghighi et al., 2008; Zhang et al., 2017). However, while most work in that area assumes that monolingual corpora are available and labeled according to which language they belong to, in our setting there is a single corpus containing code-mixed text, and we must identify both translation equivalents (football, fitba) as well as their linguistic codes (football→British, fitba→Scottish).
To illustrate, here are some excerpts of tweets from the Scottish dataset analysed by Shoemark et al., with Standard English glosses given below each excerpt:

1. need to come hame fae the football
   need to come home from the football

2. miss the fitba
   miss the football

3. awwww man a wanty go tae the fitbaw
   awwww man I want to go to the football

The lexical variable induction task is challenging: we cannot simply classify documents containing fitba as Scottish, since the football variant may also occur in otherwise distinctively Scottish texts, as in (1). Moreover, if we start by knowing only a few variables, we would like a way to learn what other likely variables might be. Had we not known the (football, fitba) variable, we might not detect that (2) was distinctively Scottish. Our proposed system can make identifying variants quicker, and can also suggest variant pairs a researcher might not have otherwise considered, such as (football, fitbaw), which could be learned from tweets like (3).
Our task can also be viewed as the converse of the one addressed by Donoso and Sanchez (2017), who proposed a method to identify geographical regions associated with different linguistic codes, using pre-defined lexical variables. Also complementary is the work of Kulkarni et al. (2016), who identified words which have the same form but different semantics across different linguistic codes; here, we seek to identify words which have the same semantics but different forms.
We frame our task as a ranking problem, aiming to generate a list in which the best-ranked pairs consist of words that belong to different linguistic codes, but are otherwise semantically and syntactically equivalent. Our approach is inspired by the work of Schmidt (2015) and Bolukbasi et al. (2016), who sought to identify pairs of words that exhibit gender bias in their distributional statistics, but are otherwise semantically equivalent. Their methods differ in the details but use a similar framework: they start with one or more seed pairs such as {(he, she), (man, woman)} and use these to extract a 'gender' component of the embedding space, which is then used to find and rank additional pairs.
Here, we replace the gendered seed pairs with pairs of sociolinguistic variants corresponding to the same variable, such as {(from, fae), (football, fitba)}. In experiments on three different datasets of mixed English dialects, we demonstrate useful results over a range of hyperparameter settings, with precision@100 of over 70% in some cases using as few as five seed pairs. These results indicate that the embedding space contains structured information not only about gendered usage, but also about other social aspects of language, and that this information can potentially be used as part of a sociolinguistic researcher's toolbox.

Methods
Our method consists of the following steps.

Train word embeddings We use the Skipgram algorithm with negative sampling (Mikolov et al., 2013) on a large corpus of code-mixed text to obtain a unit-length embedding $w$ for each word in the input vocabulary $V$.

Extract 'linguistic code' component Using seed pairs $S = \{(x_i, y_i),\ i = 1 \ldots n\}$, we compute a vector $c$ representing the component of the embedding space that aligns with the linguistic code dimension. Both Schmidt and Bolukbasi et al. were able to identify gender-biased word pairs using only a single seed pair, defining the 'gender' component as $c = w_{she} - w_{he}$. However, there is no clear prototypical pair for dialect relationships, so we average over our seed pairs, defining $c = \frac{1}{n}\sum_{i=1}^{n}(w_{y_i} - w_{x_i})$. We experiment with the number of required seed pairs in §5.
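To make the first two steps concrete, here is a minimal sketch using gensim's Skipgram implementation and numpy. The corpus variable `tokenised_tweets`, the example seed pairs, and the hyperparameter values are illustrative assumptions, not the exact settings used in our experiments.

```python
import numpy as np
from gensim.models import Word2Vec

# Skipgram with negative sampling over the code-mixed corpus; the
# hyperparameter values here are placeholders (we tune them in Section 5).
model = Word2Vec(sentences=tokenised_tweets,  # list of token lists, one per tweet
                 sg=1, negative=5, vector_size=300, window=5, min_count=50)

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Seed pairs (x_i, y_i): x_i from one code (here BrEng), y_i from the other (Scottish).
seed_pairs = [("from", "fae"), ("football", "fitba")]

# MEANSDIFF: the 'linguistic code' component is the mean of the per-pair
# difference vectors, analogous to c = w_she - w_he in the gender setting.
c = unit(np.mean([unit(model.wv[y]) - unit(model.wv[x])
                  for x, y in seed_pairs], axis=0))
```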
Threshold candidate pairs From the set of all word pairs in $V \times V$, we generate a set of candidate output pairs. We follow Bolukbasi et al. (2016) and consider only pairs whose embeddings meet a minimum cosine similarity threshold $\delta$. We set $\delta$ automatically using our seed pairs: for each seed pair $(x_i, y_i)$ we compute $\cos(x_i, y_i)$ and set $\delta$ equal to the lower quartile of the resulting set of cosine similarities.
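A sketch of the thresholding step, continuing the names from the previous block. Note that enumerating all of $V \times V$ is quadratic in the vocabulary size; in practice one would restrict the vocabulary or use an approximate nearest-neighbour index.

```python
def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(unit(a), unit(b)))

# delta = lower quartile of the seed pairs' cosine similarities.
seed_sims = [cos(model.wv[x], model.wv[y]) for x, y in seed_pairs]
delta = float(np.quantile(seed_sims, 0.25))

# Keep unordered candidate pairs whose similarity meets the threshold.
vocab = model.wv.index_to_key
candidates = [(wi, wj)
              for i, wi in enumerate(vocab)
              for wj in vocab[i + 1:]
              if cos(model.wv[wi], model.wv[wj]) >= delta]
```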
Rank candidate pairs Next we use $c$ to rank the remaining candidate pairs such that the top-ranked pairs are the most indicative of distinct linguistic codes, but are otherwise semantically equivalent. We follow Bolukbasi et al. (2016), setting $\mathrm{score}(w_i, w_j) = \cos(c, w_i - w_j)$.
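Since $\cos(c, w_i - w_j) = -\cos(c, w_j - w_i)$, one convenient way to implement the ranking (a sketch, not necessarily the exact implementation used in our experiments) is to rank unordered candidates by absolute score and use the sign to orient each pair:

```python
def score(wi, wj):
    # Large positive when w_i - w_j points along the code direction c.
    return cos(c, model.wv[wi] - model.wv[wj])

# Rank unordered candidates by the magnitude of the score.
ranked = sorted(candidates, key=lambda p: abs(score(*p)), reverse=True)

# With c pointing from code A toward code B (as defined above), a positive
# score means the first word is the code-B variant; flip those pairs so the
# output matches the (code A, code B) order of the seeds.
ranked = [(b, a) if score(a, b) > 0 else (a, b) for a, b in ranked]
```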
Filter top-ranked pairs High dimensional embedding spaces often contain 'hub' vectors, which are the nearest neighbours of a disproportionate number of other vectors (Radovanović et al., 2010). In preliminary experiments we found that many of our top-ranked candidate pairs included such 'hubs', whose high cosine similarity with the word vectors they were paired with did not reflect genuine semantic similarity. We therefore discard all pairs containing words that appear in more than m of the top-n ranked pairs (we fixed m = 20, n = 20k for BrEng/Scottish; m = 5, n = 5k for GenAm/AAVE; and m = 10, n = 5k for BrEng/GenAm).
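A sketch of the hub filter; whether the filter is applied to the whole ranked list or only to the portion returned to the user is an implementation choice.

```python
from collections import Counter

def filter_hubs(ranked_pairs, m, n):
    """Drop pairs containing words that occur in more than m of the top-n pairs."""
    counts = Counter(w for pair in ranked_pairs[:n] for w in pair)
    return [pair for pair in ranked_pairs
            if all(counts[w] <= m for w in pair)]

output = filter_hubs(ranked, m=20, n=20_000)  # BrEng/Scottish setting
```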
Data

We test our methods on three pairs of language varieties: British English vs Scots/Scottish English; British English vs General American English; and General American English vs African American Vernacular English (AAVE). Within each dataset, individual tweets may contain words from one or both codes of interest, and the only words with a known linguistic code (or which are known to have a corresponding word in the other code) are those in the seed pairs.
BrEng/Scottish For our first test case, we combined the two datasets collected by Shoemark et al. (2017a), consisting of the complete Aug-Oct 2014 tweet histories of users who, in the preceding year, had posted at least one tweet that was either geotagged to a location in Scotland or contained a hashtag relating to the 2014 Scottish Independence referendum. The corpus contains 9.4M tweets.
For seeds, we used the 64 pairs curated by Shoemark et al. (2017b). Half are discourse markers or open-class words, e.g. (dogs, dugs), (gives, gees), and half are closed-class words, e.g. (have, hae), (one, yin). The full list is included in the Supplement.
BrEng/GenAm For our next test case we recreated the entire process of collecting data and seed variables from scratch. We extracted 8.3M tweets geotagged to locations in the USA from a three-year archive of the public 1% sample of Twitter (1 Jul 2013 - 30 Jun 2016). All tweets were classified as English by langid.py (Lui and Baldwin, 2012), none are retweets, none contain URLs or embedded media, and none are by users with more than 1000 friends or followers. We combined this data with a similarly constructed corpus of 1.7M tweets geotagged to the UK and posted between 1 Sep 2013 and 30 Sep 2014.
To create the seed pairs, we followed Shoemark et al. (2017b) and used the Sparse Additive Generative Model of Text (SAGE) (Eisenstein et al., 2011) to identify the terms that were most distinctive to UK or US tweets. However, most of these terms turned out to represent specific dialects within each country, rather than the standard BrEng or GenAm dialects (we discuss this issue further below). We therefore manually searched through the UK terms to identify those that are standard BrEng and differ from GenAm by spelling only, and paired each one with its GenAm spelling variant, e.g. (color, colour), (apologize, apologise), (pajamas, pyjamas). This process involved looking through thousands of words to identify only 27 pairs (listed in the Supplement), which strongly motivates our proposed method as a more efficient way to expand the set of pairs.

GenAm/AAVE While creating the previous dataset, we noticed that many of the terms identified by SAGE as distinctively American were actually from AAVE. To create our GenAm/AAVE seed pairs, we manually cross-referenced the most distinctively 'American' terms with the AAVE phonological processes described by Rickford (1999). We then selected terms that reflected these processes, paired with their GenAm equivalents, e.g. (about, bou), (brother, brudda). The full list of 19 open-class and 20 closed-class seed pairs is included in the Supplement.

Evaluation Procedure
We evaluate our systems using Precision@K, the percentage of the top K ranked word pairs judged to be valid sociolinguistic variables. We discard any seed pairs from the output before computing precision. Since we have no gold standard translation dictionaries for our domains of interest, each of the top K pairs was manually classified as either valid or invalid by the first author.
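The metric itself is straightforward; a minimal sketch, where `judge` is a hypothetical stand-in for the manual validity judgement:

```python
def precision_at_k(ranked_pairs, seed_pairs, judge, k=100):
    """Fraction of the top-k non-seed pairs judged to be valid variables."""
    seeds = set(seed_pairs)
    output = [p for p in ranked_pairs if p not in seeds][:k]
    return sum(1 for p in output if judge(p)) / k
```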
For a pair to be judged as valid, (a) each member must be strongly associated with one or the other language variety, (b) they must be referentially, functionally, and syntactically equivalent, and (c) the two words must be ordered correctly according to their language varieties, e.g. if the seeds were (BrEng, GenAm) pairs, then the BrEng words should also come first in the top-K output pairs. Evaluation judgements were based on the author's knowledge of the language varieties in question; for unfamiliar terms, tweets containing the terms were sampled and manually inspected, and cross-referenced with urbandictionary.com and/or existing sociolinguistic literature.
Our strict criteria exclude pairs like (dogs, dug) which differ in their inflection, or (quid, dollar) whose referents are distinct but are equivalent across cultures. In many cases it was difficult to judge whether or not a pair should be accepted, such as when not all senses of the words were interchangeable, e.g. BrEng/GenAm (folk, folks) works for the 'people' sense of folk, but not the adjectival sense: (folk music, *folks music). The BrEng/GenAm dataset also yielded many pairs of words that exhibit different frequencies of usage in the two countries, but where both words are part of both dialects, such as (massive, huge), (vile, disgusting), and (horrendous, awful). We generally marked these as incorrect, although the line between these pairs and clear-cut lexical alternations is fuzzy. For some applications, it may be desirable to retrieve pairs like these, in which case the precision scores we report here are very conservative.

Results and Discussion
We started by exploring how the output precision is affected by the hyperparameters of the word embedding model: the number of embedding dimensions, the size of the context window, and the minimum frequency below which words are discarded. Results (Figure 1) show that the context window size does not make much difference and that the best scores for each language pair use a minimum frequency threshold of 50-100. The main variability seems to be in the optimal number of dimensions, which is much higher for the BrEng/Scottish dataset than for GenAm/AAVE. Although the precision varies considerably, it is over 40% for most settings, which means a researcher would need to manually check only a bit over twice as many pairs as needed for a study, rather than sifting through a much larger list of individual words and trying to come up with the correct pairs by hand. Results for BrEng/GenAm are worse than for the other two datasets, for reasons which become clear when we look at the output.
Table 1 shows the top 10 generated pairs for each pair of language varieties, using the best hyperparameters for each dataset. The top 100 are given in the Supplement. According to our strict evaluation criteria, many of the output pairs for the BrEng/GenAm dataset were scored as incorrect. However, most of them are actually sensible, and are examples of the kinds of grey areas and cultural analogies (e.g., (amsterdam, vegas), (bbc, cnn)) that we discussed in §4. These types of pairs likely predominate because BrEng and GenAm are both standardized dialects with very little difference at the lexical level, so cultural analogies and frequency effects are the most salient differences. To show how many pairs can be identified effectively, Figure 2 plots Precision@K as a function of K ∈ {1, ..., 300}. For BrEng/Scottish and GenAm/AAVE, more than 70% of the top-100 ranked word pairs are valid. Precision drops off fairly slowly, and is still at roughly 50% for these two datasets even when returning 300 pairs.
To assess the contribution of the 'linguistic code' component, we compared the performance of our system with a naïve baseline which does not use the 'linguistic code' vector $c$ at all. Since translation equivalents such as fitba and football are likely to be very close to one another in the embedding space, it is worth checking whether they can be identified on that basis alone. The baseline ranks all unordered pairs of words in the vocabulary just by their cosine similarity, $\cos(w_i, w_j)$. Since this baseline gives us no indication of which word belongs to which language variety, we evaluated it only on its ability to correctly identify translation equivalents (i.e. using criteria (a) and (b), see §4), and gave it a free pass on assigning the variants to the correct linguistic codes (criterion (c)). Results are in Table 2.

                   Baseline   Our Method
  BrEng/Scottish     0.00        0.71
  BrEng/GenAm        0.04        0.32
  GenAm/AAVE         0.08        0.74

Table 2: Precision@100 for our method and the baseline for each language pair.

Despite its more lenient evaluation criteria, the baseline performs very poorly. Perhaps if we were starting with a pre-defined set of words from one language variety which were known to have equivalents in the other, then simply returning their nearest neighbours might be sufficient. However, in this more difficult setting where we lack prior knowledge about which words belong to our codes of interest, an additional signal is clearly needed.
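In code, the baseline amounts to dropping $c$ entirely and sorting by similarity alone (a sketch, reusing the names from §2):

```python
# Naive baseline: rank unordered candidate pairs purely by cosine similarity,
# with no notion of which word belongs to which code.
baseline = sorted(candidates,
                  key=lambda p: cos(model.wv[p[0]], model.wv[p[1]]),
                  reverse=True)
```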
Finally, we examined how performance depends on the particular seed pairs we used and on the number of seed pairs. Using the BrEng/Scottish and GenAm/AAVE datasets, we subsampled between 1 and 30 seed pairs from our original sets. Over 10 random samples of each size, we found similar average performance using just 5 seed pairs as when using the full original sets (see Figure 3). Performance increased slightly when using only open-class seed pairs: P@100 rose to 0.77 for Scottish/BrEng and 0.75 for GenAm/AAVE (cf. 0.71 and 0.74 using all the original seed pairs). These results indicate that our method is robust to the number and quality of seed pairs.

Conclusion
Overall, our results demonstrate that sociolinguistic information is systematically encoded in the word embedding space of code-mixed text, and that this implicit structure can be exploited to identify sociolinguistic variables along with their linguistic code. Starting from just a few seed variables, a simple heuristic method is sufficient to identify a large number of additional candidate pairs with precision of 70% or more. Results are somewhat sensitive to hyperparameter settings, but even non-optimal settings produce results that are likely to save time for sociolinguistic researchers.
Although we have so far tested our system only on varieties of English, we expect it to perform well with other pairs of language varieties which have a lot of vocabulary overlap or are frequently code-mixed within sentences or short documents, including code-mixed languages as well as dialects. This framework may also be useful for identifying variation across genres or registers.
A Seed variables

Tables 1 to 3 list the full sets of seed variables that were used in our experiments.

B.1 Alternative methods for extracting 'linguistic code' component
In addition to the very simple method presented in the main paper (which we will refer to here as MEANSDIFF), we tested two additional methods for combining multiple seed pairs to identify a single 'linguistic code' component.
The first is a version of the method Bolukbasi et al. used in their full debiasing algorithm, which we call OFFSETSPCA. We compute the mean $m_i$ of each seed pair $(x_i, y_i)$ and the offset vectors $m_i - x_i$ and $m_i - y_i$. We then apply PCA to the resulting collection of offsets, and set $c$ equal to the first principal component. We also consider a simpler variant, INDIVPCA, wherein we define $c$ to be the first principal component of the set of all our individual seed vectors; no pairwise information is used.
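A sketch of both variants using scikit-learn's PCA, reusing the unit-length embeddings and seed pairs from the main paper; the matrix construction is illustrative.

```python
from sklearn.decomposition import PCA

X = np.stack([unit(model.wv[x]) for x, _ in seed_pairs])  # code-A seed vectors
Y = np.stack([unit(model.wv[y]) for _, y in seed_pairs])  # code-B seed vectors

# OFFSETSPCA: PCA over each seed word's offset from its pair mean.
means = (X + Y) / 2.0
offsets = np.concatenate([means - X, means - Y])
c_offsetspca = PCA(n_components=1).fit(offsets).components_[0]

# INDIVPCA: first principal component of the individual seed vectors;
# the pairing structure is ignored.
c_indivpca = PCA(n_components=1).fit(np.concatenate([X, Y])).components_[0]
```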

B.2 Alternative methods for ranking candidate word pairs
As well as the ranking method that we used throughout the main paper (which we adopted from Bolukbasi et al. (2016) and will refer to here as DIFFSIM), we also tested an alternative ranking method devised by Schmidt (2015). Schmidt's method, which we will call REJECT, first 'rejects' $c$ from each word $w_i$ in a candidate pair, projecting $w_i$ onto the hyperplane orthogonal to $c$: $w_i^{\perp} = w_i - (w_i \cdot c)\,c$ (for unit-length $c$). It then defines $\mathrm{score}(w_i, w_j)$ as the ratio between the similarity of the rejected vectors and that of the originals: $\mathrm{score}(w_i, w_j) = \cos(w_i^{\perp}, w_j^{\perp}) / \cos(w_i, w_j)$.
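A minimal sketch of REJECT under the reconstruction above, assuming a unit-length code vector `c` from whichever extraction method is in use:

```python
def reject(w):
    """Project w onto the hyperplane orthogonal to c (c is unit-length)."""
    return w - np.dot(w, c) * c

def reject_score(wi, wj):
    a, b = model.wv[wi], model.wv[wj]
    # High when removing the code component leaves the pair at least as
    # similar as before, i.e. their difference lay mostly along c.
    return cos(reject(a), reject(b)) / cos(a, b)
```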

B.3 Results of comparison
Table 4 compares the different methods outlined above for extracting the 'linguistic code' component and ranking the candidate word pairs. There is little difference in P@100 between OFFSETSPCA, INDIVPCA, and MEANSDIFF except on the most difficult dataset (BrEng/GenAm), where INDIVPCA (the only method that doesn't explicitly use pairwise information) fails entirely. DIFFSIM and REJECT perform similarly except on the AAVE/GenAm dataset, providing some evidence that DIFFSIM is more robust (though recall that we tuned our embedding hyperparameters using DIFFSIM; another setting might yield better results for REJECT).

C Output from our system
Tables 8 to 10 display the top-100 ranked variant pairs generated by our system for each language pair. Although there are admittedly some inconsistencies in the kinds of pairs that were accepted or rejected, our system clearly returned many more clear-cut lexical alternations for BrEng/Scottish and GenAm/AAVE than for BrEng/GenAm. That being said, many of the BrEng/GenAm pairs we rejected do accurately reflect cultural differences between the UK and USA.

D Output from baseline
Tables 5 to 7 show the top-10 ranked variant pairs generated by the simple cosine-similarity baseline for each language pair.None of these were judged to be correct.

Figure 2: Precision@K from K=1 to 300 for each language variety pair.

Figure 3: Mean Precision@K curves for different sized samples from the original seed pair lists. Each curve is averaged across 10 random samples of n seed pairs, for n ∈ {1, 5, 10, 20, 30}.

Table 1: Top 10 ranked variables for each language pair (invalid variables in italics).

Table 5: Top 10 variables generated by the baseline for BrEng/Scottish.

Table 6: Top 10 variables generated by the baseline for BrEng/GenAm.

Table 7: Top 10 variables generated by the baseline for GenAm/AAVE.

Table 8: Top 100 generated variant pairs for British English vs Scots/Scottish English. Pairs we accepted are in bold, and those we rejected are in italics.

Table 9: Top 100 generated variant pairs for British English vs General American English. Pairs we accepted are in bold, and those we rejected are in italics.