Aye or naw, whit dae ye hink? Scottish independence and linguistic identity on social media

Political surveys have indicated a relationship between a sense of Scottish identity and voting decisions in the 2014 Scottish Independence Referendum. Identity is often reflected in language use, suggesting the intuitive hypothesis that individuals who support Scottish independence are more likely to use distinctively Scottish words than those who oppose it. In the first large-scale study of sociolinguistic variation on social media in the UK, we identify distinctively Scottish terms in a data-driven way, and find that these terms are indeed used at a higher rate by users of pro-independence hashtags than by users of anti-independence hashtags. However, we also find that in general people are less likely to use distinctively Scottish words in tweets with referendum-related hashtags than in their general Twitter activity. We attribute this difference to style shifting relative to audience, aligning with previous work showing that Twitter users tend to use fewer local variants when addressing a broader audience.


Introduction
A central idea from sociolinguistics is that people's social identity is reflected in their use of language, and that people modulate their use of language in order to present particular identities in different situations. The recent availability of social media data has raised interest in confirming and extending these results using large-scale datasets. For example, Twitter data has been used to examine patterns of regional variation in general US English (Doyle, 2014; Huang et al., 2015), African American English (Jones, 2015), and global Spanish (Gonçalves and Sánchez, 2014), and to study variation associated with factors such as race/ethnicity (Jones, 2015; Blodgett et al., 2016; Jørgensen et al., 2015) and gender (Bamman et al., 2014). These studies have shown that tweets mirror spoken language in many ways, such as displaying dialect variation not only in the use of distinct lexical items, but also in the use of non-standard spellings to indicate non-standard pronunciation; in fact, these spellings even reflect the phonological processes found in spoken language (Eisenstein, 2015). There is also evidence that, as in spoken language, individuals may shift their style of language in response to the audience. In particular, studies have found that when the expected audience of a tweet is larger, Americans use fewer non-standard and local words (Pavalanathan and Eisenstein, 2015) and Dutch bilingual speakers of a minority language are more likely to use Dutch rather than their other language (Nguyen et al., 2015). A small-scale case study of a single Scottish Twitter user also provides preliminary evidence that users may modulate their production of regional variants according to the topic of the tweet (Tatman, 2015).
Here we present the first large-scale sociolinguistic study of British tweets, and the first to examine the relationship between sociolinguistic variation and political views using social media data. We use a large corpus of tweets to examine the relationship between users' linguistic choices and their views about the 2014 Scottish independence referendum. The referendum (on whether Scotland should leave the UK) generated considerable political discussion and an unprecedented turnout of 84.6% of the electorate, with the 'No' (anti-independence) side taking 55.3% of the vote. The 2013 Scottish Social Attitudes Survey (ScotCen, 2013) showed a clear correlation between national identity and voting intentions (53% of those who identified as 'Scottish not British' said they intended to vote 'Yes' to independence, vs. just 5% of those who identified as 'British not Scottish'), and there was much discussion in the popular press about the relationship between a sense of Scottish identity and support for Scottish sovereignty.
Although this recent discussion was not centered on language, there is a long history of scholarly discourse connecting the use of the Scots language 1 and sociolinguistic and political identity (Grant, 1931;Mcafee, 1985;Corbett et al., 2003). If this connection still holds today, then we might expect to find that those on the 'Yes' side of the debate use more identifiably Scottish language than those on the 'No' side. We might also expect to find some modulation of Scottish language use depending on whether users are discussing the referendum or not.
To examine these questions, we used a data-driven approach to identify linguistic terms that are used more in Scotland than in the rest of the UK. The identified terms include uniquely Scots words that are attested in Scots literature dating back to the 1600s and earlier, contemporary regional colloquialisms, spelling variants of Standard English words which reflect Scottish pronunciations, and acronyms used as shorthand for distinctive Scottish phrases. From these, we selected variables for which users can produce either a Standard English or Scottish variant (e.g., DO vs. DAE). We then classified users as pro- or anti-independence based on the referendum-related hashtags they used and asked whether these two groups use Scottish variants at different rates. We found that the pro-independence group did use Scottish variants significantly more than the anti-independence group, although the overall rate of Scottish variants is very low amongst all users.
Next, we compared the use of Scottish variants in tweets containing referendum-related hashtags to their use in other tweets. If users are aiming to project their Scottish identity as part of political discourse, then we might expect greater use of Scottish variants in referendum tweets than in non-referendum tweets. However, previous studies have suggested that non-standard and local variants are used less frequently in tweets containing hashtags, which typically have a larger audience than other tweets (Pavalanathan and Eisenstein, 2015). This effect would predict the opposite result, namely a lower use of Scottish variants in tweets with referendum hashtags, and indeed this is the result we found. So it appears that although pro-independence users do make greater use of Scottish variants overall, they do not increase their Scottish usage when engaging in broad-audience political discourse.
To summarize, the contributions of our paper are: (1) The first large-scale study of dialect variation on Twitter in the UK. We show that in addition to using Scots in speech and some literary genres such as poetry, people are using Scots in informal public writing. The data-driven approach enables us to identify Scotland-specific lexical items without relying on pre-conceived notions of which variables to look for (cf. Tatman, 2015), and reveals that in addition to using attested Scots vocabulary, Twitter users appear to be creatively adapting to the medium with their use of acronyms for distinctly Scottish turns of phrase.
(2) The first study connecting sociolinguistic variables to political stance using social media data, showing that pro-independence users have a higher rate of Scottish usage. (3) Further evidence of Pavalanathan and Eisenstein's (2015) claim that Twitter users modulate their language according to the audience, with local variants being less likely in tweets directed to larger audiences.

Context
'Scots' refers to the group of dialects historically spoken in the Lowlands of Scotland. While Scots has Anglo-Scandinavian origins in common with English, by the 16th century its pronunciation, vocabulary, and literary norms had considerably diverged from those of English, and Scots had become established as the prestige language in Scotland (Kay, 1988). 2 However, following the Union of Crowns in 1603, when King James VI of Scotland acceded to the thrones of England and Ireland, he and his court began to adopt English norms in their writing. After the Union of Parliaments in 1707, English firmly replaced Scots as the language of serious or elevated discourse in Scotland (Grant, 1931). While some people still use distinctive elements of Scots in their speech, until recently the average Scottish person's exposure to written Scots would have been largely confined to a select few literary domains such as poetry and comic narrative (Corbett et al., 2003). However, social media has given rise to a new genre of casual, communicative writing that is potentially visible to large and diverse audiences, providing both a platform and an impetus to express one's identity through the use of written language. Below, we provide three example tweets (each from a different user) which contain orthographic representations of Scots vocabulary and/or Scottish English pronunciation. Standard English variants of Scottish terms are provided in italics.
(3) #fuckoffscotland hud on we will fuck off but afore we dae eh challenge ye tae a square go ya queen loving DIDDY doughnut Sasijs YUP-TAE
    #fuckoffscotland hold on we will fuck off but before we do I challenge you to a fair fight you queen loving fools. What are you doing!?

Data
Our data was drawn from the Sample endpoint of Twitter's Streaming API (a.k.a. the 'Spritzer'), which provides a random 1% sample of all public tweets in near real-time. We started with all tweets streamed from the Spritzer between 1st September 2013 and 30th September 2014. These dates cover a year of activity leading up to the referendum, as well as the day the vote took place (18 September 2014), and immediate reactions. We used a language classifier (Lui and Baldwin, 2012) to filter out non-English tweets, yielding an initial dataset of 629,431,509 tweets. 3 Because we are interested in the linguistic choices that individuals make in various contexts, we took steps to remove tweets which were not originally authored by the individual who posted them. Retweets (tweets which are verbatim copies of other tweets) were identified by a case-insensitive search for the token 'RT', and discarded. Quote tweets (tweets which contain verbatim copies of other tweets, but are augmented with original comments) were dealt with by discarding any text between double quotation marks, but retaining the remainder of the tweet. From this initial dataset we extracted three overlapping subsets: The Geotagged-UK (GU) dataset contains all tweets geotagged to a location in the United Kingdom (1,654,204 tweets by 446,923 distinct users).
The Geotagged-Scotland (GS) dataset contains all tweets geotagged to a location in Scotland (166,992 tweets by 40,861 distinct users).
The Indyref Tweets (IT) dataset consists of tweets containing hashtags relating to the 2014 Scottish Independence Referendum.
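The retweet and quote-tweet handling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `clean_tweet` is hypothetical, and it assumes that "the token 'RT'" means a standalone word and that quoted material may be wrapped in either straight or curly double quotation marks.

```python
import re

# Assumption: a "token" is a standalone word, so 'art' or 'cart' do not match.
RT_TOKEN = re.compile(r"\bRT\b", re.IGNORECASE)
# Assumption: quoted spans use straight or curly double quotation marks.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def clean_tweet(text):
    """Return None for retweets; otherwise the text with quoted spans removed."""
    if RT_TOKEN.search(text):
        return None  # retweet: discard the whole tweet
    stripped = QUOTED.sub(" ", text)  # quote tweet: drop only the quoted part
    return " ".join(stripped.split())  # normalise whitespace
```

A case-insensitive match on `RT` will also discard original tweets that happen to contain the word "rt", so a filter like this trades some recall for simplicity.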
To construct the IT dataset, we first created a list of relevant hashtags, starting with the following five seed hashtags: #IndyRef, #VoteYes, #VoteNo, #YesScotland, #BetterTogether. 4 For each of these five seeds, we extracted from our initial filtered dataset a list of all tweets by any user who used the seed hashtag. We identified the 100 most frequent hashtags in each of these five lists of tweets, and manually discarded all hashtags which were unrelated to the referendum, as well as those which were highly ambiguous (e.g., #Indy, which sometimes refers to the referendum, but also commonly refers to a genre of music). The resulting list of referendum-related hashtags is given in Table 1.
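The candidate-generation step for one seed hashtag might look like the sketch below. The function name `candidate_hashtags`, the `(user_id, text)` input format, and the lower-casing of hashtags are assumptions made for illustration; the manual pruning of unrelated and ambiguous hashtags described above is not automated here.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def candidate_hashtags(tweets, seed, top_n=100):
    """Most frequent hashtags in all tweets by users who used `seed`.

    tweets: iterable of (user_id, text) pairs (hypothetical input format).
    Hashtags are compared case-insensitively (an assumption).
    """
    tweets = list(tweets)
    seed = seed.lower()
    # Users who used the seed hashtag at least once.
    seed_users = {
        user for user, text in tweets
        if seed in {h.lower() for h in HASHTAG.findall(text)}
    }
    # Count every hashtag in every tweet by those users.
    counts = Counter(
        h.lower()
        for user, text in tweets if user in seed_users
        for h in HASHTAG.findall(text)
    )
    return [h for h, _ in counts.most_common(top_n)]
```

Running this once per seed and taking the union of the five lists gives the pool that would then be filtered by hand.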
Next, we extracted all tweets from our initial dataset which contain at least one of the hashtags on this list, yielding 77,708 tweets by 26,019 distinct users. We then applied a heuristic to filter out tweets produced by bots and spammers: for each user in the IT dataset for whom we had at least 5 tweets in the initial dataset, we computed the proportion of their tweets that contain URLs, and discarded users for whom this proportion was in the 90th percentile. This step filtered out 11,443 tweets by 1,389 users.
3 Even tweets such as example (3) in §2 are assigned a very high probability of being English by the filter. Perhaps other tweets with many Scottish terms were filtered out, in which case we will underestimate the probability that users choose Scottish variants. However, this issue should not cause us to find differences in use between different groups where there are none.
4 'Yes Scotland' and 'Better Together' are the names of the principal organisations representing the Yes and No vote campaigns, respectively.
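One way to read the bot and spammer heuristic is sketched below. The text does not say how "in the 90th percentile" was computed, so this sketch assumes a nearest-rank percentile and discards users at or above that value; the function name `url_heavy_users`, the input format, and the crude `"http" in text` URL check are likewise illustrative assumptions.

```python
def url_heavy_users(tweets_by_user, min_tweets=5, pct=90):
    """Return the set of users whose URL proportion is at or above the
    pct-th percentile (nearest-rank; an assumed reading of the heuristic).

    tweets_by_user: dict mapping user_id -> list of tweet texts.
    Users with fewer than `min_tweets` tweets are never flagged.
    """
    props = {
        user: sum("http" in t for t in texts) / len(texts)
        for user, texts in tweets_by_user.items()
        if len(texts) >= min_tweets
    }
    if not props:
        return set()
    ranked = sorted(props.values())
    # Nearest-rank percentile using integer arithmetic: ceil(pct * n / 100).
    k = (pct * len(ranked) + 99) // 100 - 1
    threshold = ranked[k]
    return {user for user, p in props.items() if p >= threshold}
```

If most users never post URLs, the threshold can be zero, in which case this reading would flag everyone; a production filter would need an explicit guard for that case.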
Note that seven of the hashtags in Table 1 (#voteyes, #bettertogether, #nothanks, #voteno, #yes2014, #letsstaytogether, and #yesvote) are occasionally used in contexts unrelated to the Scottish Independence Referendum (e.g. #bettertogether can also refer to interpersonal relationships). However, they are distinctive enough that if a user has also used hashtags which are unambiguously related to the referendum, then it seems reasonable to assume that their usage of these potentially ambiguous hashtags relates to the referendum too. Therefore, in order for a tweet containing one of these seven hashtags to be retained in the Indyref dataset, we required that its author had also used at least one other hashtag from Table 1. This criterion filtered out a further 6,601 tweets by 6,041 distinct users, such that the final IT dataset contains 59,664 tweets by 18,589 distinct users.

Identifying distinctively Scottish vocabulary on Twitter
We wish to identify terms that are more likely to be used by Twitter users in Scotland than in the rest of the UK. We follow the method of Pavalanathan and Eisenstein (2015), who used the Sparse Additive Generative Model of Text (SAGE) framework (Eisenstein et al., 2011) to identify tweet terms associated with metropolitan areas in the United States. SAGE models deviations in the log-frequencies of terms in a corpus of interest (here, the GS dataset) with respect to their log-frequencies in some "background" corpus (here, the GU dataset). The estimated deviations are regularized to avoid overstating the importance of deviations in the frequencies of rare words. Here, we use a publicly available implementation of SAGE 5 to obtain log-frequency deviation estimates for all terms which occur at least fifty times in the GU dataset, excluding hashtags, mentions, URLs, and stopwords. The terms with the highest estimates are those which are most distinctive to tweets geo-located in Scotland.
5 https://github.com/jacobeisenstein/jos-gender-2014/
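To make the idea of a log-frequency deviation concrete, the sketch below computes a much simpler quantity than SAGE: an additively smoothed log-ratio of a term's relative frequency in a target corpus to its relative frequency in a background corpus. It is not the SAGE model and has no sparsity-inducing regularization; the function name and the smoothing constant `alpha` are assumptions for illustration only.

```python
import math
from collections import Counter


def log_frequency_deviations(target_tokens, background_tokens,
                             min_count=50, alpha=1.0):
    """Smoothed log-ratio of term frequencies (a stand-in for SAGE, not SAGE).

    Only terms occurring at least `min_count` times in the background
    corpus are scored, mirroring the frequency threshold in the text.
    """
    tgt = Counter(target_tokens)
    bg = Counter(background_tokens)
    n_tgt, n_bg, vocab_size = sum(tgt.values()), sum(bg.values()), len(bg)
    deviations = {}
    for term, count in bg.items():
        if count < min_count:
            continue
        # Additive smoothing keeps the log finite for terms absent from the target.
        p_tgt = (tgt[term] + alpha) / (n_tgt + alpha * vocab_size)
        p_bg = (count + alpha) / (n_bg + alpha * vocab_size)
        deviations[term] = math.log(p_tgt) - math.log(p_bg)
    return deviations
```

Sorting the returned dictionary by value in descending order gives a ranked list of terms over-represented in the target corpus, which is the kind of list that would then be inspected manually.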

Scotland-specific terms
Unsurprisingly, many of the Scotland-specific terms are proper nouns which are topically associated with Scotland, such as Scottish placenames, political figures, and sports personalities. There are also several common nouns (e.g. 'devolution', 'bagpipes') and verbs (e.g. 'canvass', 'invade') which are strongly associated with the political or cultural climate in Scotland. These terms occur with greater relative frequency in the GS dataset simply because their referents are discussed with greater relative frequency; not because they are distinct from the terms that people in the rest of the UK use to index those referents. However, there are also many terms with high log-frequency deviations that are linguistically distinctive. To isolate such terms, we began with the 400 terms with the highest estimated deviations, and then manually filtered this list, discarding Standard English words, proper nouns, numerals, and non-standard terms which had clear topical associations (e.g. 'devo': an abbreviation for 'devolution'; 'hh': an acronym for 'Hail Hail', a football chant used by supporters of Celtic F.C.). The remaining 113 distinctively Scottish terms are listed in Table 2. Almost three fourths of these terms are attested in the Scottish National Dictionary (SND) (Grant and Murison, 1931) or its online supplement (Scottish Language Dictionaries, 2004), which catalogue words that are distinctive to Scots (i.e. those which are not used, or are used differently, in Standard English), covering the period from the 1700s up to the present day. Many are also attested in the Dictionary of the Older Scottish Tongue (Aitken et al., 1990), which catalogues the entire vocabulary of Scots from the 1100s to the late 1600s. Of the attested Scots words, some are unique to Scots, e.g. BAIRNS ('sons/daughters'), GREETIN ('weeping'); some are cognates with English words that have fallen out of common usage, e.g. CRABBIT ('crabbed'; 'ill-tempered'), FEART ('feared'; 'frightened/timid'); some are cognates with English words but have a wider range of senses, e.g. HUNNERS is cognate with 'hundreds', but used more generally to mean 'lots' as in "love you hunners", "there was hunners to do"; and many differ only in form from their English cognates, e.g. AFF ('off') and BAW ('ball').

Lexical variables
Our goal is to measure the rate at which people index their Scottishness (either consciously or subconsciously) through the use of distinctively Scottish words, and to find out whether this rate varies across different groups of users (Yes hashtag users vs. No hashtag users), or across different contexts (tweets which contain referendum-related hashtags vs. tweets that don't).
Were we to directly compare the frequencies of our Scottish terms across different sets of tweets, it would be difficult to untangle differences in the rate at which users are indexing the referents of those terms from differences in the rate at which they are indexing their Scottishness. For example, if people use the term MASEL ('myself') with a lower frequency in one context than in another, this could be because they are modulating their use of distinctively Scottish terms in response to the context, but it could also be because they are modulating the rate at which they talk about themselves. To avoid this confound, we instead compare the conditional probabilities with which Scottish terms are used, given that their referents are being indexed at all.
6 While 'bevy' is also used colloquially for 'beverage' in other parts of the UK, in Scotland it is more frequent and can additionally be used as a mass noun ("I had so much bevy I couldn't even carry it"), and as a verb ("I'd bevy with him every weekend").
We therefore consider only those Scottish terms for which we can identify semantically equivalent Standard English variants. We require that each variant of a given variable indexes the same set of senses and can occur in the same set of contexts, so for example we do not include YOUS as a variant of YOU, since while Scottish YI and Standard English YOU can index both the singular and plural second person pronouns, YOUS is only used for the plural. We also did not include variants of YES and NO since their use could be influenced by campaign slogans (e.g., the hashtags #VoteAye and #JustSayNaw). Our variables are listed in Table 3.

Study 1: Scotland-specific vocabulary usage on either side of the debate
Do tweeters who use Yes hashtags use Scottish variants at a higher rate than tweeters who use No hashtags, either when using these hashtags, or in general?

Method
We assign users in the IT dataset to two groups, Yes and No, based on the quantity $n_{u,\text{yes}} / (n_{u,\text{yes}} + n_{u,\text{no}})$, where $n_{u,\text{yes}}$ is the number of tweets in which user $u$ has used at least one of the Yes hashtags and none of the No hashtags in Table 1; and $n_{u,\text{no}}$ is the number of tweets in which $u$ has used at least one No hashtag and none of the Yes hashtags. The Yes group consists of all users for whom this quantity is greater than or equal to 0.75, while the No group consists of all users for whom it is less than or equal to 0.25. Users for whom the value lies between 0.25 and 0.75 (as well as those for whom our dataset does not contain any tweets with Yes or No hashtags), are not assigned to either group. The Yes group contains 4,513 users, while the No group contains 1,356 users, which is consistent with the general perception at the time that the Yes campaign was much more vocal than the No campaign.
To test our hypothesis that the probability of choosing Scottish variants is, on average, greater for users in the Yes group than for users in the No group, we estimate the difference between the two groups in the average probability of choosing Scottish variants, and conduct a permutation test to approximate the distribution of this difference under the null hypothesis. We first test whether the Yes group are more likely than the No group to use Scottish variants in tweets which contain hashtags that indicate a stance on the referendum. Subsequently, we test whether the Yes group are more likely than the No group to use Scottish variants in general across all of their tweets.
Table 2: Scotland-specific vocabulary. Standard English equivalents of many words are shown in Table 3.
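The grouping rule in this section can be written down directly. In the sketch below the function names and the representation of each tweet as a set of hashtags are assumptions; the 0.75 and 0.25 thresholds are those given in the text.

```python
def count_polarised(tweets_hashtags, yes_tags, no_tags):
    """Count a user's Yes-only and No-only tweets.

    tweets_hashtags: list of sets of hashtags, one set per tweet
    (hypothetical input format). Tweets mixing Yes and No hashtags
    count towards neither total.
    """
    n_yes = sum(1 for tags in tweets_hashtags
                if tags & yes_tags and not tags & no_tags)
    n_no = sum(1 for tags in tweets_hashtags
               if tags & no_tags and not tags & yes_tags)
    return n_yes, n_no


def assign_group(n_yes, n_no, upper=0.75, lower=0.25):
    """Return 'yes', 'no', or None (unassigned) for one user."""
    total = n_yes + n_no
    if total == 0:
        return None  # no polarised tweets at all
    ratio = n_yes / total
    if ratio >= upper:
        return "yes"
    if ratio <= lower:
        return "no"
    return None
```

Because both thresholds are inclusive, a user with three Yes-only tweets and one No-only tweet (ratio 0.75) is still assigned to the Yes group.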

Test statistic
Let $U_g$ be the set of all users in group $g \in \{\text{yes}, \text{no}\}$ who have used at least one of the variables in Table 3. For a given user $u \in U_g$, let $V$ be the set of all variables that $u$ has used in at least one tweet. We estimate the probability of user $u$ choosing a Scottish variant of variable $v \in V$ as $\hat{p}_{u,v} = n_{u,v_{\text{scot}}} / n_{u,v}$, where $n_{u,v_{\text{scot}}}$ is the token count of Scottish variants of $v$ in user $u$'s tweets, and $n_{u,v}$ is the token count of all variants of $v$ in user $u$'s tweets. Averaging across variables, we obtain $\hat{p}_u = \frac{1}{|V|} \sum_{v \in V} \hat{p}_{u,v}$. We then average across users to obtain the group mean, $\hat{p}_g = \frac{1}{|U_g|} \sum_{u \in U_g} \hat{p}_u$. Our test statistic is the difference between the two group means, $d = \hat{p}_{\text{yes}} - \hat{p}_{\text{no}}$.

Permutation test
We randomly shuffle users between the two groups (maintaining each group's original number of users), and re-compute the value of $d$ using these permuted groups. We repeat this procedure 100,000 times in order to approximate the distribution of differences in group means that would be observable were the difference independent of the assignment of users to groups. The proportion of permuted differences which are greater than or equal to the observed difference between the original group means provides an approximate p-value.
Table 4: Number of users and tweets included per group in the two analyses in Study 1.
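The permutation procedure can be sketched in a few lines, assuming the per-user rates have already been computed. The function name, the fixed random seed, and the small floating-point tolerance are assumptions added for reproducibility and numerical safety; they are not described in the text.

```python
import random


def permutation_p_value(yes_rates, no_rates, n_perm=100_000, seed=0):
    """One-sided permutation test on the difference in group means.

    yes_rates / no_rates: lists of per-user probabilities.
    Returns (observed difference, approximate p-value).
    """
    def mean(xs):
        return sum(xs) / len(xs)

    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    observed = mean(yes_rates) - mean(no_rates)
    pooled = list(yes_rates) + list(no_rates)
    n_yes = len(yes_rates)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign users to groups, keeping group sizes
        d = mean(pooled[:n_yes]) - mean(pooled[n_yes:])
        # Tolerance guards against rounding differences from summation order.
        if d >= observed - 1e-12:
            hits += 1
    return observed, hits / n_perm
```

The test is one-sided because the hypothesis is directional: only permuted differences at least as large as the observed one count against the null.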

Results
For a tweet to be included in the analysis, it must contain at least one of the variables in Table 3. Hence not all users contribute data to the test statistic, as some have not used any of the variables in their tweets. The number of tweets and users included in each analysis are shown in Table 4.
The results for the first analysis are shown in the left column of Table 5. The difference between the two groups in their average probability of choosing Scottish variants in tweets that contain polarised referendum hashtags is statistically significant (p < 0.002). Results for the second analysis are shown in the right column of Table 5. Once again, the difference between the two groups is statistically significant (p < 0.001).

Discussion
The results show that the Yes group do use Scottish variants at a significantly higher rate than the No group, both when using Yes or No hashtags, and in general. The stronger significance level for the 'All tweets' dataset is partly due to its larger size (see Table 4), which enables better estimates of the usage rates. While the rates are very low overall, the relative differences are large: the Yes group rate is more than three times the No group rate when we include only tweets with Yes or No hashtags, and approximately twice as high when we include all tweets. The higher rates in the 'All tweets' dataset suggest that both groups of users chose Scottish variants less often when discussing the referendum than in their other tweets. However, the test we used does not provide a significance value for the difference in usage rates across the two datasets. To establish whether users do modulate their usage of Scottish variants when discussing the referendum, we will need a more careful paired design.
Table 5: Results of the two analyses in Study 1.
Study 2: Effects of topic and audience on Scotland-specific vocabulary usage
Do tweeters choose Scottish variants at a different rate when using referendum-related hashtags than in their other tweets?

Method
We need a statistic that corrects for the fact that some variables might have higher rates of Scottish variants than others. For example, if users tend to produce Scottish variants of variable $v_1$ at a higher rate than for $v_2$, and use $v_1$ more in tweets that don't contain referendum-related hashtags, then it could appear that users are suppressing their Scottish usage in referendum-related tweets when in fact this is a lexical effect. Let $U$ be the set of all users who have used at least one of the variables in Table 3 in both a tweet that contains a referendum-related hashtag (i.e. a tweet that belongs to the IT dataset, referred to hereafter as an Indyref tweet) and in a tweet that does not contain a referendum-related hashtag (referred to hereafter as a Control tweet). For a given user $u \in U$, let $V$ be the set of all variables that $u$ has used in at least one Indyref tweet, and in at least one Control tweet. Let $\hat{p}_{I,v}$ for user $u$ be the estimated probability that $u$ chooses a Scottish variant of variable $v \in V$, conditioned on the fact that she is using variable $v$ in an Indyref tweet. Analogously, let $\hat{p}_{C,v}$ be the estimated probability that $u$ chooses a Scottish variant of variable $v$, conditioned on the fact that she is using variable $v$ in a Control tweet. The difference in user $u$'s probability of choosing a Scottish variant of variable $v$ in an Indyref tweet and in a Control tweet is then $d_{u,v} = \hat{p}_{I,v} - \hat{p}_{C,v}$, and averaging over the variables in $V$ gives the per-user difference $d_u = \frac{1}{|V|} \sum_{v \in V} d_{u,v}$. The null hypothesis is that on average, users are no more or less likely to choose Scottish variants in Indyref tweets than in Control tweets. Therefore, under the null hypothesis, the mean value of $d_u$ across all users, $\bar{d}_u = \frac{1}{|U|} \sum_{u \in U} d_u$, would be zero. We perform a one-sample t-test to determine whether $\bar{d}_u$ is significantly different from zero.
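The paired design can be illustrated as follows. The sketch assumes each user's Indyref and Control counts are given as dictionaries mapping a variable to (Scottish token count, total token count), takes the difference as Indyref minus Control so that a negative value means fewer Scottish variants in Indyref tweets, and computes only the t statistic and degrees of freedom by hand; obtaining a p-value would additionally require the t distribution (for example from a statistics library). All names are hypothetical.

```python
import math
import statistics


def user_difference(indyref, control):
    """Mean over shared variables of p_Indyref - p_Control for one user.

    indyref / control: dict variable -> (n_scot, n_total).
    Only variables used in both contexts contribute; returns None if
    the user has no such variable.
    """
    shared = [v for v in indyref
              if v in control and indyref[v][1] > 0 and control[v][1] > 0]
    if not shared:
        return None
    diffs = [indyref[v][0] / indyref[v][1] - control[v][0] / control[v][1]
             for v in shared]
    return sum(diffs) / len(diffs)


def one_sample_t(values):
    """t statistic and degrees of freedom for H0: population mean is zero.

    Requires at least two values with non-zero sample standard deviation.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n)), n - 1
```

Restricting each user's comparison to variables they used in both contexts is what removes the lexical confound described above: every difference compares the same variable with itself.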
We use this method to conduct two separate analyses. In the first analysis, our pool of Control tweets is the set of all tweets from the original filtered dataset that do not contain any of the hashtags in Table 1. In the second analysis, we limit our pool of Control tweets to those which do not contain any of the hashtags from Table 1, but do contain at least one other hashtag. This second analysis is designed to test whether the recent finding that US Twitter users are less likely to use regionally-specific words in tweets which contain hashtags (Pavalanathan and Eisenstein, 2015) applies to Scottish users as well.

Results
The number of tweets and users that were included in each analysis are shown in Table 6.
Results for the first analysis are shown in the left column of Table 7. The difference is statistically significant (p < 0.01), indicating that on average, individuals are less likely to choose Scottish variants when using referendum-related hashtags than in their other tweets. Results for the second analysis are shown in the right column of Table 7. In this case, the difference is not statistically significant.

Discussion
In light of (a) the apparent relationship between national identity and constitutional preference, (b) the history of Scots as the prestige language of a previously-independent Scotland, supplanted by English in large part due to the birth of the United Kingdom, and (c) the results of Study 1, which indicate that pro-independence users choose Scottish variants at a significantly higher rate than anti-independence users, it may at first appear surprising that people are less likely to choose Scottish variants in tweets containing referendum-related hashtags than in their other tweets. It is conceivable that Yes users increase their rate of Scottish variants in Indyref tweets whilst No users decrease it, such that their effects cancel out; but since Yes users are more prolific in the IT dataset, if anything we would expect this imbalance to push the overall effect in the positive direction. The fact that we see a significant negative effect in spite of the greater number of Yes tweets means we can be reasonably confident that even if Yes users aren't significantly reducing their usage of Scottish variants in Indyref tweets, they certainly aren't increasing it.
It is also worth noting that we did not exhaustively identify every hashtag that has been used in relation to the referendum, so inevitably there will be some tweets with referendum-related hashtags in the Control set (such as example tweet (3) in §2), and there may also be some non-referendum tweets in the Indyref set. However, if anything this would dilute any differences between the two lists, yet we still find an effect.
The fact that this effect does not reach significance when we remove Control tweets without hashtags suggests that the primary reason users are reducing their rate of Scottish variants in Indyref tweets is not because of the topic under discussion, but because the use of hashtags broadens the potential audience. This explanation accords with Pavalanathan and Eisenstein's (2015) finding that amongst Twitter users in the US, non-standard and regional variants are less likely to be used in tweets that target larger audiences. Of course, it is possible that topic has an effect as well, but the present study does not provide evidence for that conclusion.

Conclusion
We presented the first large-scale study of distinctively Scottish language use on social media, showing that this use includes a mixture of traditional Scots vocabulary, newer Scottish slang, and alternative spellings that reflect Scottish pronunciation. We also studied how users' language might reflect their political views and discourse. We showed that Yes users use Scottish variants at a higher rate than No users, whether discussing the independence referendum or not. But overall, users tend to decrease their use of Scottish variants when discussing the referendum. This result suggests that although Yes users generally express a stronger Scottish linguistic identity than No users, they are not choosing to express this identity strongly in political discourse aimed at a broad audience. Due to the very low rates of Scottish variants overall, our data set is too small to study differences between individual variables or even conclusively say whether there may be effects of both topic and audience size on the use of Scottish language. However, we hope to be able to answer these questions in future by collecting a more complete set of data for the particular users studied here.