Tkol, Httt, and r/radiohead: High Affinity Terms in Reddit Communities

Language is an important marker of a cultural group, large or small. One aspect of language variation between communities is the employment of highly specialized terms with unique significance to the group. We study these high affinity terms across a wide variety of communities by leveraging the rich diversity of Reddit.com. We provide a systematic exploration of high affinity terms, the often rapid semantic shifts they undergo, and their relationship to subreddit characteristics across 2600 diverse subreddits. Our results show that high affinity terms are effective signals of loyal communities, that they undergo more semantic shift than low affinity terms, and that they are a partial barrier to entry for new users. We conclude that Reddit is a robust and valuable data source for testing further theories about high affinity terms across communities.


Introduction
The evolution and semantic change of human language have been studied extensively, both in a historical context (Garg et al., 2017) and, increasingly, in the online context (Jaidka et al., 2018). However, few studies have explored the evolution of words across different online communities in a way that allows a comparison between community characteristics and terms that have high affinity to a community.
The banning of r/CoonTown and r/fatpeoplehate in 2015, as analyzed by Saleem et al. (2017), provides good motivation for our work. r/CoonTown was a racist subreddit with a short life span of 8 months (November 2014 - June 2015) (Saleem et al., 2017).
During this time, as shown by Saleem et al. (2017), these subreddits underwent rapid semantic development through which new words, such as "dindu", "tbi" and "nuffin", were not only created but also became increasingly context-specific (accumulated in meaning). In r/fatpeoplehate, existing words such as "moo", "xxl" and "whale" underwent localized semantic shift such that their meanings transformed into derogatory terms (Saleem et al., 2017).
These two cases demonstrate that not only are new words conceived within subreddits, but existing words also undergo localized transition. They also suggest that this phenomenon likely takes place in a short time period for high affinity words. In order to evaluate whether such trends are consistent across subreddits, we study semantic shift and the roles high affinity terms play in 2600 different subreddits between November 2014 and June 2015.
Our aim is to provide a characterization of high affinity terms by mapping their relationship to different types of online communities and the semantic shifts they undergo in comparison to generalized terms (low affinity terms). We leverage data curated from the multi-community social network Reddit, and the subreddit characteristics we study are loyalty, dedication, number of users and number of comments. Our paper explores the following research questions:
1. Do certain community characteristics correlate with the presence of high affinity terms?
2. Do high affinity terms undergo greater semantic shift than low affinity terms?
3. Do high affinity terms and community characteristics function as barriers to entry for new users to participate?
Some key findings include:
1. Loyalty is strongly correlated with the presence of high affinity terms in a community.
2. High affinity terms undergo greater semantic shift than generalized terms (low affinity terms) in a short interval of time.
3. High affinity terms and the dedication value of a subreddit strongly correlate with the number of new users that participate, indicating that the degree of high affinity terms establishes a lexical barrier to entry to a community.
Related Work and Concepts

Understanding Community Specific Terms
Before defining high affinity terms, we examine the traits observed in community specific terms from past literature. Studies have shown that words specific to a community have qualities of cultural carriers (Goddard, 2015). While culture is "something learned, transmitted, passed down from one generation to the next, through human actions" (Duranti et al., 1997), these transmissions through language affect a culture's system of "classifications, specialized lexicons, metaphors, and reference forms" in communities (Cuza, 2011). Pierre Bourdieu argues that language is not only grammar and a systematic arrangement of words, but is also symbolic of cultural ideas for each community. To speak a certain language is to view the world in a particular way. To Bourdieu, through language people are members of a community of unique ideas and practices (Bourdieu et al., 1991). As such, community specific terms are usually not easily translatable across different communities. For example, in Hungarian "life" is metaphorically described as "life is a war" and "life is a compromise", whereas in American English "life" is metaphorically represented as "life is a precious possession" or "life is a game" (Kövecses, 2010). These definitions of similar entities vary due to different cultural outlooks in communities.
Besides words that are cultural carriers, slang is also a form of terminology specific to a community. While there is no standard operational definition of slang, many philosophical linguists define slang as terms that are vulgar (Green, 2016; Allens, 1993), that encapsulate local cultural value, and that form a type of insider speech rooted in subcultures (Partridge and Beale, 2002). Morphological properties of slang are defined as "extra-grammatical", and these properties are shown to be distinguishable from the morphological properties of standard words in English (Mattiello, 2013). There has been an increase of slang in online spaces (Eble, 2012), with many terms falling under the extra-grammatical classifications of abbreviations ('DIY', 'hmu', 'lol'), blends ('brangelina', 'brunch'), and clippings ('doc (doctor)', 'fam (family)') (Mattiello, 2013; Kulkarni and Wang, 2018).
By extracting terms that have a high affinity to a community, we approximate words that are either cultural carriers or slang.

Measuring Affinity of Terms
Measurements of the affinity of terms to a community have been explored in prior research, where the frequency of a word is compared to some background distribution to extract linguistic variations that are more likely in one setting (Monroe et al.; Zhang et al., 2018). Most helpful to our approach is a past study that computed a term's specificity $sp_c$ to a subreddit through the pointwise mutual information (PMI) of a word $w$ in one community $c$ relative to all other communities $C$ in Reddit.
An issue with this metric is that terms with equal specificity can differ in their frequency. Specificity does not show which term is more dominant within a community by frequency, as shown in Table 1. Due to this, we compute the affinity value of a term by measuring its locality and dominance to a community. Locality is the likelihood of a term belonging to some community, and dominance captures the presence of the term in the said community by its frequency.
We therefore calculate the locality $l$ of a word $w_j$ in subreddit $s_i$ through the conditional probability of the word occurring in $s_i$, relative to it occurring in all other subreddits $S$.
We then calculate dominance $d$ in two steps. First, we calculate an intermediate value $r$: the count of word $w_j$ in $s_i$ minus the sum of the counts of all terms $W$ in $s_i$ multiplied by a constant $\epsilon$, for which 0.0001 was sufficient in our work:

$$r(w_j, s_i) = \mathrm{count}(w_j, s_i) - \epsilon \sum_{w \in W} \mathrm{count}(w, s_i)$$

If the value of $r$ is negative, we disregard the word, as it is likely to be an infrequent word of little semantic significance, such as a typo. Then, we calculate the dominance $d$ as a negative hyperbolic function of each word's occurrence. Finally, we compute the affinity value $a$ of a word to a subreddit as the product of the word's dominance and locality:

$$a(w_j, s_i) = d(w_j, s_i) \cdot l(w_j, s_i)$$

After extracting the affinity value of each word relative to a subreddit, we partition the sets of words into high affinity terms and low affinity terms.
High Affinity Terms: For each subreddit, we extract the 50 terms with the highest affinity values, and we categorize them as high affinity terms. The average of the high affinity terms is denoted as the high affinity average.
Low Affinity Terms: For each subreddit, we extract the 50 terms with the lowest affinity values, and we categorize them as low affinity terms. The average of the low affinity terms is denoted as the low affinity average.
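The affinity pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `affinity_scores` and `partition_terms` are hypothetical, the locality term is assumed to be the word's conditional probability in the target subreddit normalised by the same probability summed over all subreddits, and the dominance term uses an assumed saturating hyperbolic form, r / (1 + r), as a stand-in for the dominance function described in words above.

```python
from collections import Counter

EPSILON = 0.0001  # constant reported in the text


def affinity_scores(counts_by_subreddit, target, epsilon=EPSILON):
    """Return {word: affinity} for the subreddit named `target`.

    counts_by_subreddit maps a subreddit name to a Counter of word counts.
    """
    totals = {s: sum(c.values()) for s, c in counts_by_subreddit.items()}
    target_counts = counts_by_subreddit[target]
    scores = {}
    for word, count in target_counts.items():
        # Intermediate value r: count minus epsilon times all term counts.
        r = count - epsilon * totals[target]
        if r < 0:
            continue  # very rare word (for example a typo): disregarded
        # ASSUMPTION: saturating hyperbolic stand-in for the dominance
        # function, which the text describes only in words.
        dominance = r / (1.0 + r)
        # ASSUMPTION: locality as P(word | target) normalised by the same
        # conditional probability summed over every subreddit.
        probs = {
            s: counts_by_subreddit[s][word] / totals[s]
            for s in counts_by_subreddit
            if totals[s]
        }
        locality = probs[target] / sum(probs.values())
        scores[word] = dominance * locality
    return scores


def partition_terms(scores, n=50):
    """Split words into the n highest and n lowest affinity terms."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n], ranked[-n:]
```

With toy counts such as `{"chess": Counter({"bxc": 40, "the": 1000, "game": 60}), "news": Counter({"the": 5000, "game": 20})}`, a term that appears only in one subreddit ("bxc") ranks above a frequent but widely shared term ("the"), which is the behaviour the combination of locality and dominance is meant to capture.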

How Semantic Shift Can Capture Cultural Shifts
As previously stated, high affinity terms are approximations for words that are either cultural carriers or slang.
Research has shown that shifts of local neighborhoods across embeddings are more effective in capturing cultural shifts than distances of a word across aligned embeddings, which are used to measure structural shifts (Hamilton et al., 2016; Eger and Mehler, 2016). Studies have represented the k-nearest neighbors $n$ of a word $w$ through second-order vectors $V^{|n|}$ that are made of the cosine similarities between $n$ and $w$, and then calculated the difference between these second-order vectors to identify shifts (Hamilton et al., 2016; Eger and Mehler, 2016). Recent works have also modeled shifts in words through the change in common neighbors across different embeddings (Wendlandt et al., 2018; Eger and Mehler, 2016).

Measuring Semantic Change
Our measurement of semantic shift is based on the concepts of semantic narrowing of words, a process in which words become more specialized to a context, and semantic broadening of words, a process in which words become more generalized from a context (Bloomfield, 1933; Blank and Koch, 1999). We capture this contextual information by constructing 300-dimensional word embeddings (word2vec) for each subreddit using the skip-gram with negative sampling algorithm, in which a distributional model is trained on words predicting their context words (Mikolov et al., 2013). For each word, we measure narrowing as an increase in co-occurrence of a word's nearest neighbors, and broadening as a decrease in co-occurrence of a word's nearest neighbors (Crowley and Bowern, 2010).
To measure semantic shift, we extract the common vocabulary $V = (w_1, ..., w_m)$ across all time intervals $t \in T$. Then, for some $t$ and $t + n$, we take a word $w_j$'s set of $k$ nearest neighbors (according to cosine similarity). These neighbor sets are denoted as $A^{t}_{k}(w_j)$ and $A^{t+n}_{k}(w_j)$. We then calculate the neighbors' co-occurrence value $CO$ as the Jaccard similarity of the neighbor sets (Hamilton et al., 2016) in subreddit $s_i$:

$$CO_{s_i}(w_j^{t}, w_j^{t+n}) = \frac{|A^{t}_{k}(w_j) \cap A^{t+n}_{k}(w_j)|}{|A^{t}_{k}(w_j) \cup A^{t+n}_{k}(w_j)|}$$

Then, we calculate the difference of $CO$ across successive embeddings in $T$. We label, chronologically, the first time interval ($t_1$) as the initial point and the last time interval ($t_p$) as the terminal point, across which narrowing and broadening are measured. We used $k = 10$ for all computations.
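The neighbor co-occurrence value can be illustrated with a short, self-contained Python sketch. It assumes that each embedding is available as a plain dictionary from word to vector; the function names are hypothetical, and a real pipeline would read vectors from trained word2vec models rather than hand-written tuples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, embedding, k=10):
    """Set of the k words most cosine-similar to `word`, excluding itself."""
    target = embedding[word]
    ranked = sorted(
        (other for other in embedding if other != word),
        key=lambda other: cosine(target, embedding[other]),
        reverse=True,
    )
    return set(ranked[:k])


def neighbor_cooccurrence(word, emb_t, emb_t_plus_n, k=10):
    """Jaccard similarity of a word's neighbor sets in two embeddings."""
    a = nearest_neighbors(word, emb_t, k)
    b = nearest_neighbors(word, emb_t_plus_n, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A value of 1 means the word kept exactly the same k neighbors between the two intervals, and a value of 0 means its neighborhood was entirely replaced.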
Broadening Measurement: We measure broadening as the sum of the differences of $CO_{s_i}$ between the initial point embedding and all successive embeddings. By comparing an embedding to its future embeddings, we are able to see which contexts are lost as a word's meaning becomes more broad.
Narrowing Measurement: Similarly, we measure narrowing by calculating the sum of the differences of $CO_{s_i}$ between the terminal point embedding and all previous embeddings.
By comparing an embedding to its previous embeddings, we are able to see which contexts associated with a word have increased in specificity over time.
A visual representation of the metrics is provided in Figure 1.
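One possible reading of the two measurements is sketched below in Python. The exact summation is an assumption: here broadening adds up, across successive later intervals, how much overlap with the initial-point neighbor set is lost, and narrowing adds up, across successive earlier intervals, how much overlap with the terminal-point neighbor set is gained. The function names are hypothetical, and the input is a chronological list of one word's neighbor sets.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def broadening(neighbor_sets):
    """Summed loss of overlap with the initial-point neighbor set.

    ASSUMPTION: differences are taken between successive later intervals.
    """
    first = neighbor_sets[0]
    overlaps = [jaccard(first, later) for later in neighbor_sets[1:]]
    return sum(prev - nxt for prev, nxt in zip(overlaps, overlaps[1:]))


def narrowing(neighbor_sets):
    """Summed gain of overlap with the terminal-point neighbor set.

    ASSUMPTION: differences are taken between successive earlier intervals.
    """
    last = neighbor_sets[-1]
    overlaps = [jaccard(last, earlier) for earlier in neighbor_sets[:-1]]
    return sum(nxt - prev for prev, nxt in zip(overlaps, overlaps[1:]))
```

Under this reading, a word whose neighbors never change scores 0 on both measures, while a word that steadily sheds its initial neighbors accumulates a positive broadening value.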

Extracting Rate of Change of Frequency
Many past works have also modelled relationships between frequency and semantic shift (Lancia, 2007; Hilpert and Gries, 2016; Lijffijt et al., 2014).
One study shows that an increase in the frequency of a term across decades results in semantic broadening, while a decrease in frequency causes it to narrow (Feltgen et al., 2017). For example, an increase in the frequency of the word "dog" evolved its meaning from a breed to an entire species, and the decrease in the frequency of "deer" localized its meaning from "animal" to a specific animal (undergoing narrowing) (Hilpert and Gries, 2016).
Very few studies have modelled narrowing and broadening of terms in the short term. As such, we are interested in the frequency patterns of terms that go through short-term cultural shifts. One study showed the effect of frequency on learning new words, and how it affects the use of new words in their correct context. The experiments were conducted in person with five-year-old children who were made familiar with new words over a short time period (Abbot-Smith and Tomasello, 2010). The results demonstrate that familiarizing children with new words allowed them to use the words in correct grammatical contexts, and that greater frequency of exposure to new words resulted in more narrowed and correct use of each word in a context. This pattern of teaching is categorized as lexically-based learning.
Accordingly, we assess whether, in the short term, the narrowing and broadening of terms in subreddits correlate with the rate of change of their frequencies.
We calculate the rate of change of frequency $\Delta f_{s_i}$ across time periods $T$ for a subreddit $s_i$, where $n$ is the size of $T$. A positive value shows an increasing rate of frequency, and a negative value shows a decreasing rate of frequency.
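As an illustration of the sign convention, the sketch below computes a rate of change as the mean difference between the frequencies of successive time periods. The averaging scheme and the function name `frequency_rate_of_change` are assumptions made for this example; only the interpretation of positive and negative values is taken from the text.

```python
def frequency_rate_of_change(frequencies):
    """Mean step between successive per-period frequencies.

    ASSUMPTION: the rate is the average of consecutive differences.
    `frequencies` is a chronological list with one value per time period.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two time periods")
    steps = [
        later - earlier
        for earlier, later in zip(frequencies, frequencies[1:])
    ]
    return sum(steps) / len(steps)
```

For a term whose relative frequency grows over four periods the value is positive, and for one that fades it is negative.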

Characteristics of Subreddits
We introduce four quantifiers that describe subreddit networks based on existing typology. Using these quantitative characteristics we can evaluate and identify systemic patterns that exist between types of subreddits and high affinity terms. The four quantifiers are loyalty, dedication, number of comments, and number of users.
Loyalty: Previous work on subreddit characteristics has defined community loyalty as the percentage of users that demonstrate both preference for and commitment to a community, over other communities, in multiple time periods. Preference is demonstrated by more than half of a user's comments contributing to subreddit $s_i \in S$, and commitment is measured by a user commenting in $s_i$ in multiple time periods $t \in T$. It has been shown that community-wide loyalty impacts the usage of linguistic features such as singular ("I") and plural ("We") pronouns. Communities with greater loyalty have a higher usage of plural pronouns than communities with low loyalty, which have a heavier usage of singular pronouns. Following this finding, we investigate relationships between loyal communities and high affinity terms, to gauge whether loyalty is also strongly correlated with the use of other types of terms.
Dedication: Other studies have also shown that user retention correlates with increased use of subreddit specific terms (similar to high affinity terms). We calculate user retention to measure a community characteristic similar to commitment as defined in a past study, by extracting users that comment in subreddit $s_i \in S$ a minimum of $n$ times across all time periods $t \in T$, and we label this retention value as dedication. A key difference between dedication and loyalty is that a user does not have to contribute more than 50% of their comments to a particular subreddit to be dedicated, which is a requirement for loyalty. This means that a user can be dedicated to multiple subreddits, while a user is loyal to only one group at a particular time. The comparison between loyalty and dedication allows us to explore whether preference is a strong factor in the linguistic evolution of high affinity terms in online communities.
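The difference between the two quantifiers can be made concrete with a small Python sketch. It assumes comments are available as `(user, subreddit, period)` tuples and that both values are expressed as a share of the subreddit's commenters; the function name and the default threshold `min_comments=3` are hypothetical, since the value of n is not stated here.

```python
from collections import Counter, defaultdict


def loyalty_and_dedication(comments, subreddit, min_comments=3):
    """Return (loyalty, dedication) as shares of the subreddit's users.

    comments: iterable of (user, subreddit, period) tuples.
    min_comments: ASSUMED value for the threshold n on comment counts.
    """
    total_by_user = Counter()
    in_sub_by_user = Counter()
    periods_by_user = defaultdict(set)
    for user, sub, period in comments:
        total_by_user[user] += 1
        if sub == subreddit:
            in_sub_by_user[user] += 1
            periods_by_user[user].add(period)

    users = list(in_sub_by_user)
    if not users:
        return 0.0, 0.0

    # Loyal: more than half of all comments here AND active in 2+ periods.
    loyal = sum(
        1
        for u in users
        if in_sub_by_user[u] > total_by_user[u] / 2
        and len(periods_by_user[u]) > 1
    )
    # Dedicated: at least min_comments comments here, no preference needed.
    dedicated = sum(1 for u in users if in_sub_by_user[u] >= min_comments)
    return loyal / len(users), dedicated / len(users)
```

A user who comments heavily in several subreddits can count toward dedication in each of them but toward loyalty in at most one, which is the distinction drawn above.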

Number of Comments and Number of Users: Lastly, we measure raw metadata of subreddits, namely the number of comments made and the number of users that participated in a subreddit.
Existing work has shown that areas with large populations experience a larger introduction of new words, whereas areas with small populations experience a greater rate of word loss (Bromham et al., 2015). Furthermore, words in larger populations are subject to greater language evolution. While this is a correlation found in physical communities, we assess whether it remains consistent in online communities. As a proxy for population we consider both the number of users and the number of comments.

Description of Data
Our dataset consists of all subreddits with more than 10000 comments between November 2014 and June 2015. We performed our measures on the curated data in time intervals of 2 months. We manually removed communities that are mostly non-ASCII or run by bots. This resulted in a dataset of 2626 subreddits.

Qualitative Overview of High Affinity and Low Affinity Terms
We examine high and low affinity terms across subreddits. Our results, as shown in Table 2, demonstrate that high affinity terms have different characteristics across communities. Certain high affinity terms exist independent of online communications. For example, in r/chess, the high affinity terms "bxc", "bxf" and "nxe" are all notations used to communicate game moves. Similarly, in r/bravefrontier, the terms "zelnite" and "darvanshel" are game characters. However, in r/fatpeoplehate, there are high affinity terms that originate online. Terms in r/fatpeoplehate demonstrate extra-grammatical qualities of slang, such as "fatkini", which is a blend of "fat" + "bikini", and "feedee", which is a clipping of "feeder", signalling word development through online communication (Kulkarni and Wang, 2018).

Figure 2: This figure shows the relationship between community characteristics and high affinity averages. Each community characteristic is binned into intervals of 20% by percentile. Loyalty most strongly correlates with high affinity averages.

Interestingly, across the topically different subreddits, abbreviations are a common form of high affinity term. For instance, "pgn" in r/chess stands for "portable game notation", "tkol" in r/radiohead stands for "The King of Limbs", and "ftfy" in r/gif stands for "fixed that for you". The use of abbreviations illustrates the transformation of "gibberish" into collective meaning within a community. It is only with the context of domain and culture that one can attribute meaning to these terms.
Named Entity Recognition (NER) of Top 100 and Bottom 100 subreddits by high affinity averages: We performed NER using Babelfy on the names of the top 100 and bottom 100 subreddits by high affinity averages. Through this analysis, we observe that 82 of the top 100 are named entities, whereas only 18 of the bottom 100 are named entities. Of the 82, 33 subreddits are videogames, 19 are regional subreddits and 11 are sports subreddits. This shows that communities with high affinity averages are likely to be strongly linked with a physical counterpart. In contrast, the bottom 100 subreddits consist of discussion and generalized subreddits such as r/TrueReddit, r/Showerthoughts and r/blackpeoplegifs, whose creation and culture can be attributed directly to online communities rather than physical counterparts. This provides an explanation for subreddits with low high affinity averages having extremely generalized high affinity terms, such as "helmet" and "shoe" in r/Wellthatsucks.

Impact of Community Characteristics on Affinity of Terms
We conducted prediction tasks using community characteristics to demonstrate meaningful relationships with high affinity terms. We treated each of the community characteristics as a feature (log-transformed) and performed linear regressions, with five-fold cross-validation, to predict the high affinity average (log-transformed) of a subreddit.
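The prediction setup can be sketched as follows. This is a minimal, dependency-free illustration for a single feature, assuming an ordinary least squares fit, a shuffled five-way split, and an $R^2$ computed over the pooled out-of-fold predictions; the function names, the seed, and the pooling choice are assumptions for this example and not details stated in the text.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def cross_validated_r2(feature, target, folds=5, seed=0):
    """R^2 of out-of-fold predictions on log-transformed values."""
    xs = [math.log(v) for v in feature]
    ys = [math.log(v) for v in target]
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    predictions = [0.0] * len(xs)
    for fold in range(folds):
        held_out = set(order[fold::folds])  # every folds-th shuffled index
        train = [i for i in order if i not in held_out]
        slope, intercept = fit_line(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        for i in held_out:
            predictions[i] = slope * xs[i] + intercept
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```

Calling `cross_validated_r2(loyalty_values, high_affinity_averages)` with two equal-length lists of positive numbers would return the held-out coefficient of determination for that single characteristic; both argument names here are placeholders.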

Prediction of High Affinity Terms from Community Characteristics
We find that the loyalty of a subreddit is remarkably correlated with the high affinity average of subreddits. A linear model trained on loyalty to predict the high affinity average of a subreddit achieves an $R^2$ of 0.364 (p-value < 0.001). Compared to this, a linear model trained on dedication results in an $R^2$ value of 0.274 (p-value < 0.001). This implies that preference is a strong factor in the likelihood of high affinity terms in communities.
In contrast, models trained on number of comments and number of users resulted in $R^2$ values of 0.071 and 0.004, respectively. Loyalty is therefore a much more effective measure than most standard community measures, at least when measured on a linear scale. This finding supports existing work, which shows that the distinctiveness of a community is strongly related to its rate of user retention.

Table 3: Coefficient of determination values for linear models trained on community characteristics that predict semantic narrowing (left) and semantic broadening (right) of high affinity terms.

Prediction of Low Affinity Terms from Community Characteristics
Although low affinity terms for almost all subreddits have values that are very close to 0, we find that raw subreddit metadata (log-transformed) is an effective predictor of the low affinity average (log-transformed). A linear model trained on number of comments results in an $R^2$ of 0.456 (p-value < 0.001). This makes sense intuitively, because as the number of comments increases, low affinity terms are more likely to be generalized.
A similar model trained on number of users attains an $R^2$ of 0.180, with a model trained on loyalty performing the worst with an $R^2$ of 0.055.
Finally, as we might expect, a multivariate regression model trained on both loyalty and number of comments performs the best out of all models, scoring an $R^2$ of 0.391 (p-value < 0.001) when predicting high affinity averages and an $R^2$ of 0.470 (p-value < 0.001) when predicting low affinity averages, which are significant improvements.

Assessing Semantic Shift of High Affinity Terms
Calculating semantic shifts of high affinity terms enables us to test whether high affinity terms are subject to cultural shifts and whether linguistic developments in online spaces are consistent with trends in physical communities.

Evaluating Semantic Shift to Community Characteristics
We perform linear regression between community characteristics and semantic shifts to assess their relationships. Our results show that all community characteristics are weak predictors of semantic shifts. This is surprising, as they are effective predictors of affinity values.
Semantic Narrowing and Semantic Broadening: Table 3 shows that number of comments has the strongest correlation with semantic narrowing and semantic broadening of high affinity terms, achieving $R^2$ values of 0.037 and 0.019 (p-value < 0.001). In contrast, while loyalty and dedication have similarly high $R^2$ values when used for modeling semantic narrowing of high affinity terms, as shown in Table 3, they are more weakly linked to the semantic broadening of high affinity terms.
Perhaps the most surprising finding is that number of users is a poor predictor of both semantic narrowing and semantic broadening ($R^2$ of 0.004 and 0.001) in online spaces. This is surprising because number of users and number of comments are highly correlated features (Pearson coefficient of 0.726, p-value < 0.001), yet their performance in approximating semantic shifts differs widely.
These results provide insight into how the concept of "population" works in online spaces in contrast to physical communities. Previous works show a weak correlation between the population of a geographic area and the occurrence of language evolution (Bromham et al., 2015; Greenhill et al., 2018). A limitation of these studies was their inability to account for language output that was not written (i.e., oral communications). This limitation is not present in online communities because all language output is recorded via online comments. As such, the number of comments having a stronger correlation with semantic shift than the number of users indicates that the amount of oral communication may have contributed to language evolution.

Comparing Semantic Shift in High Affinity and Low Affinity Terms
First, we compute a metric that shows the overall semantic shift a subreddit has experienced. This is measured by computing the difference between semantic narrowing and semantic broadening, where a positive value indicates overall narrowing and a negative value indicates overall broadening. We label this result as net semantic shift. Then we compute net semantic shift for high affinity terms and low affinity terms for all subreddits. We find that out of 2626 subreddits, 1638 (62%) demonstrate a positive net semantic shift in high affinity terms, whereas 1529 (58%) demonstrate a positive net semantic shift in low affinity terms.
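The net semantic shift and the share of subreddits with a positive value follow directly from this definition. The sketch below is illustrative only: the function names are hypothetical and any values passed in would come from the narrowing and broadening measurements of each subreddit.

```python
def net_semantic_shift(narrowing, broadening):
    """Positive means overall narrowing, negative overall broadening."""
    return narrowing - broadening


def share_with_positive_shift(shifts_by_subreddit):
    """Fraction of subreddits whose net semantic shift is positive.

    shifts_by_subreddit maps a subreddit name to (narrowing, broadening).
    """
    nets = [
        net_semantic_shift(n, b) for n, b in shifts_by_subreddit.values()
    ]
    return sum(1 for value in nets if value > 0) / len(nets)
```

The reported percentages are consistent with the counts: 1638 / 2626 is approximately 0.624 and 1529 / 2626 is approximately 0.582.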
In Figure 3, we show that across all subreddits, the sum of net semantic shift in high affinity terms is 20.462 (50.253 - 29.791), whereas the sum of net semantic shift in low affinity terms is 4.402 (17.878 - 13.476). This implies that high affinity terms in general are more likely to attain qualities that are defining of neologisms, and are more likely to be narrowed in communities across Reddit. This is explained by our results, which show that the rate of decrease of semantic broadening is slower than the rate of increase of semantic narrowing (Pearson coefficient of -0.192, p-value < 0.001), as demonstrated by a regression coefficient of -0.148. This trend is consistent when modeling semantic narrowing and semantic broadening with other community characteristics.
Interestingly, in communities with very high affinity averages, we observe several cases where semantic narrowing and semantic broadening are close to 0. Examples of such subreddits are r/kpop, r/chess, and r/Cricket. We notice that these groups contain terms that are essential and almost exclusive to the domain of that community. However, these terms do not undergo extraordinary cultural impetus that causes a shift in meaning. For instance, in r/chess there is little motivation to use "bxe" or "cdf" outside of the context of game moves.
Additionally, we observe the highest semantic shifts in groups that are mostly video game, TV show and sports communities, with high affinity averages being less than 0.5 in most cases; the average high affinity score of the top 100 semantic narrowing groups is 0.367. These lower scores, which tend away from the possible extremes, show that the niche terms that shift the most are also slightly distributed in a few other communities, but clearly dominant in one. Terms that are likely to undergo high levels of semantic shift have the potential to be cross-cultural and adopted by a different group of people. The study of the external influence of high affinity terms in other communities is an area of future research, and may reveal factors that make some high affinity terms more likely to evolve in a short period of time.

Mapping of Frequency
Past studies show that in the long term, words that narrow decrease in frequency (Feltgen et al., 2017; Hilpert and Gries, 2016). However, our results, as shown in Figure 4, indicate that in the short term net semantic shift is strongly correlated with an increase in frequency.
By testing the relationship between $\Delta f_{s_i}$ and net semantic shift, we discovered a strong linear relationship (Pearson coefficient of 0.429, p-value < 0.001).
Language adoption studies have shown that increased familiarization with a word in the short term, measured through frequency, actually enables a person to use the word more accurately and precisely. This is achieved, in both adults and children, through lexically-based learning (Abbot-Smith and Tomasello, 2010). Our results indicate that online communities also employ lexically-based learning in the short term, which may factor into linguistic culture adoption and development. We derive this finding from the fact that an increase in frequency is strongly correlated with semantic narrowing.

Barriers to Entry
In this section, we evaluate the impact high affinity values have on the rate of new users participating in each time period. We calculate the rate of new users $\delta u(s_i)$ from $U$, the set of users in subreddit $s_i$ at time period $t$.
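A simple way to picture this quantity is sketched below. It is an assumption-laden illustration rather than the definition used in this work: for each pair of consecutive time periods it takes the share of the later period's users who were absent from the earlier one, and then averages those shares. The function name `new_user_rate` is hypothetical.

```python
def new_user_rate(users_by_period):
    """Average share of users in a period who were absent from the last.

    ASSUMPTION: newness is judged against the immediately preceding
    period only. users_by_period is a chronological list of sets.
    """
    rates = []
    for previous, current in zip(users_by_period, users_by_period[1:]):
        if current:
            rates.append(len(current - previous) / len(current))
    return sum(rates) / len(rates) if rates else 0.0
```

A community in which every period's commenters were already present in the previous period would score 0 under this reading, and one that is entirely replaced each period would score 1.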
In Table 4 we present our results of regression and correlation testing. We find that dedication shows the strongest correlation with the rate of new users in a community. This suggests that absolute preference is an unlikely indicator of $\delta u(s_i)$.
Although weaker, high affinity terms also show a correlation with $\delta u(s_i)$. However, as shown in Table 4, it is remarkable that dedication and high affinity averages outperform the combination of loyalty and dedication in predicting the value of $\delta u(s_i)$. Loyalty has a stronger correlation with $\delta u(s_i)$ than high affinity averages do, so a model trained on loyalty and dedication would be expected to perform better. However, not only does it not perform better than a model trained on dedication and high affinity averages, it performs worse than a model trained only on dedication. This suggests that loyalty likely captures barriers to entry similar to those captured by dedication, but more poorly. It also suggests that high affinity terms and dedication capture different types of barriers to entry.
Furthermore, we observe that the communities which show the least $\delta u(s_i)$ are mostly about topics that originate outside of Reddit, such as r/NASCAR (sports) and r/SburbRP (sexual roleplay).
These results indicate that there are linguistic and non-linguistic barriers that prevent people from engaging in certain online communities. While this may not be concerning for innocuous topics such as r/chess, issues may arise for ideologically-themed subreddits. In the age of political polarization, hate groups and infamous echo chambers, further research could be conducted into barriers to entry and the role high affinity terms play.

Conclusion and Future Work
Through several analyses we have shown there to be a strong relationship between online community behaviour and several aspects of high affinity terms. We found correlations with subreddit characteristics related to collective user behaviour, especially loyalty. The high affinity terms underwent semantic shift at a high rate given our very condensed timescale. Finally, we showed a relationship between user retention and the presence of these terms.
All three conclusions, and the secondary analyses conducted alongside them, show that high affinity terms have strong potential for further elucidating online community behaviour, and are likely correlated with further characteristics that are more difficult to measure than subreddit loyalty, such as community cohesion (the strength and salience of group identity; Rogers and Lea, 2004) or behaviour leading to the formation of extremist hate groups. Finally, our results and further investigation can contribute to the literature surrounding the relationship between vocabulary and social mobility between groups.