Do Word Embeddings Capture Spelling Variation?

Analyses of word embeddings have primarily focused on semantic and syntactic properties. However, word embeddings have the potential to encode other properties as well. In this paper, we propose a new perspective on the analysis of word embeddings by focusing on spelling variation. In social media, spelling variation is abundant and often socially meaningful. Here, we analyze word embeddings trained on Twitter and Reddit data. We present three analyses using pairs of word forms covering seven types of spelling variation in English. Taken together, our results show that word embeddings encode spelling variation patterns of various types to some extent, even embeddings trained using the skipgram model, which does not take spelling into account. Our results also suggest a link between the intentionality of the variation and the distance of the non-conventional spellings to their conventional spellings.


Introduction
Word embeddings play a key role in many NLP systems. Unfortunately, embeddings are opaque: it is difficult to interpret the individual vector dimensions and to understand which factors contribute to the relational similarity between words nearby in space. There has been an increasing interest in understanding what word embeddings are encoding, for example, to understand their impact on the performance of a wide array of NLP systems (Rogers et al., 2018). Moreover, this is especially important when embeddings are used as research objects themselves, e.g., to study biases in society over time (Garg et al., 2018). The body of work on analyzing embeddings has focused primarily on semantic and syntactic properties (Baroni et al., 2014; Gladkova et al., 2016; Levy and Goldberg, 2014; Mikolov et al., 2013b). Instead, we propose a new perspective on word embeddings by asking how spelling variation is encoded.
In social media, linguistic variation is abundant (Eisenstein, 2013; Tatman, 2015). For example, non-conventional spellings for nothing are nothin, nuthin, noooothing, nithing, and so on. Much work in NLP has focused on normalizing spelling variation (Han and Baldwin, 2011). Similarly, recent work by Piktus et al. (2019) aims to push embeddings of misspelled words closer to the embeddings of their conventional forms. However, these approaches can remove valuable social (Eisenstein, 2013) and semantic (Grieve et al., 2017) signals. Crucially, many non-conventional spellings are not misspellings: by deliberately deviating from conventional spelling norms, writers create social meaning (Sebba, 2007). For example, a certain spelling may be used to evoke intimacy or to index a certain region. Capturing spelling variation patterns is therefore important for many applications, e.g., when analyzing social phenomena (Nguyen et al., 2016).
To illustrate, Figure 1 shows skipgram embeddings of nothing and spelling variants projected on a 2D plane using t-SNE (van der Maaten and Hinton, 2008). This figure clearly suggests structure in the embedding space as forms with different types of spelling variation are pulled apart. On the middle right side we find lengthened forms (e.g., noooooothing), which are often used for emphasis, and at the top we find forms that appear to be misspellings (e.g., nithing) and the conventional form (nothing). In the lower left corner, we find instances of g-dropping (e.g. nothin) and forms reflecting dialect pronunciation (e.g., nuffin, which presumably reflects a dialect pronunciation of the voiceless interdental fricative /θ/). Because spelling is not taken into account by skipgram, this suggests that these forms are used in different contexts. Indeed, research has associated a range of contextual factors with the use of specific types of spelling variations (Eisenstein, 2015). We therefore go beyond looking at whether non-conventional forms are close to their conventional form in the embedding space. Instead, we ask whether embeddings capture fine-grained patterns of spelling variation, such as information about the type of spelling variation (e.g., whether the final 'g' is dropped, or whether the vowels are omitted).
Why does spelling variation matter? If embeddings are supposed to represent language use, we should also expect them to represent meaning associated with the choice of a specific spelling. Non-conventional spellings often carry social meaning, e.g., reflecting and constructing social identities. But, crucially, this has been neglected in NLP at large and in representation learning more specifically, ignoring the fact that in language it matters not only what is said but also how it is said. If the relationship between embeddings of conventional and non-conventional forms is meaningful, then this opens up a range of opportunities for using embeddings to explore questions in computational social science and sociolinguistics. Moreover, there are many downstream applications that would benefit (e.g., community detection, modeling conversation dynamics). And finally, NLP needs to move away from focusing almost exclusively on so-called "standard language", a focus that results in systems that do not work well for many social groups. Treating spelling variation as a meaningful phenomenon is one step in this direction.

Contributions
Our contributions are (i) a new perspective on the analysis of word embeddings by focusing on spelling variation, (ii) a new dataset with seven common types of spelling variation to analyze and evaluate word embeddings (Section 3), and (iii) an empirical investigation using three analyses, revealing that spelling variation is indeed encoded to some extent, even with the skipgram model (which does not take spelling into account), and that there are differences between the types of spelling variation (Section 4). The code is available at https://github.com/dongpng/coling2020.

Related Work
NLP research on spelling variation has mostly focused on text normalization (Han and Baldwin, 2011) and automatically extracting lexical variants (Gouws et al., 2011). Studies within computational sociolinguistics have analyzed the patterns and functions of spelling variation, e.g., in Twitter (Tatman, 2015), and research has suggested a deep connection between phonological and spelling variation (Eisenstein, 2015). Furthermore, Thurlow and Brown (2003) observed that spelling variation also personalizes and informalizes SMS messages. We look at spelling variation but focus on how spelling variation patterns are encoded in embeddings. There has been much interest in analyzing neural representations and more broadly neural networks (Belinkov and Glass, 2019). Besides word embeddings, representations for higher-level units such as sentences have also been analyzed (Conneau et al., 2018). Our study differs from work in this space not only because it is based on spelling but also because it is based on comparing forms with the same referential meaning. In contrast, many evaluation studies (e.g., Gladkova et al. (2016)) use pairs that are not semantically (cat:kitten) or grammatically (argue:argument) equivalent. Our study is more similar to Niu and Carpuat (2017), who used paraphrase pairs to investigate formality information in embeddings, and Shoemark et al. (2018), who focused on three language variety pairs. We are not aware of studies focusing on spelling.

Datasets
This section describes the datasets and the types of spelling variation we use in our analyses.

Data Collection
Spelling variation is especially common in social media. We therefore use data from two popular social media platforms. Our datasets complement each other in terms of their characteristics (Twitter versus Reddit, location constrained versus global English). We focus on posts in English.
Reddit: Six months of Reddit comments (May-Oct 2018), posted in the top 200 subreddits based on comment counts (after manually excluding a few country-specific subreddits to minimize non-English content). The dataset contains 269M posts and a vocabulary of 576k words (min. freq. count: 20).

Types of Spelling Variation
Our analyses are based on pairs of forms that generally have the same referential meaning but different spellings. In this paper, we refer to the form without the spelling variation as the conventional form (e.g., nice) and the form with the spelling variation as the non-conventional form (e.g., niiiiice). We focus on seven common types of spelling variation (see also Table 1), based on observations in our data as well as on types identified in related work:
• Lengthening: The repetition of characters, e.g., thaaaaanks instead of thanks. Lengthening is often used for emphasis and it is a useful signal for sentiment detection (Brody and Diakopoulos, 2011). It has also been found to be used more by younger Twitter users (Nguyen et al., 2013). We automatically searched for forms with their final character repeated (sequences of the same character of length ≥ 3) and for forms where the lengthening occurred by repeating an internal vowel.
• Swapped characters: Pairs for which the difference between the two forms is the swapping of two characters (Pennell and Liu, 2011). One of the forms is required to be included in an English wordlist from Aspell (instances of metathesis).
• Common misspellings: We use a list of common misspellings.
• G-dropping (-ing vs. -in): G-dropping is a phonological alternation that is common in all forms of English. It has a long history of research in sociolinguistics, showing variation among many social factors, including social class and formality (Campbell-Kibler, 2006; Levon and Fox, 2014; Tamminga, 2017). We automatically searched for pairs of -ing and -in and manually selected the ones that were genuine cases of g-dropping, e.g., excluding instances like turin and turing.
• Nearby character substitution: Pairs for which one form is created by replacing a character by another character at an adjacent key (assuming a QWERTY keyboard layout).
• British vs. American spelling: Pairs based on heuristics from online sources. We search for -our vs. -or, -yse vs. -yze, doubling of l (-elled vs. -eled, etc.), -ogue vs. -og and -ence vs. -ense.
We note that there are many other types of spelling variation not considered here, such as number replacement (e.g., 4ever vs. forever) and phonetic respellings (e.g., nite instead of night) (Shortis, 2016; Tagg, 2009). For each type, we selected pairs consisting of a non-conventional (e.g., backkkk) and a conventional form (back). We only included pairs with both forms occurring at least 20 times in each dataset, to have the same pairs for both datasets. Within each of our seven types of spelling variation, a conventional form occurs only once.
Pairs were identified automatically using heuristics, but the final selection was based on manual filtering. Two expert annotators annotated 50 pairs of each type, resulting in substantial to near-perfect agreement (see Table 1). For each pair, example posts were available to help interpretation. Disagreement occurred with cases where it was unclear whether both forms had the same referential meaning, for example with teeets/tweets (substitution for a nearby character) and plantin/planting (g-dropping, where one annotator was concerned that plantin was also a misspelling of plantain). Disagreement also occurred when the conventional form was unclear, e.g., for thiccc (lengthening). One annotator then went through all automatically identified pairs to select valid pairs. The final selection was then checked by a second annotator. We aimed for precision and therefore excluded ambiguous pairs.
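The candidate-pair heuristics described above can be illustrated with a minimal sketch. This is not the released code: the function names, the same-row QWERTY adjacency table, and the run-collapsing test for lengthening are assumptions made for illustration, and, as in the paper, such heuristics only propose candidates that still require manual filtering (the sketch accepts turin/turing, for instance).

```python
import re

# Hypothetical adjacency table: only left/right neighbours on the same QWERTY row.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def adjacent_keys(ch):
    """Return the keys directly left and right of `ch` on its QWERTY row."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return set(row[max(i - 1, 0):i] + row[i + 1:i + 2])
    return set()


def collapse_runs(word):
    """Collapse every run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def is_lengthening(variant, conventional):
    """Variant has a run of >= 3 identical characters and matches the
    conventional form once all runs are collapsed."""
    has_long_run = re.search(r"(.)\1{2,}", variant) is not None
    return (has_long_run and variant != conventional
            and collapse_runs(variant) == collapse_runs(conventional))


def differing_positions(variant, conventional):
    """Indices at which two equal-length forms differ (None if lengths differ)."""
    if len(variant) != len(conventional):
        return None
    return [i for i, (x, y) in enumerate(zip(variant, conventional)) if x != y]


def is_swap(variant, conventional):
    """The two forms differ only by exchanging two characters."""
    diff = differing_positions(variant, conventional)
    return (diff is not None and len(diff) == 2
            and variant[diff[0]] == conventional[diff[1]]
            and variant[diff[1]] == conventional[diff[0]])


def is_g_dropping_candidate(variant, conventional):
    """-ing vs. -in candidate; genuine cases still need manual selection."""
    return conventional.endswith("ing") and variant == conventional[:-1]


def is_nearby_substitution(variant, conventional):
    """Exactly one character replaced by a key adjacent to it."""
    diff = differing_positions(variant, conventional)
    return (diff is not None and len(diff) == 1
            and variant[diff[0]] in adjacent_keys(conventional[diff[0]]))
```

For example, `is_lengthening("thaaaaanks", "thanks")` and `is_nearby_substitution("teeets", "tweets")` both hold, while `is_g_dropping_candidate("turin", "turing")` also holds even though it is not a genuine case, which is why the manual step matters.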

Extra-Linguistic Variation
We analyze whether the types of spelling variation exhibit extra-linguistic variation in our data, by looking at estimated age distributions of Twitter users and subreddit distributions in Reddit.
Twitter: For each form in our lists with spelling variations, we randomly sample five Twitter users. We estimate the demographics of these users using the M3 model (Wang et al., 2019), based on information such as the profile image and user name. The model notably does not make use of the language in tweets. Tweets with lengthened or g-dropped forms were more often written by younger Twitter users (Table 2). For example, 17% of the sampled users who used lengthened forms were estimated to be ≤ 18 years, as opposed to only 10% of the sampled users who used the corresponding conventional forms.
When we look at British versus American spellings, the trend is reversed. Our Twitter dataset was constrained to the London area. Of the users who used the British spelling, a larger proportion was estimated to be older (e.g., 56% of the users were estimated to be ≥ 40 years). On the other hand, of the users who used an American spelling, a larger proportion was estimated to be younger.
Reddit: For each spelling variation type, we rank the subreddits based on their ratio of relative frequencies of conventional vs. non-conventional forms (see Table 3). Unsurprisingly, British forms occur most in UK-focused subreddits. For example, in the ukpolitics subreddit 94.88% of all occurrences of the words in the British/American spelling list are written with a British spelling. Overall, we also find that non-conventional forms occur with low frequency in TranscribersOfReddit, a subreddit for curating and supporting transcriptions of Reddit content, e.g., to assist users who rely on text-to-speech. Moreover, in this subreddit many posts were written by bots. In sum, we find that certain types of spelling variation indeed occur more frequently in certain contexts in our data. As a result, some of this patterning may also be captured in the embeddings.
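A rough sketch of the subreddit ranking is given here. The exact ratio used in the paper is not spelled out, so this sketch makes the simplifying assumption of ranking by the share of non-conventional occurrences among all occurrences of the listed forms; the function name and the input layout (an iterable of (subreddit, tokens) tuples) are hypothetical.

```python
from collections import Counter


def rank_by_nonconventional_share(posts, pairs):
    """Rank subreddits by the share of non-conventional occurrences.

    posts: iterable of (subreddit, list_of_tokens)
    pairs: dict mapping each non-conventional form to its conventional form
    Returns a list of (share, subreddit), highest share first.
    """
    conventional_forms = set(pairs.values())
    counts = {}
    for subreddit, tokens in posts:
        c = counts.setdefault(subreddit, Counter())
        for tok in tokens:
            if tok in pairs:
                c["non_conventional"] += 1
            elif tok in conventional_forms:
                c["conventional"] += 1
    ranking = []
    for subreddit, c in counts.items():
        total = c["non_conventional"] + c["conventional"]
        if total > 0:  # skip subreddits in which no listed form occurs
            ranking.append((c["non_conventional"] / total, subreddit))
    return sorted(ranking, reverse=True)


# Toy input, invented for illustration only.
toy_posts = [
    ("a", ["nothin", "nothing", "nothin"]),
    ("b", ["nothing", "nothing", "nothin", "nothing"]),
]
toy_ranking = rank_by_nonconventional_share(toy_posts, {"nothin": "nothing"})
```

A figure such as the 94.88% reported for ukpolitics corresponds to this kind of share, computed with the British spellings as the counted forms.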

Experiments
We now examine whether spelling variation is represented in a structured way in the embedding space using the selected pairs (Section 3.2). No intrinsic evaluation method is perfect; we therefore present three different types of analysis that complement each other: analogy tests (Section 4.2), a similarity analysis (Section 4.3), and diagnostic classifier tests (Section 4.4). Our goal is not to achieve the highest performance. Instead, by analyzing the performance, and how it differs between spelling variation types, we obtain new insights into what embeddings capture about spelling variation.

Word Embeddings
We experiment with two word embedding models, which map words to a d-dimensional space. The first is the continuous skipgram model with negative sampling (Mikolov et al., 2013a), using the gensim implementation (Řehůřek and Sojka, 2010). The skipgram model does not take the spelling of a word into account. As a natural comparison, we therefore also experiment with fastText (Bojanowski et al., 2017), an extension of the continuous skipgram model, which does take spelling into account: each n-gram is represented by a vector representation and a word is represented by the sum of the vector representations of its n-grams. Both skipgram and fastText use the contexts of words to learn the embeddings. More specifically, for a given word $w_i$ the surrounding words $\{w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}\}$ are used as context. Although contextual factors associated with spelling variation are not taken into account explicitly, using the surrounding words as context could already implicitly capture such factors to some extent, causing spelling variation patterns to be encoded in the embeddings. The embeddings were trained with a min. freq. count of 20, a window size of 5, and dimensionalities from 50 to 300 in steps of 50. We used the default settings for the remaining parameters.
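To make the two ingredients concrete, the following sketch extracts the context window used by both models and the character n-grams that fastText additionally relies on. It is only an illustration: the helper names are hypothetical, and the n-gram range of 3 to 6 with `<` and `>` as boundary markers follows the description in Bojanowski et al. (2017) rather than any setting reported in this paper.

```python
def context_window(tokens, i, k):
    """Words within k positions of tokens[i], excluding tokens[i] itself."""
    return tokens[max(i - k, 0):i] + tokens[i + 1:i + 1 + k]


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the boundary-marked word, fastText style.

    The 3..6 range is an assumption taken from Bojanowski et al. (2017).
    """
    marked = f"<{word}>"
    return [marked[start:start + n]
            for n in range(n_min, n_max + 1)
            for start in range(len(marked) - n + 1)]


# Skipgram sees only the neighbours of a form, never its characters.
tokens = "i aint got nothin to say".split()
neighbours = context_window(tokens, tokens.index("nothin"), k=2)

# fastText additionally shares subword units between spelling variants.
shared = set(char_ngrams("nothin")) & set(char_ngrams("nothing"))
```

Here `neighbours` is `["aint", "got", "to", "say"]`, and `shared` contains units such as `<no` and `thin` but not the word-final `in>`, which only the g-dropped form has. This overlap is one reason why fastText may place such variants close together regardless of context.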

Experimental Setup
Analogies are frequently used to analyze whether word embeddings encode certain relations, by testing whether the right answers can be recovered using vector operations (Mikolov et al., 2013b; Levy and Goldberg, 2014). Here, the analogies are based on pairs of spelling variants. Take two pairs, e.g., ($a$: cookin, $a^*$: cooking) and ($b$: movin, $b^*$: moving). Then, the assumption is that the following holds: cookin − cooking ≈ movin − moving, or $a - a^* \approx b - b^*$. If $b^*$ is unknown, it can be found with:
$$b^* = \arg\max_{x \in V} \cos(x,\; a^* - a + b)$$
where $V$ is the vocabulary. The inputs $a$, $b$ and $a^*$ are excluded as answers. This method is referred to by Levy and Goldberg (2014) as 3COSADD. For example, given cooking − cookin + movin, words are ranked according to their cosine similarity with the resulting vector. The goal is to rank the correct answer (moving) at the top. However, Linzen (2016) notes that results obtained this way can be misleading: the offsets are often very small, so that in practice often just the nearest neighbor to $b$ is returned. We thus follow Linzen (2016) by reporting a control setting that just returns the nearest neighbor of $b$ (ONLY-B), e.g., in our example returning the nearest neighbor of movin. A higher performance using the analogy setting (3COSADD) over the control setting (ONLY-B) would indicate that some relevant information is encoded in the embeddings.
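The difference between the analogy setting and the control setting can be sketched in a few lines of Python. The two-dimensional vectors below are invented for illustration (including the distractor groovin), vectors are unit-normalized before the offset is applied, and the function names are hypothetical; this is a sketch of the general technique, not the code used for the reported results.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    return sum(x * y for x, y in zip(unit(u), unit(v)))


def rank(query, emb, exclude):
    """Rank all words except `exclude` by cosine similarity to `query`."""
    candidates = [w for w in emb if w not in exclude]
    return sorted(candidates, key=lambda w: cosine(emb[w], query), reverse=True)


def three_cos_add(emb, a, a_star, b):
    """3COSADD: rank candidates by similarity to a* - a + b."""
    ua, ua_star, ub = unit(emb[a]), unit(emb[a_star]), unit(emb[b])
    query = [s - x + y for x, s, y in zip(ua, ua_star, ub)]
    return rank(query, emb, {a, a_star, b})


def only_b(emb, a, a_star, b):
    """ONLY-B control: rank candidates by similarity to b alone."""
    return rank(emb[b], emb, {a, a_star, b})


def at(degrees):
    """Toy 2D unit vector at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]


# Invented toy space: g-dropping corresponds to a rotation of 40 degrees.
emb = {"cooking": at(0), "cookin": at(40),
       "moving": at(20), "movin": at(60), "groovin": at(70)}

analogy_ranking = three_cos_add(emb, "cookin", "cooking", "movin")
control_ranking = only_b(emb, "cookin", "cooking", "movin")
```

In this toy space the control setting returns groovin, the nearest neighbor of movin, whereas the offset moves the query close enough to moving for 3COSADD to rank it first. A gap of this kind between the two settings is what the analysis looks for.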
We generate analogy instances by combining each pair (consisting of a conventional and non-conventional form) with a random other pair with the same spelling variation type. We do this ten times per pair, e.g., resulting in 2,580 analogy instances for lengthening. We generate analogies that aim to recover the conventional form. Pairs with a lengthened form are matched to pairs with the same amount of lengthening. We report the accuracy of the top-ranked answer and the mean reciprocal rank.
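Instance generation and the two reported metrics can be sketched as follows. The function names are hypothetical, sampling with replacement and the fixed seed are assumptions, and the additional constraint that lengthened pairs are matched on the amount of lengthening is left out for brevity.

```python
import random


def make_instances(pairs, n_per_pair=10, seed=0):
    """Combine each (non-conventional, conventional) pair with random other
    pairs of the same type, yielding tuples (a, a_star, b, b_star)."""
    rng = random.Random(seed)
    instances = []
    for i, (b, b_star) in enumerate(pairs):
        others = pairs[:i] + pairs[i + 1:]
        for _ in range(n_per_pair):
            a, a_star = rng.choice(others)  # assumption: sampled with replacement
            instances.append((a, a_star, b, b_star))
    return instances


def reciprocal_rank(ranking, gold):
    """1 / rank of the gold answer, or 0 if it is not in the ranking."""
    return 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0


def evaluate(rankings, golds):
    """Return (accuracy of the top-ranked answer, mean reciprocal rank)."""
    n = len(golds)
    accuracy = sum(bool(r) and r[0] == g for r, g in zip(rankings, golds)) / n
    mrr = sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds)) / n
    return accuracy, mrr


toy_pairs = [("cookin", "cooking"), ("movin", "moving"), ("walkin", "walking")]
toy_instances = make_instances(toy_pairs)

toy_rankings = [["moving", "groovin"], ["walkin", "walking"], ["x", "y", "talking"]]
toy_golds = ["moving", "walking", "talking"]
toy_accuracy, toy_mrr = evaluate(toy_rankings, toy_golds)
```

With ten instances per pair, a type with 258 pairs yields the 2,580 instances mentioned above. In the toy evaluation the gold answers sit at ranks 1, 2 and 3, giving an accuracy of 1/3 and an MRR of (1 + 1/2 + 1/3) / 3.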

Results
The results are presented in Figure 2 (Mean Reciprocal Rank (MRR)) and Table 4 (accuracy).
3COSADD versus the control setting (ONLY-B): In most settings, the analogy setting (3COSADD) attains a higher performance than the control setting (ONLY-B).
Spelling variation types: The performance on g-dropping pairs is generally high and a strong improvement over the control setting (ONLY-B) is observed when using the analogy setup (3COSADD). Here, we also see a clear improvement in performance as the number of dimensions increases. FastText does particularly well, probably because the conventional forms can be recovered using fixed operations (e.g., adding a 'g').
In contrast, performance on the lengthening pairs is very low, with near zero accuracies. The control setting (ONLY-B) performs badly, and through manual inspection we found that both embedding models return many lengthened forms. This might be exacerbated with fastText, which seems to be sensitive to the n-grams in lengthened forms. However, even for lengthening we do observe a slight improvement from 3COSADD over the control setting (ONLY-B), for both fastText and skipgram.
For the British/American spelling pairs we observe a clear performance difference between the two datasets. The performance on Twitter is much lower, but this is unsurprising as the Twitter dataset is constrained to the London area. The performance for the control setting (ONLY-B) on the swapped characters, common misspellings and British/American spelling pairs is relatively high. Our second analysis (Section 4.3) helps us to understand this trend.
FastText vs. skipgram: The skipgram model does not take a word's spelling into account, while the fastText model does. FastText does particularly well on g-dropping (Figure 2a). This is expected, as the right answer for these cases can be recovered exactly using orthographic operations. However, the skipgram model's performance is notable as the analogy setting (3COSADD) improves over the control setting (ONLY-B) for most cases. Our results therefore indicate that skipgram, a spelling-agnostic embedding method, captures contextual factors that correlate with the use of different types of spelling variation. Moreover, in many settings it attains a higher performance than fastText. We note that we have not tuned the models' hyperparameters, and more in-depth comparisons are thus left for future work.

Word Similarity
To shed more light on the analogy results, we also compute the average cosine similarity between the embeddings of the non-conventional and conventional forms of each pair (Table 5).
With skipgram, we can broadly distinguish between two cases. We find that the types with the lowest average cosine similarities are the ones where the spelling variation is very likely intentional: g-dropping, lengthening and the omission of vowels. In contrast, the cosine similarity is higher for the types where the variation is probably often unintentional: swapping of characters and common misspellings. When spelling variation is intentional, we expect its usage to be highly context dependent (e.g., the same author might or might not use lengthening depending on the situation).
Table 5: Mean cosine similarities (300 dim.), averaged over both Twitter and Reddit embeddings.
Different from these types of spelling variation are the British versus American spellings. Both forms are highly conventional and exhibit strong regional variation. Generally, we would expect less variation across social contexts and within individuals. For these pairs we also find high cosine similarities between the British and American spellings, for both fastText and skipgram.
Overall, fastText and skipgram show similar trends, except for g-dropping. These pairs tend to have a high cosine similarity with fastText embeddings. This is not surprising: the forms only differ in one character and fastText takes the n-grams into account. However, g-dropping tends to be used in specific social contexts, and fastText embeddings might represent the conventional and non-conventional forms artificially close to each other. Generally, the spelling variation types with high cosine similarity between the pairs are also the ones where just returning the nearest neighbor (ONLY-B) performs relatively well (Figure 2), for example, for common misspellings.
To test whether the type of spelling variation indeed has a significant effect on the cosine similarity between the conventional and non-conventional forms, we fit a linear regression model with the cosine similarity as the dependent variable. As the independent variables we include the spelling variation type, dataset, embedding model, and corpus frequencies (for both the conventional and non-conventional form). The spelling variation types remain highly significant. The model achieves a fit of R² = 0.37.
As reference points (Table 5), we report cosine similarities for random pairs and pairs from two linguistic relations from BATS (Gladkova et al., 2016), verb infinitive:participle (ask:asking) and noun:plural (car:cars). The cosine similarities are low for the random pairs. For the BATS pairs, the similarities are higher, but still lower than or comparable to those of the common misspellings and British/American spelling pairs.

Diagnostic Classifiers
As highlighted by Linzen (2016) and Rogers et al. (2017), analogy experiments are limited as linguistic properties may be encoded in embeddings in ways not visible through a linear offset. Following Hupkes et al. (2018), Adi et al. (2017), Zhu et al. (2018), and others, we perform classification experiments using so-called diagnostic or probing classifiers. We build classifiers to predict the type of spelling variation based on the word embeddings alone, building on the core assumption that their performance reflects the extent to which information about spelling variation is encoded in the embeddings.

Experimental Setup
We frame the task as a multi-class classification problem: for a given pair (consisting of a non-conventional and conventional form), predict the correct type of spelling variation of the non-conventional form. For example, for the pair thaaaaanks and thanks, the correct output is the lengthening class. In total, we have 794 pairs and seven classes. We use logistic regression with L2 regularization implemented using scikit-learn (Pedregosa et al., 2011), opting deliberately for a linear classifier. No parameter tuning was performed and results are reported using ten-fold cross validation. With the analogy experiments, a higher number of dimensions generally led to better performance. We therefore only experiment with 300 dimensions. We experiment with both normalized and unnormalized embeddings.
Control settings: A challenge is that certain types of spelling variation are associated with certain types of words (similar to problems highlighted by Levy et al. (2015)). For example, in our data g-dropping occurs mostly with verbs. Thus, a classifier might just learn to recognize, say, verbs to identify g-dropping instead of the real phenomenon. We therefore include a control setting (CONTROL1), in which a classifier predicts the spelling variation type based on the embeddings of the conventional form alone. Moreover, word embeddings encode information about word frequency (Schnabel et al., 2015), which can correlate with the type of spelling variation. For example, in our data, the g-dropping forms have higher corpus frequencies than the character substitution forms. We therefore have another control setting that extends the previous one with a feature indicating the frequency of the non-conventional form (on a log scale), CONTROL2. A classifier being able to attain better performance than these control settings is evidence that embeddings encode information associated with the type of spelling variation.
Main setting: Our main classifier is DIFF-EMB, which predicts the type of spelling variation based on the conventional form's embedding subtracted from the non-conventional form's embedding. Table 6 shows both macro and micro F1 scores. We see a clear performance improvement of DIFF-EMB over the control settings.
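The three feature representations can be sketched as follows. The helper names, the toy embeddings and the toy frequencies are hypothetical, and the natural logarithm is an assumption, since the text only states that a log scale is used.

```python
import math


def diff_emb_features(emb, non_conventional, conventional):
    """DIFF-EMB: non-conventional embedding minus conventional embedding."""
    return [x - y for x, y in zip(emb[non_conventional], emb[conventional])]


def control1_features(emb, non_conventional, conventional):
    """CONTROL1: the conventional form's embedding only."""
    return list(emb[conventional])


def control2_features(emb, freq, non_conventional, conventional):
    """CONTROL2: CONTROL1 plus the log frequency of the non-conventional form."""
    return list(emb[conventional]) + [math.log(freq[non_conventional])]


# Toy 3-dimensional embeddings and corpus counts, invented for illustration.
toy_emb = {"thanks": [0.2, 0.1, 0.4], "thaaaaanks": [0.5, 0.1, 0.0]}
toy_freq = {"thanks": 90000, "thaaaaanks": 40}

x_diff = diff_emb_features(toy_emb, "thaaaaanks", "thanks")
x_ctrl1 = control1_features(toy_emb, "thaaaaanks", "thanks")
x_ctrl2 = control2_features(toy_emb, toy_freq, "thaaaaanks", "thanks")
```

One row per pair, with the spelling variation type as the label, could then be passed to a linear classifier such as scikit-learn's `LogisticRegression` and scored with ten-fold cross validation. Note that CONTROL1 never sees the non-conventional form, so any gain of DIFF-EMB over it has to come from information about the variant itself.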

Results
Control settings: CONTROL1 reaches relatively high F1 scores, even though its predictions are not based on the embeddings of the non-conventional forms, because many types of spelling variation tend to occur with certain word categories. For example, the classifier using normalized fastText embeddings trained with Twitter data (setting 5) obtains a high F1 score (0.91) for g-dropping. It also reaches good performance for British/American spelling pairs (F1 score: 0.72) and for lengthening (F1 score: 0.75). In our data, for example, lengthening occurs mostly on interjections and on highly conversational word forms, as well as words associated with expressing information about quality and quantity. Low performance was obtained for the vowel omission (F1 score: 0.04) and keyboard substitution (F1 score: 0.00) pairs. These two types also have fewer instances in our data. We also see a consistent but small improvement with CONTROL2, which adds frequency information. For both CONTROL1 and CONTROL2, unnormalized embeddings perform slightly better than normalized embeddings on macro F1 scores. Finally, we note that the results are considerably higher than what a majority-class classifier, which would always predict lengthening, achieves (F1 0.32 micro, 0.07 macro).
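The majority-class figures can be checked with a short calculation. The count of 258 lengthening pairs is not stated directly; it is inferred here from the 2,580 lengthening analogy instances generated at ten per pair, so it should be read as an assumption. The function name is hypothetical.

```python
def majority_baseline_f1(n_majority, n_total, n_classes):
    """Micro and macro F1 of a classifier that always predicts the majority class."""
    precision = n_majority / n_total   # every item is predicted as the majority class
    recall = 1.0                       # so all true majority items are retrieved
    f1_majority = 2 * precision * recall / (precision + recall)
    micro = n_majority / n_total       # micro F1 equals accuracy for single-label tasks
    macro = f1_majority / n_classes    # the remaining classes all have F1 = 0
    return micro, macro


# 258 lengthening pairs (inferred), 794 pairs in total, seven classes.
baseline_micro, baseline_macro = majority_baseline_f1(258, 794, 7)
```

This gives a micro F1 of about 0.325 and a macro F1 of about 0.070, consistent with the reported 0.32 and 0.07.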
Main setting: DIFF-EMB achieves a clear performance improvement over the control settings, which is evidence that the embeddings capture useful information that goes beyond information associated with the conventional form. Performance using unnormalized embeddings is consistently better; one possible reason is that the norm of an embedding is related to the corpus frequency of the form. Better performance is also achieved with Reddit embeddings. Furthermore, fastText embeddings perform slightly better than skipgram embeddings, which is expected given that fastText takes the spelling of words into account.
On the Reddit data, both (normalized) fastText and skipgram embeddings (settings 1 and 3) lead to high performance (F1 score ≥ 0.90) for British/American pairs, g-dropping, lengthening, and common misspellings. Performance was low for keyboard substitution pairs, which were often misclassified as having swapped characters.

Conclusion
Deviations from conventional spelling norms often carry social meaning (Sebba, 2007). Depending on the application, it may thus not be desirable to map all spelling variants to the same point in the embedding space, especially in the areas of computational social science (Lazer et al., 2009) and computational sociolinguistics (Nguyen et al., 2016). This paper is a first step towards analyzing how spelling variation is captured in word embeddings. We presented three analyses in which we looked at seven types of spelling variation. Taken together, we found that even embeddings trained using the skipgram model, which does not take spelling into account, encode spelling variation patterns to some extent. Moreover, our analysis of the cosine similarities between non-conventional and conventional forms for different types of spelling variation suggests a link between the intentionality of the variation and the distance to the conventional forms. A possible explanation is that intentional spelling variation might occur in more specific contexts; future work should investigate this more deeply. We also note that although non-conventional spellings generally have the same referential meaning as their conventional form, they can sometimes also mark specific new meanings of words (Grieve et al., 2017). As future work, contextual embeddings (Devlin et al., 2019; Peters et al., 2018) and other spelling variation types can be explored. We also plan to investigate other classifiers and feature representations to capture the interaction between two embeddings for our diagnostic classifiers (Levy et al., 2015).