Investigating African-American Vernacular English in Transformer-Based Text Generation

The growth of social media has encouraged the written use of African American Vernacular English (AAVE), which has traditionally been used only in oral contexts. However, NLP models have historically been developed using dominant English varieties, such as Standard American English (SAE), due to text corpora availability. We investigate the performance of GPT-2 on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs, thereby isolating syntactic structure and AAVE- or SAE-specific language for each pair. We evaluate each sample and its GPT-2 generated text with pretrained sentiment classifiers and find that while AAVE text results in more classifications of negative sentiment than SAE, the use of GPT-2 generally increases occurrences of positive sentiment for both. Additionally, we conduct human evaluation of AAVE and SAE text generated with GPT-2 to compare contextual rigor and overall quality.


Introduction
African American Vernacular English (AAVE) is a sociolinguistic variety of American English distinct from Standard American English (SAE) with unique syntactic, semantic, and lexical patterns (Green, 2002; Jones, 2015). Millions of people from predominantly Black communities in the United States and Canada use variants of AAVE on a daily basis. Although AAVE has historically been used in spoken contexts, the growing use of social media has encouraged AAVE in written media for which NLP models are increasingly being used.
Recent work in Natural Language Generation (NLG) has introduced GPT-2, a Transformer-based language model that generates high-quality, coherent text when prompted by arbitrary input (Radford et al., 2019). However, GPT-2 displays bias towards particular social groups (Solaiman et al., 2019). Sheng et al. (2019) show that NLG tools are biased with regard to the subject of a sentence when that subject belongs to an underprivileged group, and Shen et al. (2018) test sentiment analysis tools with intent-controlled pairs with varying stylistic inclinations. Studies regarding AAVE have analyzed tasks such as POS tagging (Jørgensen et al., 2016), detecting AAVE syntax (Stewart, 2014), voice recognition and transcription (Dorn, 2019), dependency parsing (Blodgett et al., 2016), and hate speech detection (Sap et al., 2019), but not language generation. Coupled with concerns that NLG tools can be used for generating fake news (Gehrmann et al., 2019) or impersonating internet users (Zellers et al., 2019), it is important that current work investigates the contexts in which NLG models display bias against certain demographics.
In this paper, we examine the bias of GPT-2 text generation against AAVE features. We create a new dataset of AAVE/SAE content-controlled pairs by retrieving AAVE tweets and employing human translators to obtain their SAE counterparts. By doing so, we isolate AAVE syntactic structures and lexical items. We then prompt GPT-2 with the first segments of each AAVE/SAE pair. The generated text is compared to its corresponding original second segment by BLEU, ROUGE, and sentiment scores. Additionally, we provide human evaluation for the generated text based on context and quality.
Thus, our contributions include:
• An intent-equivalent dataset of AAVE/SAE pairs with differences only in syntactic structure and dialect-specific vocabulary.
• New evaluation of GPT-2 using sentiment analysis, BLEU, and ROUGE scores of its generated text and the original SAE and AAVE segments.
• Human evaluation of GPT-2 generated text for each AAVE/SAE pair, where evaluation is conducted to identify contextual accuracy, quality, and likelihood of being categorized as machine-generated.

Figure 1: Terms used to refer to segments of each AAVE/SAE pairwise sample. Each first segment is used to prompt its respective generated segment and sentiments are taken of the second and generated segments.

Dataset
Our dataset consists of tweets identified as having at least 99.9% confidence of using AAVE lexical items by the TwitterAAE dataset (Blodgett et al., 2016). We then obtain the SAE equivalent of each of these tweets by employing Amazon Mechanical Turk (AMT) annotators for a total of n = 2019 AAVE/SAE pairs. The average length of the original AAVE tweets is about 21 words, and the average length of the SAE counterparts is about 22 words. These samples are intended to be used as a test set for probing neural language model-based text generation. We use the terms "first segment," "second segment," and "generated segment" to refer to the different sections of each AAVE/SAE sample throughout this paper. A visualization of these partitions can be seen in Figure 1.
TwitterAAE (Blodgett et al., 2016) collects AAVE tweets by using a distantly supervised mixed-membership model on samples that are geolocated to African-American block groups, as defined by U.S. Census data. The tweets have been filtered to ensure conversational language and verified as AAVE on the basis of AAVE-specific lexical item inclusion, phonological phenomena in orthographic variation, and syntactic construction. From TwitterAAE, we randomly sample tweets that contain at least 15 words and have a posterior probability of being demographically aligned to AAVE of at least 99.9%. We remove hashtags, as they are social media-specific occurrences, and emoji, since we expect them to have a disproportionate influence on sentiment scores.

Pairwise Sample Collection
To investigate GPT-2 generated text on AAVE versus SAE, we use (small) GPT-2 (Radford et al., 2019) from OpenAI for text generation, which is pretrained on web pages linked from Reddit posts that received at least three karma.
Although prior work exists in using unsupervised word embeddings to create vector space-aligned demographic translations (Shen et al., 2018; Lample et al., 2018), we instead use human translation for accuracy purposes. We therefore employed AMT annotators to obtain the SAE equivalents of our AAVE samples.
Each AMT worker was given an AAVE tweet sample, first as a whole for context and then split into a first segment and a second segment. The latter consisted of the last five words of the sample, so as to take approximately a third of the full sample (see Figure 1). We asked annotators to translate the first and second segments individually into SAE; this partition was necessary for use with GPT-2, BLEU, and ROUGE. We provided example translations, and the full instructions can be seen in the AAVE to SAE protocol annotation guidelines. Annotators were filtered by HIT approval rate (higher than 97%) and location (within the United States). Additional instructions included either expanding or providing a contextual equivalent for acronyms, insertion of SAE-appropriate grammar, and preservation of overall structure and intent of the AAVE sample. Annotators were also told to translate the n-word, but to retain non-AAVE-specific explicit language.
Dataset Viability We test the variability of our dataset's results by taking 1000 random partitions of size 1500 and using DistilBERT (Sanh et al., 2019) to find the average sentiment score of each partition. For each subset of our data (both SAE and AAVE, with and without generation by GPT-2), the sample variance across partitions is under 0.02%.
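The variability check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: it assumes each "random partition" is a random subset drawn without replacement, that per-sample sentiment scores have already been computed, and the function name `partition_mean_variance` and the toy scores are hypothetical.

```python
import random
import statistics


def partition_mean_variance(scores, n_partitions=1000, partition_size=1500, seed=0):
    """Average each random subset of `scores` and return the sample
    variance of those subset averages."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_partitions):
        # Assumption: a "partition" is a subset sampled without replacement.
        subset = rng.sample(scores, partition_size)
        means.append(statistics.fmean(subset))
    return statistics.variance(means)


# Placeholder scores on the -1 to 1 scale; these are NOT data from the paper.
toy_rng = random.Random(42)
toy_scores = [toy_rng.uniform(-1.0, 1.0) for _ in range(2019)]
spread = partition_mean_variance(toy_scores)
print(f"variance of partition means: {spread:.6f}")
```

A small variance across partitions suggests that the average sentiment does not depend heavily on which samples happen to be included.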

Semantic Evaluation
Previous work has shown that non-AAVE speakers often fail to demonstrate comprehension of AAVE speech, and we acknowledge that such misunderstandings may influence the intent-equivalence of our dataset (Jones et al., 2019). Thus, to determine the semantic validity of the translations, we asked annotators who self-identified as native AAVE speakers and/or code-switchers to verify whether translated SAE phrases preserved the meaning of original AAVE phrases.
Of 156 randomly sampled AAVE/SAE pairs, 90% are intent-equivalent according to native AAVE speakers, and 95% according to code-switchers. This confirms that the majority of our pairs have semantic equivalence. We have included the instructions for this validity check in the Semantic equivalence protocol.

Sentiment Analysis
We use a sentiment analysis pipeline from Hugging Face to evaluate the sentiment of our samples. The pipeline uses distilbert-base-uncased-finetuned-sst-2-english, which is fine-tuned on movie reviews from the Stanford Sentiment Treebank (Socher et al., 2013). In addition to the DistilBERT sentiment classifier, we use VADER, which is a lexicon and rule-based sentiment analysis tool that is attuned to social media-specific sentiment intensity (Hutto and Gilbert, 2015), and TextBlob, which does not have documentation on its implementation. However, we justify our use of the latter through its widespread use as an off-the-shelf sentiment classifier, such as in Sheng et al. (2019). The DistilBERT sentiment classifier restricts classifications to either positive or negative, with degrees of confidence ranging from 0 to 1; we translate this to a -1 to 1 negative-to-positive scale. From VADER we use the compound score, and from TextBlob the polarity; both metrics are normalized and weighted and thus also range from -1 to 1. VADER and TextBlob scores include 0.0, or neutral, while the DistilBERT sentiment classifier does not. We average the VADER and TextBlob scores in Table 1 to account for model variability in the sentiment classifiers, but keep the DistilBERT scores separate because DistilBERT does not include neutral classifications.
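The exact conversion of classifier output to the -1 to 1 scale is not spelled out, so the following is only a sketch of one plausible reading: the DistilBERT confidence is signed by its label, and the VADER compound score and TextBlob polarity are averaged. The function names are hypothetical, and the input dictionaries merely mimic the label/score shape of a text-classification prediction rather than calling any model.

```python
def signed_score(prediction):
    """Map a {'label': ..., 'score': ...} prediction to the -1 to 1 scale.

    Assumption: a POSITIVE label keeps its confidence and a NEGATIVE label
    negates it. The source does not state the exact mapping.
    """
    label = prediction["label"].upper()
    score = float(prediction["score"])
    if label == "POSITIVE":
        return score
    if label == "NEGATIVE":
        return -score
    raise ValueError(f"unexpected label: {prediction['label']!r}")


def vader_textblob_average(compound, polarity):
    """Average a VADER compound score and a TextBlob polarity (both in [-1, 1])."""
    return (compound + polarity) / 2.0


def polarity_class(score):
    """Bucket a score as negative, neutral (exactly 0.0), or positive."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Illustrative values only; these are not results from the paper.
print(signed_score({"label": "NEGATIVE", "score": 0.9}))
print(polarity_class(vader_textblob_average(0.5, -0.5)))
```

Under this mapping the DistilBERT score can never be exactly 0.0, which is consistent with keeping it separate from the VADER-TextBlob average.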
Baseline As a baseline, we compare the sentiment of each AAVE original second segment to its respective SAE original second segment. We observe that the pretrained sentiment analysis models categorize AAVE as more negative than SAE, despite having the same intent. AAVE has 157 (7.7%) more negative instances than it does positive when using DistilBERT and 37 (1.8%) more negative and neutral instances when using the VADER-TextBlob average. The VADER-TextBlob averages appear to be less biased against AAVE than DistilBERT.

Sentiment Comparison of Generated Text
To determine whether GPT-2 generates more negative phrases when provided AAVE text, we compare the sentiment of the generated segment for AAVE to its corresponding generated segment for SAE. For DistilBERT we see that the average for AAVE generated segments is -0.0769, while its SAE counterpart is -0.0399 (see Table 1). This indicates that the AAVE GPT-2 generated segments are more negative than their corresponding SAE segments. We see the same trend for the VADER-TextBlob average, where the AAVE generated segment has a more negative sentiment score than its corresponding SAE segment. Additionally, in the case of the VADER-TextBlob average, the negative sentiments of the original second segments for SAE and AAVE differ by a margin of 0.57%, whereas the difference between the generated negative sentiments is 6.93%, with AAVE being more negative. This shows that even though AAVE has more positive instances than SAE for its original second segment, the use of GPT-2 increases negative sentiment more for AAVE than for SAE.
We also perform a McNemar-Bowker significance test on the results from Table 1 and find a significant difference between the original and generated sentiments for DistilBERT AAVE, VADER AAVE and SAE, and TextBlob AAVE and SAE with α = 0.05. VADER and TextBlob for both AAVE and SAE had p < 0.01. DistilBERT for AAVE had p = 0.012, while DistilBERT for SAE had p = 0.11 and is therefore not significant at this level.
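For readers unfamiliar with the McNemar-Bowker test, the sketch below shows how such a symmetry test can be computed on a square table of paired classifications (rows for the original segment's class, columns for the generated segment's class). It is an illustration under assumptions rather than the analysis code used here: the counts are placeholders, the function names are hypothetical, off-diagonal pairs with zero total are skipped when counting degrees of freedom, and the chi-square tail is computed with the closed-form series that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df."""
    if df <= 0:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        series = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * series
    series = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                 for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series


def bowker_test(table):
    """Return (statistic, df, p_value) for Bowker's test of symmetry."""
    k = len(table)
    statistic, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total = table[i][j] + table[j][i]
            if total > 0:  # assumption: empty pairs contribute no df
                statistic += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    p_value = chi2_sf(statistic, df) if df > 0 else 1.0
    return statistic, df, p_value


# Placeholder counts (negative, neutral, positive); NOT data from the paper.
toy_table = [
    [30, 5, 10],
    [8, 40, 25],
    [6, 12, 50],
]
statistic, df, p_value = bowker_test(toy_table)
print(f"statistic={statistic:.3f}, df={df}, p={p_value:.4f}")
```

For a 2×2 table, as with DistilBERT's positive and negative classes, this reduces to McNemar's test without continuity correction.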

Flipped Sentiment
We compare the sentiment of the second segment of each AAVE phrase to the sentiment of its generated segment and do the same for each corresponding SAE sample. This allows us to observe the extent to which GPT-2 flips the sentiment from positive to negative and vice versa, and whether flipping from positive to negative sentiment is more prevalent in AAVE.
We find that AAVE samples have lower sentiment scores than their SAE equivalents with the classifiers we used. With DistilBERT, the AAVE generated segments increase in sentiment score relative to their original second segments, going from -0.1436 to -0.0769 on the -1 to 1 scale, while SAE generated segments decrease from 0.0066 to -0.0399 (see Table 1). The VADER-TextBlob average does not show this divergence: the sentiment scores increase for both AAVE and SAE generated segments when compared to their respective second segments.
For the VADER-TextBlob average in Table 1, AAVE generated segments are 50.38% less neutral than their original second segments, and SAE generated segments are 46.8% less neutral. While the majority of the original second segments are classified as neutral, the majority of the generated segments are instead classified as positive. However, SAE has a larger increase in positive sentiment scores than AAVE, even though its original positive sentiment was lower than AAVE's corresponding original sentiment.
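The flip comparison in this section can be illustrated with a small tally of class transitions between each original second segment and its generated segment. This is a sketch under assumptions: scores are taken to be on the -1 to 1 scale with exactly 0.0 treated as neutral, the function names are hypothetical, and the example scores are placeholders rather than data from the paper.

```python
from collections import Counter


def polarity_class(score):
    """Bucket a score as negative, neutral (exactly 0.0), or positive."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def transition_counts(original_scores, generated_scores):
    """Count (original class, generated class) pairs over aligned samples."""
    if len(original_scores) != len(generated_scores):
        raise ValueError("score lists must be aligned sample by sample")
    return Counter(
        (polarity_class(o), polarity_class(g))
        for o, g in zip(original_scores, generated_scores)
    )


def flip_rate(counts, source, target):
    """Share of samples starting in `source` whose generated class is `target`."""
    from_source = sum(n for (orig, _), n in counts.items() if orig == source)
    return counts[(source, target)] / from_source if from_source else 0.0


# Placeholder scores; NOT data from the paper.
original = [0.4, 0.0, -0.3, 0.0, 0.2]
generated = [-0.1, 0.5, 0.2, 0.0, 0.3]
counts = transition_counts(original, generated)
print(flip_rate(counts, "positive", "negative"))
```

Running the same tally separately on the AAVE and SAE samples would allow the positive-to-negative flip rates of the two varieties to be compared directly.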

Quality of Generated Text
We use BLEU, ROUGE, and human evaluation scores to determine the difference in the quality of GPT-2 generated text for SAE and AAVE samples.
BLEU and ROUGE For all SAE and AAVE samples, we isolate the second segment of the original sample, for which we take the last five words, and the first five words generated by GPT-2. We then compare the generated segment to the original second segment by calculating their BLEU and ROUGE scores. Specifically, ROUGE-1 and ROUGE-2 measure the overlap of unigrams and bigrams respectively, and ROUGE-L identifies the longest co-occurring sequence between a generated phrase and a reference phrase. BLEU-1, 2, and 3 are the cumulative 1-gram, 2-gram, and 3-gram scores for these pairs of phrases.
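To make the overlap metrics concrete, the sketch below splits a sample into its first segment and five-word second segment and computes simplified ROUGE-1, ROUGE-2, and ROUGE-L scores. It is an illustration under assumptions, not the evaluation code used here: tokenization is plain whitespace splitting, each score is reported as an F1 value (the source does not say which variant was used), all function names are hypothetical, and the example sentences are invented.

```python
from collections import Counter


def split_segments(text, n_last=5):
    """Split text into (first segment, second segment of the last n_last words)."""
    words = text.split()
    return " ".join(words[:-n_last]), " ".join(words[-n_last:])


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall, or 0.0 when there is no overlap."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """Simplified ROUGE-N: F1 over clipped n-gram overlap."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """Simplified ROUGE-L: F1 over the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Invented example; the generated segment is hand-written, not GPT-2 output.
first, second = split_segments("I am not sure what time we are going to the store today")
generated = "going to the mall today"
print(first, "|", second)
print(rouge_n(generated, second, 1), rouge_n(generated, second, 2), rouge_l(generated, second))
```

With segments of only five words, a single mismatched word removes two of the four bigrams, which helps explain why scores on such short spans tend to be low.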
Both BLEU and ROUGE results indicate that GPT-2 typically generates more accurate sentences for SAE than for AAVE (see Figures 2 and 3). We note that the BLEU and ROUGE scores are relatively low since the comparison is between incomplete sentences of only five words.
We use a Wilcoxon rank-sum test to determine the significance of our BLEU and ROUGE results. With α = 0.05, ROUGE-1 and ROUGE-L are significant. Additional p-values can be found in Table 2.
Human Evaluation We also conduct human evaluation using AMT to assess the quality of the text generated by GPT-2. Annotators were filtered by HIT approval rate (higher than 95%) and location (within the United States). They were given the first segment of an SAE phrase for context, followed by its corresponding GPT-2 generated segment. We did the same with each corresponding AAVE phrase. Annotators were asked to choose which one of the two generated phrases better fits the context of the respective first segment, which one has better quality, and which one is most likely machine-generated. Ties were allowed for this task. The annotator instructions for this task can be found in the Human evaluation protocol.
Results show that 21.7% more annotators indicate that SAE generated segments have better quality than their corresponding AAVE generated segments, and 12% more annotators indicate that SAE generated segments fit the context better than their AAVE generated segment counterparts (see Table 3). To determine existing bias in human evaluation, we perform the same evaluation on the original second segments of AAVE/SAE pairs and find that 48% choose the SAE original second segments as likely machine-generated, while 31% choose the AAVE original second segments. Looking at Table 3, the proportion of annotators who select SAE as machine-generated decreases to 37.3%, whereas the proportion for AAVE increases to 42.1%. This indicates that GPT-2 worsens the quality of AAVE segments while improving the quality of SAE segments. These findings support our results from BLEU and ROUGE in demonstrating the unequal quality of GPT-2's text generation for SAE and AAVE, thus signifying a bias against AAVE.

Conclusion
Through this work, we highlight the need for AAVE-inclusivity in NLG models, especially those perceived as state-of-the-art. To this end, we provide a new evaluation of NLG models by comparing GPT-2's behavior on SAE and AAVE. In addition, we present a new dataset consisting of intent-parallel AAVE/SAE tweet pairs, which can be used in future works studying SAE and AAVE in NLP models. Our sentiment analysis experiments indicate that GPT-2 produces more negative instances when prompted with AAVE text. Moreover, our BLEU, ROUGE, and human evaluation results reveal a disparity in the quality of GPT-2's text generation between AAVE and SAE. We hope our findings can pave the way for further inclusion of diverse language in future NLG models.

Table 3: Human evaluation results, where "MG" refers to "Machine Generated." Tests are conducted pairwise between generated SAE and AAVE phrases.