Learning to Recognize Dialect Features

Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in “He ∅ running”. In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.


Introduction
Dialect variation is a pervasive property of language, which must be accounted for if we are to build robust natural language processing (NLP) systems that serve everyone. Linguists do not characterize dialects as simple categories, but rather as collections of correlated features (Nerbonne, 2009), such as the one shown in Figure 1; speakers of any given dialect vary regarding which features they employ, how frequently, and in which contexts. In comparison to approaches that classify speakers or documents across dialects (typically using metadata such as geolocation), the feature-based perspective has several advantages: (1) it allows fine-grained comparisons of speakers or documents within dialects, without training on personal metadata; (2) it disentangles the grammatical constructions that make up the dialect from the content that may be frequently discussed in the dialect; (3) it enables robustness testing of NLP systems across dialect features, helping to ensure adequate performance even on cases other than "high-resource" varieties such as mainstream U.S. English (Blodgett et al., 2016); (4) it helps to develop more precise characterizations of dialects, enabling more accurate predictions of variable language use and better interpretations of its social implications (e.g., Craig and Washington, 2002; Van Hofwegen and Wolfram, 2010).
The main challenge for recognizing dialect features computationally is the lack of labeled data. Annotating dialect features requires linguistic expertise and is prohibitively time-consuming given the large number of features and their sparsity. In dialectology, large-scale studies of text are limited to features that can be detected using regular expressions over surface forms and parts-of-speech, e.g., PRP DT for the copula deletion feature in Figure 1; many features cannot be detected with such patterns (e.g., OBJECT FRONTING, EXTRANEOUS ARTICLE). Furthermore, part-of-speech tagging is unreliable in many language varieties, such as regional and minority dialects (Jørgensen et al., 2015; Blodgett et al., 2016). As dialect density correlates with social class and economic status (Sahgal and Agnihotri, 1988; Rickford et al., 2015; Grogger et al., 2020), the failure of language technology to cope with dialect differences may create allocational harms that reinforce social hierarchies (Blodgett et al., 2020).
In this paper, we propose and evaluate learning-based approaches to recognize dialect features. We focus on Indian English, given the availability of domain expertise and labeled corpora for evaluation. First, we consider a standard multitask classification approach, in which a pretrained transformer (Vaswani et al., 2017) is fine-tuned to recognize a set of dialect features. The architecture can be trained from two possible sources of supervision: (1) thousands of labeled corpus examples, or (2) a small set of minimal pairs, which are hand-crafted examples designed to highlight the key aspects of each dialect feature (as in the "typical example" field of Figure 1). Because most dialects have little or no labeled data, the latter scenario is more realistic. We also consider a multitask architecture that learns across multiple features by encoding the feature names, similar to recent work on few-shot or zero-shot multitask learning (Logeswaran et al., 2019; Brown et al., 2020).
In Sections 4 and 5, we discuss empirical evaluations of these models. Our main findings are:
• It is possible to detect individual dialect features: several features can be recognized with reasonably high accuracy. Our best models achieve a macro-AUC of .848 across ten grammatical features for which a large test set is available.
• This performance can be obtained by training on roughly five minimal pairs per feature. Minimal pairs are significantly more effective for training than a comparable number of corpus examples.
• Dialect feature recognizers can be used to rank documents by their density of dialect features, enabling within-dialect density computation for Indian English and accurate classification between Indian and U.S. English.

Data and Features of Indian English
We develop methods for detecting 22 dialect features associated with Indian English. Although India has over 125 million English speakers, making it the world's second-largest English-speaking population, there is relatively little NLP research focused on Indian English. Our methods are not designed exclusively for specific properties of Indian English; many of the features that are associated with Indian English are also present in other dialects of English. We use two sources of data in our study: an annotated corpus ( § 2.1) and a dataset of minimal pairs ( § 2.2). For evaluation, we use corpus annotations exclusively. The features are described in Table 1, and our data is summarized in Table 2.

Corpus Annotations
The International Corpus of English (ICE; Greenbaum and Nelson, 1996) is a collection of corpora of world varieties of English, organized primarily by the national origin of the speakers/writers. We focus on annotations of spoken dialogs (S1A-001 to S1A-090) from the Indian English subcorpus (ICE-India). The ICE-India subcorpus was chosen in part because it is one of the few corpora with large-scale annotations of dialect features. To contrast Indian English with U.S. English ( § 4), we use the Santa Barbara Corpus of Spoken American English (Du Bois et al., 2000), which constitutes the ICE-USA subcorpus of spoken dialogs.
We work with two main sources of dialect feature annotations in the ICE-India corpus:
Lange features. The first set of annotations comes from Claudia Lange (2012), who annotated 10 features in 100 transcripts for an analysis of discourse-driven syntax in Indian English, such as topic marking and fronting. We use half of this data for training (50 transcripts, 9392 utterances) and half for testing (50 transcripts, 9667 utterances).
Extended features. To test a more diverse set of features, we additionally annotated 18 features on a set of 300 turns randomly selected from the conversational subcorpus of ICE-India, as well as 50 examples randomly selected from a secondary dataset of sociolinguistic interviews (Sharma, 2009) to ensure diverse feature instantiation. We selected our 18 features based on multiple criteria: 1) prevalence in Indian English based on the dialectology literature, 2) coverage in the data (we started out with a larger set of features and removed those with fewer than two occurrences), and 3) diversity of linguistic phenomena.

Minimal Pairs
For each of the 22 features in Table 1, we created a small set of minimal pairs. The pairs were created by first designing a short example that demonstrated the feature, and then manipulating the example so that the feature is absent. This "negative" example captures the envelope of variation for the feature, demonstrating a site at which the feature could be applied (Labov, 1972). For most features, each minimal pair contains exactly one positive and one negative example. However, in some cases where more than two variants are available for an example (e.g., for the feature INVARIANT TAG (isn't it, no, na)), we provide multiple positive examples to illustrate different variants. For Lange's set of 10 features, we provide a total of 113 unique examples; for the 18 extended features, we provide a set of 208 unique examples, roughly split equally between positives and negatives. The complete list of minimal pairs is included in Appendix D.

Models and training
We train models to recognize dialect features by fine-tuning the BERT-base uncased transformer architecture (Devlin et al., 2019). We consider two strategies for constructing training data, and two architectures for learning across multiple features.

Sources of supervision
We consider two possible sources of supervision:
Minimal pairs. We apply a simple procedure to convert minimal pairs into training data for classification. The positive part of each pair is treated as a positive instance for the associated feature, and the negative part is treated as a negative instance. Then, to generate more data, we also include elements of other minimal pairs as examples for each feature: for instance, a positive example of the RESUMPTIVE OBJECT PRONOUN feature would be a negative example for FOCUS only, unless the example happened to contain both features (this was checked manually). In this way, we convert the minimal pairs into roughly 113 examples per feature for Lange's features and roughly 208 examples per feature for the extended features. The total number of unique surface forms is still 113 and 208, respectively. Given the lack of labeled data for most dialects of the world, having existing minimal pairs or collecting a small number of minimal pairs is the most realistic data scenario.
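This conversion can be sketched as follows. The feature names and sentences below are illustrative rather than drawn from the paper's data, and the manual co-occurrence check is represented as an explicit (hypothetical) set of verified (feature, text) pairs.

```python
def build_training_data(minimal_pairs, cooccurring=None):
    """Convert minimal pairs into per-feature classification data.

    minimal_pairs: {feature: {"pos": [texts], "neg": [texts]}}
    cooccurring: set of (feature, text) pairs manually verified to contain
        that feature even though the text was written for another feature.
    Returns {feature: [(text, label), ...]} where every unique text becomes
    an instance for every feature (usually a negative one).
    """
    cooccurring = cooccurring or set()
    all_texts = {t for pair in minimal_pairs.values()
                 for side in pair.values() for t in side}
    data = {}
    for feature, pair in minimal_pairs.items():
        positives = set(pair["pos"])
        data[feature] = [
            (text, 1 if text in positives or (feature, text) in cooccurring else 0)
            for text in all_texts
        ]
    return data

# Two illustrative features with one minimal pair each.
pairs = {
    "FOCUS only": {"pos": ["he comes here only"],
                   "neg": ["he comes right here"]},
    "INVARIANT TAG": {"pos": ["he is coming, isn't it"],
                      "neg": ["he is coming, isn't he"]},
}
data = build_training_data(pairs)
```

Each of the four unique sentences now serves as a labeled instance for both features, which is how 113 surface forms yield roughly 113 examples per feature.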
Corpus annotations. When sufficiently dense annotations are available, we can train a classifier based on these labeled instances. We use 50 of the ICE-India transcripts annotated by Lange, which consist of 9392 labeled examples (utterances) per feature. While we are lucky to have such a large resource for the Indian English dialect, this high-resource data scenario is rare.

Architectures
We consider two classification architectures:
Multihead. In this architecture, which is standard for multitask classification, we estimate a linear prediction head for each feature, which is simply a vector of weights. This is a multitask architecture, because the vast majority of model parameters, from the input through the deep BERT stack, remain shared among dialect features. The prediction head is then multiplied by the BERT embedding for the [CLS] token to obtain a score for a feature's applicability to a given instance.
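The scoring step of the multihead architecture can be sketched as below. The [CLS] embedding and the per-feature head weights are toy stand-ins, not real BERT parameters; in practice the embedding comes from the shared fine-tuned encoder.

```python
import math

def score_features(cls_embedding, heads):
    """Apply one linear prediction head (weights, bias) per dialect feature
    to a shared [CLS] embedding; a sigmoid turns each logit into a score."""
    scores = {}
    for feature, (weights, bias) in heads.items():
        logit = sum(w * x for w, x in zip(weights, cls_embedding)) + bias
        scores[feature] = 1.0 / (1.0 + math.exp(-logit))
    return scores

# Toy 4-dimensional "embedding" and two hypothetical feature heads.
cls = [0.5, -1.0, 0.25, 2.0]
heads = {
    "FOCUS only": ([1.0, 0.0, 0.0, 1.0], -1.0),
    "COPULA OMISSION": ([0.0, 1.0, -1.0, 0.0], 0.0),
}
scores = score_features(cls, heads)
```

Only the head vectors are feature-specific; everything below them is shared, which is what makes this a multitask model.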

DAMTL.
Due to the few-shot nature of our prediction task, we also consider an architecture that attempts to exploit the natural language descriptions of each feature. This is done by concatenating the feature description to each element of the minimal pair. The instance is then labeled for whether the feature is present. This construction is shown in Figure 2. Prediction is performed by learning a single linear prediction head on the [CLS] token. We call this model description-aware multitask learning, or DAMTL.
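The DAMTL input construction can be sketched as follows. The feature descriptions and the literal "[SEP]" delimiter are illustrative; in practice the BERT tokenizer inserts the special tokens, and a single shared head makes one binary prediction per (description, example) pair.

```python
def damtl_inputs(example, feature_descriptions, labels):
    """Pair each feature description with the example, yielding one
    (input text, label) row per feature for a single binary classifier."""
    return [
        (f"{desc} [SEP] {example}", labels.get(name, 0))
        for name, desc in feature_descriptions.items()
    ]

# Hypothetical descriptions for two features.
descriptions = {
    "FOCUS only": "the word 'only' is used after a word to emphasize it",
    "COPULA OMISSION": "the verb 'to be' is omitted",
}
rows = damtl_inputs("he running now", descriptions, {"COPULA OMISSION": 1})
```

Because the feature identity is carried by its description in the input text, a single prediction head can serve all features.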
Model details. Both architectures are built on top of the BERT-base uncased model, which we fine-tune by cross-entropy for 500 epochs (due to the small size of the training data) using the Adam optimizer (Kingma and Ba, 2014), a batch size of 32, and a learning rate of 10⁻⁵, warmed up over the first 150 epochs. Annotations of dialect features were not used for hyperparameter selection. Instead, the hyperparameters were selected to maximize the discriminability between corpora of Indian and U.S. English, as described in § 5.2. All models trained in less than two hours on a pod of four v2 TPU chips, with the exception of DAMTL on corpus examples, which required up to 18 hours.

Regular Expressions
In dialectology, regular expression pattern matching is the standard tool for recognizing dialect features (e.g., Nerbonne et al., 2011). We construct regular expression baselines for the five features where such patterns were available; the expressions are listed in Appendix A.
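As an illustration, a pattern for the INVARIANT TAG feature might look like the following. This is a hypothetical sketch, not one of the expressions from Table 7, and it shows the characteristic weakness of such baselines: it fires on any utterance-final "no"/"na"/"isn't it", regardless of whether it actually functions as a tag.

```python
import re

# Illustrative pattern: utterance-final "isn't it", "no", or "na",
# optionally followed by a question mark.
INVARIANT_TAG = re.compile(r"\b(isn't it|no|na)\s*\??\s*$", re.IGNORECASE)

def has_invariant_tag(utterance):
    """True if the utterance ends in one of the invariant tag variants."""
    return bool(INVARIANT_TAG.search(utterance))
```

Canonical tags such as "isn't he" correctly fail to match, but standard uses of final "no" (e.g., an answer to a question) would be false positives.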

Results on Dialect Feature Detection
In this section, we present results on the detection of individual dialect features. Using the features shown in Table 1, we compare supervision sources (corpus examples versus minimal pairs) and classification architectures (multihead versus DAMTL) as described in § 3. To avoid tuning a threshold for detection, we report area under the ROC curve (ROC-AUC), which has a value of .5 for random guessing and 1 for perfect prediction.
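ROC-AUC can be understood through its rank interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal stdlib sketch of that interpretation (not the evaluation code used for the experiments):

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example is scored higher; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A detector that scores every instance identically gets .5 (random guessing), and one that ranks all positives above all negatives gets 1.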

Results on Lange Data and Features
We first consider the 10 syntactic features from Lange (2012), for which we have large-scale annotated data: the 100 annotated transcripts from the ICE-India corpus are split 50/50 into training and test sets. Notably, minimal pairs provide competitive supervision despite requiring far less annotation effort, and such low-resource situations are by far the most common data scenario among the dialects of the world.
The multihead architecture outperforms DAMTL on both corpus examples and minimal pairs. In an ablation, we replaced the feature descriptions with non-descriptive identifiers such as "Feature 3". This reduced the Macro-AUC to .80 with corpus examples, and to .76 with minimal pairs (averaged over five random seeds). We also tried longer feature descriptions, but this did not improve performance.
Unsurprisingly, the lexical features (e.g., FOCUS itself) are easiest to recognize. The more syntactic features (e.g., COPULA OMISSION, RESUMPTIVE OBJECT PRONOUN) are more difficult, although some movement-based features (e.g., LEFT DISLOCATION, RESUMPTIVE SUBJECT PRONOUN) can be recognized accurately.
Qualitative model comparison. We conducted a qualitative comparison of three models: regular expressions and two versions of the multihead model, one trained on corpus examples and another trained on minimal pairs. Table 4 includes illustrative examples from the Lange data and features where the models make different predictions. We find that the minimal pair model is better able to account for rare cases (e.g., use of non-focus "only" in Example 1), likely because it was trained on a small, carefully selected set of examples illustrating positives and negatives. Both multihead models are able to account for disfluencies and restarts, in contrast to regular expressions (Example 2). Our analysis shows that several model errors are accounted for by difficult examples (Example 3: "is there" followed by "isn't"; Example 6: restart mistaken for left dislocation) or by the lack of contextual information available to the model (Examples 4 and 7: truncated examples). Please see Appendix B for more details and random samples of model predictions.
Learning from fewer corpus examples. The minimal pair annotations consist of 113 examples; in contrast, there are 9392 labeled corpus examples, requiring far more effort to create. We now consider the situation when the amount of labeled data is reduced, focusing on the Lange features (for which labeled training data is available). As shown in Figure 3, even 5000 labeled corpus examples do not match the performance of training on roughly 5 minimal pairs per feature.
Corpus examples stratified by feature. One reason that subsampled datasets yield weaker results is that they lack examples for many features. To enable a more direct comparison of corpus examples and minimal pairs, we created a set of "stratified" datasets of corpus examples, such that the number of positive and negative examples for each feature exactly matches the minimal pair data. Averaged over ten such random stratified samples, the multihead model achieves a Macro-AUC of .790 (σ = 0.029), and DAMTL achieves a Macro-AUC of .722 (σ = .020). These results are considerably worse than training on an equivalent number of minimal pairs, where the multihead model achieves a Macro-AUC of .848 and DAMTL achieves .783. This demonstrates the utility of minimal pairs over corpus examples for learning to recognize dialect features.

Results on Extended Feature Set
Next, we consider the extended features, for which we have sufficient annotations for testing but not training (Table 1). Here we compare the DAMTL and multihead models, using minimal pair data in both cases. As shown in Table 5, performance on these features is somewhat lower than on the Lange features, and for several features, at least one of the recognizers does worse than chance: DIRECT OBJECT PRO-DROP, EXTRANEOUS ARTICLE, MASS NOUNS AS COUNT NOUNS. These features seem to require deeper syntactic and semantic analysis, which may be difficult to learn from a small number of minimal pairs. On the other extreme, features with a strong lexical signature are recognized with high accuracy: GENERAL EXTENDER and all, FOCUS itself, FOCUS only. These three features can also be recognized by regular expressions. However, for a number of other features, it is possible to learn a fairly accurate recognizer from just five minimal pairs.

Dialect Density Measures

Speakers vary in how densely they employ the features of a dialect (Benor, 2010). This necessitates a more nuanced description for speakers and texts than a discrete dialect category. Following prior work (e.g., Van Hofwegen and Wolfram, 2010), we construct dialect density measures (DDMs) from feature detectors by counting the predicted number of features in each utterance and dividing by the number of tokens. For the learning-based feature detectors (minimal pairs and corpus examples), we include partial counts from the detection probability; for the regular expression detectors, we simply count the number of matches and divide by the number of tokens. In addition, we construct a DDM based on a document classifier: we train a classifier to distinguish Indian English from U.S. English, and then use its predictive probability as the DDM. The classifier is trained by fine-tuning BERT, using a prediction head on the [CLS] token. These DDMs are compared on two tasks: distinguishing Indian and U.S. English, and correlation with the density of expert-annotated features.
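The feature-based DDM can be sketched as follows; the per-feature detection probabilities and the token count below are illustrative stand-ins for real model outputs.

```python
def dialect_density(feature_probs, num_tokens):
    """Dialect density measure (DDM): predicted feature count per token.
    For learned detectors the probabilities act as partial counts; for
    regular expression detectors, feature_probs would instead hold
    integer match counts."""
    return sum(feature_probs) / num_tokens

# Hypothetical detection probabilities for three features
# in a 10-token utterance.
density = dialect_density([0.9, 0.5, 0.1], 10)
```

Normalizing by length keeps the measure comparable across utterances and transcripts of different sizes.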

Ranking documents by dialect density
One application of dialect feature recognizers is to rank documents based on their dialect density, e.g., to identify challenging cases for evaluating downstream NLP systems, or for dialectology research. We correlate each dialect density measure against the density of expert-annotated features from Lange (2012), both measured at the transcript level, and report the Spearman rank correlation ρ.
As shown in Table 6, the document classifier performs poorly: learning to distinguish Indian and U.S. English offers no information on the density of Indian dialect features, suggesting that the model is attending to other information, such as topics or entities. The feature-based model trained on labeled examples performs best, which is unsurprising because it is trained on the same type of features that it is now asked to predict. Performance is weaker when the model is trained from minimal pairs. Minimal pair training is particularly helpful on rare features, but offers far fewer examples on the high-frequency features, which in turn dominate the DDM scores on test data. Regular expressions perform well on this task, because we happen to have regular expressions for the high-frequency features, and because the precision issues are less problematic in aggregate when the DDM is not applied to non-dialectal transcripts.

Dialect Classification
Another application of dialect feature recognizers is to classify documents or passages by dialect (Dunn, 2018). This can help to test the performance of downstream models across dialects, assessing dialect transfer loss (e.g., Blodgett et al., 2016), as well as identifying data of interest for manual dialectological research. We formulate a classification problem using the ICE-India and the Santa Barbara Corpus (ICE-USA). Each corpus is divided into equal-size training and test sets. The training corpus was also used for hyperparameter selection for the dialect feature recognition models, as described in § 3.2.
The dialect classifier was constructed by building on the components from § 5.1. For the test set, we measure the D′ ("d-prime") statistic (Macmillan and Creelman, 1991),

D′ = (μ_IN − μ_US) / √(½ (σ²_IN + σ²_US)),   (1)

where μ and σ² denote the mean and variance of the metric over the Indian English and U.S. English test sets. This statistic, which can be interpreted similarly to a Z-score, quantifies the extent to which a metric distinguishes between the two populations. We also report classification accuracy; lacking a clear way to set a threshold, for each classifier we balance the number of false positives and false negatives. As shown in Table 6, both the document classifier and the corpus-based feature detection model (trained on labeled examples) achieve high accuracy at discriminating U.S. and Indian English. The D′ discriminability score is higher for the document classifier, which is trained on a cross-entropy objective that encourages making confident predictions. Regular expressions suffer from low precision because they respond to surface cues that may be present in U.S. English even when the dialect feature is not present (e.g., the word "only", the phrase "is there").
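The D′ computation can be sketched as below, assuming the standard definition in which the two groups' (population) variances are pooled with equal weight; the score lists are hypothetical DDM values.

```python
import statistics

def d_prime(scores_a, scores_b):
    """D' discriminability: difference of group means, scaled by the
    root-mean-square of the two groups' population standard deviations."""
    mu_a, mu_b = statistics.fmean(scores_a), statistics.fmean(scores_b)
    var_a = statistics.pvariance(scores_a)
    var_b = statistics.pvariance(scores_b)
    return (mu_a - mu_b) / ((var_a + var_b) / 2) ** 0.5

# Hypothetical DDM scores: group A one pooled standard deviation above B.
score = d_prime([0.0, 2.0], [-1.0, 1.0])
```

As with a Z-score, a D′ of 1 means the group means differ by one pooled standard deviation.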

Related Work
Dialect classification. Prior work on dialect in natural language processing has focused on distinguishing between dialects (and closely related languages). For example, the VarDial 2014 shared task required systems to distinguish between nation-level language varieties, such as British versus U.S. English, as well as closely related language pairs such as Indonesian versus Malay (Zampieri et al., 2014); later evaluation campaigns expanded this set of tasks.
Dialect features. Jørgensen et al. (2015) designed lexical patterns to identify non-standard spellings that match known phonological variables from AAVE (e.g., sholl 'sure'), demonstrating the presence of these variables in social media posts from regions with high proportions of African Americans. Blodgett et al. (2016) use the same geography-based approach to test for phonological spellings and constructions corresponding to syntactic variables such as habitual be; Hovy et al. (2015) show that a syntactic feature of Jutland Danish can be linked to the geographical origin of product reviews. These approaches have focused mainly on features that could be recognized directly from surface forms, or in some cases, from part-of-speech (POS) sequences. In contrast, we show that it is possible to learn to recognize features from examples, enabling the recognition of features for which it is difficult or impossible to craft surface or POS patterns. Our use of minimal pairs shares the goal of data efficiency with few-shot learning, but is methodologically closer to probing work that uses minimal pairs to represent specific linguistic features.

Conclusion
We introduce the task of dialect feature detection and demonstrate that it is possible to construct dialect feature recognizers using only a small number of minimal pairs: in most cases, just five positive and negative examples per feature. This makes it possible to apply computational analysis to the many dialects for which labeled data does not exist. Future work will extend this approach to multiple dialects, focusing on cases in which features are shared across two or more dialects. This lays the groundwork for the creation of dialect-based "checklists" (Ribeiro et al., 2020) to assess the performance of NLP systems across the diverse range of linguistic phenomena that may occur in any given language.

Ethical Considerations
Our objective in building dialect feature recognizers is to aid developers and researchers to effectively benchmark NLP model performance across and within different dialects, and to assist social scientists and dialectologists studying dialect use. The capability to detect dialectal features may enable developers to test for and mitigate any unintentional and undesirable biases in their models towards or against individuals speaking particular dialects. This is especially important because dialect density has been documented to correlate with lower socioeconomic status (Sahgal and Agnihotri, 1988). However, this technology is not without its risks. As some dialects correlate with ethnicities or countries of origin, there is a potential dual use risk of the technology being used to profile individuals. Dialect features could also be used as predictors in downstream tasks; as with other proxies of demographic information, this could give the appearance of improving accuracy while introducing spurious correlations and imposing disparate impacts on disadvantaged groups. Hence we recommend that developers of this technology consider downstream use cases, including malicious use and misuse, when assessing the social impact of deploying and sharing this technology.
The focus on predefined dialect features can introduce a potential source of bias if the feature set is oriented towards the speech of specific subcommunities within a dialect. However, analogous issues can arise in fully data-driven approaches, in which training corpora may also be biased towards subcommunities of speakers or writers. The feature-based approach has the advantage of making any such bias easier to identify and correct.

A Regular Expressions

Table 7 shows the regular expressions that we used for the five features where such patterns were available.

B Sample Outputs
The examples below represent a random sample of the multihead models' outputs for Lange's features, comparing the model trained on corpus examples (CORPUS) to the one trained on minimal pairs (MINPAIR). We show true positives (TP), false positives (FP), and false negatives (FN), randomly sampling three examples for each output type (TP, FP, FN) and model (BOTH, CORPUS only, MINPAIR only). Our manual inspection shows a few errors in the human annotation by Lange, and that certain false positives should in fact be true positives, especially for FOCUS only. We highlight such examples in green. Among the remaining false positives and false negatives, a large proportion of errors can be explained by contextual information that is not available to the models. For example, without context it is ambiguous whether "we possess only" is an example of FOCUS only. Inspection of the context shows that it is a truncated utterance representing a standard use of only, hence it is correctly characterized as a false positive. Another source of confusion for the model is missing punctuation. For example, "Both girls I have never left them alone till now" could be construed as OBJECT FRONTING with RESUMPTIVE OBJECT PRONOUN. However, in the original context, the example consists of multiple sentences: "Two kids. Both girls. I have never left them alone till now." We removed punctuation from examples, since in many cases automatic ASR models do not produce punctuation either. However, this example demonstrates that punctuation can provide valuable information about clause and phrase boundaries, and should be included if possible.