The State and Fate of Linguistic Diversity and Inclusion in the NLP World

Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the “language agnostic” status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.


The Questions
Languages X and Y are the official languages of two different countries; they have around 29M and 18M native speakers, and 2M and 5.5K Wikipedia articles, respectively. X is syntactically quite similar to English, though it uses diminutives and has grammatical gender. Y, on the other hand, has a different word order from English and a rare typological feature: it is generally a head-final language, but its noun phrases are head-initial. It also features full and partial reduplication. 69 items on LDC and ELRA contain data in X, whereas for Y there are only 2 items. X boasts some of the best online machine translation systems, whereas Y is supported by very few online MT systems, and those offer far inferior translation quality. Figure 1 shows how many papers mention X and Y, across the years. As you can see, while X has a steady and growing trend of research, our community has been mostly oblivious to Y, until recently when some of the zero-shot learning papers have started mentioning it. Can you guess what X and Y are? Regardless of whether you can guess the exact answer, most NLP researchers surely know of (and might even speak) several languages which are in the same boat as X, languages which have a large amount of resources and therefore access to the benefits of the current NLP breakthroughs, and languages like Y, which lack resources and consequently the attention of the NLP community, despite having similar speaker base sizes and typologically diverse features.
You probably have come across the issue of extremely skewed distribution of resources across the world's languages before. You might also be aware of the fact that most of our NLP systems, which are typically declared language agnostic, are not truly so (Bender, 2011). The handful of languages on which NLP systems are trained and tested are often related and from the same geography, drawn from a few dominant language families, leading to a typological echo-chamber. As a result, a vast majority of typologically diverse linguistic phenomena are never seen by our NLP systems (Ponti et al., 2019).
Nevertheless, it would be prudent to re-examine these issues in the light of recent advances in deep learning. Neural systems, on the one hand, require a lot more data for training than rule-based or traditional ML systems, creating a bigger technological divide between the Xs and Ys; yet, some of the most recent techniques on zero-shot learning of massively multilingual systems (Devlin et al., 2019; Conneau and Lample, 2019; Aharoni et al., 2019; Artetxe and Schwenk, 2019) bridge this gap by obliterating the need for large labeled datasets in all languages. Instead, they need only large unlabeled corpora across languages and labeled data in only some languages. Assuming that this approach can be taken to its promising end, how does the fate of different languages change?
We break down this complex prescient question into the following more tractable and quantifiable questions on Linguistic Diversity and Inclusion:
1. How many resources, labeled and unlabeled, are available across the World's languages? How does this distribution correlate to their number of native speakers? What can we expect to achieve today and in the near future for these languages?
2. Which typological features have current NLP systems been exposed to, and which typological features mostly remain unexplored by systems because we have hardly created any resources and conducted data-driven research in those languages?
3. As a community, how inclusive has ACL been in conducting and publishing research on various languages? In the 1980s and early 90s, when large scale datasets were not the prime drivers of research, was the linguistic diversity of ACL higher than what it has been in the 2000s and 2010s? Or has ACL become more inclusive and diverse over the years?
4. Does the amount of resource available in a language influence the research questions and the venue of publication? If so, how?
5. What role does an individual researcher, or a research community, have to play in bridging the linguistic-resource divide?
In this paper, we take a multi-pronged quantitative approach to study and answer the aforementioned questions, presented in order, in the following five sections. One of the key findings of our study, to spill the beans a bit, is that the languages of the World can be broadly classified into 6 classes based on how much and what kind of resources they have; the languages in each class have followed a distinct and different trajectory in the history of ACL, and some of the hitherto neglected classes of languages have more hope of coming to the forefront of NLP technology with the promised potential of zero-shot learning.

The Six Kinds of Languages
In order to summarize the digital status and 'richness' of languages in the context of data availability, we propose a taxonomy based on the number of language resources which exist for different languages. We frame the rest of our analyses based on this taxonomy and use it to emphasize the existence of such resource disparities.

Features
We design this taxonomy using two feature axes: number of unlabeled resources vs. number of labeled resources. Previous methods have mostly relied on supervised learning techniques which require labeled corpora. However, the advent of transfer learning methods has boosted the importance of unlabeled data: massively multilingual models such as mBERT use Wikipedia for pre-training, and then fine-tune on downstream NLP tasks. These features are suitable because current NLP research is predominantly data-driven, and language inclusion depends on how much labeled or unlabeled data is available. We believe these features are sufficient for the taxonomical design as the required metadata is consistently available across all languages, whereas features such as the number of hours required to collect data aren't available.
We treat each data resource as a fundamental unit, based on the assumption that the collection of one unit is proportional to a certain extent of effort being invested towards the resource improvement of that language. Moreover, this feature discretization is unambiguous and concrete. Other units such as the total number of datapoints across datasets can be misleading because different NLP tasks have different data requirements. For example, while Machine Translation (MT) models require datapoints to the order of millions (Koehn and Knowles, 2017) to perform competitively, competent models in Question Answering require around 100 thousand datapoints (Rajpurkar et al., 2016). Moreover, the unit of datapoints varies across different technologies (e.g. Speech data measured in hours, MT data measured in number of parallel sentences).
Figure 2: The size of the gradient circle represents the number of languages in the class. The color spectrum VIBGYOR represents the total speaker population size from low to high. Bounding curves are used to demonstrate the points covered by each language class.

Repositories
We focus our attention on the LDC catalog (https://catalog.ldc.upenn.edu/) and the ELRA Map (http://catalog.elra.info/en-us/) for labeled datasets. Although there are other repositories of data available online, we found it practical to treat these organized collections as a representation of labeled dataset availability. This way, we look at standardized datasets that have established data quality and consistency, and which have been used in prior work. There are strong efforts such as PanLex (Kamholz et al., 2014), which is a large lexical database of a wide range of languages being used for a lexical translator, and OLAC (Simons and Bird, 2003), which contains a range of information for different languages (e.g. text collections, audio recordings, and dictionaries). However, keeping within the purview of NLP datasets used in *CL conferences, we decided to focus on popular repositories such as the above-mentioned.
We look at Wikipedia pages as a measure for unlabeled data resources. With regard to language technologies, Wikipedia pages represent a strong source of unsupervised training data which is freely and easily accessible. From the perspective of digital resource availability, they are a comprehensive source of factual information and are accessed by a large, diverse set of online users.

Language Classes
Figure 2 is a visualization of the taxonomy. We find a set of distinct partitions which can be used to categorize languages into 6 unique positions in the language resource 'race':

0 - The Left-Behinds
These languages have been and are still ignored in the aspect of language technologies. With exceptionally limited resources, it will be a monumental, probably impossible effort to lift them up in the digital space. Unsupervised pre-training methods only make the 'poor poorer', since there is virtually no unlabeled data to use.

1 - The Scraping-Bys
With some amount of unlabeled data, there is a possibility that they could be in a better position in the 'race' in a matter of years. However, this task will take a solid, organized movement that increases awareness about these languages, and also sparks a strong effort to collect labeled datasets for them, seeing as they have almost none.

2 - The Hopefuls
With light at the end of the tunnel, these languages still fight on with their gasping breath. A small set of labeled datasets has been collected for these languages, meaning that there are researchers and language support communities which strive to keep them alive in the digital world. Promising NLP tools can be created for these languages a few years down the line.

3 - The Rising Stars
Unsupervised pre-training has been an energy boost for these languages. With a strong web presence, there is a thriving cultural community online for them. However, they have been let down by insufficient efforts in labeled data collection. With the right steps, these languages can be very well off if they continue to ride the 'pre-training' wave.
[...] languages, reaping benefit from each state-of-the-art NLP breakthrough.
Some more information about the taxonomy is shown in Table 1. We also take 10 languages, and annotate their positions in Figure 3.

Findings
On your marks. As can be seen in Figure 3, the Winners take pole position in all rankings, and Class 0 languages remain 'out of the race' with no representation in any resource. The Wikipedia distribution seems to be more fair for classes 1, 2, and 3 when compared to classes 4 and 5, whereas the Web distribution has a clear disparity.
Talk ain't cheap. Looking at Table 1, we see that Class 0 contains the largest section of languages and represents 15% of all speakers across classes. Although there is a large chunk of speakers who converse in Class 5 languages, the lack of technological inclusion for different languages could draw native speakers away from Class 0 languages and towards Class 5, exacerbating the disparity.

Typology
Linguistic typology is a field which involves the classification of languages based on their structural and semantic properties. Large-scale efforts have led to the creation of a database of typological features (Dryer and Haspelmath, 2013). Such documentation becomes important as there are barely any other classifications of similar scale. In the context of NLP research, there has been work indicating the effectiveness of injecting typological information to guide the design of models (Ponti et al., 2019). Also, transfer learning from resource-rich to resource-poor languages has been shown to work better if the respective languages contain similar typological features (Pires et al., 2019). We look at how skewed language resource availability leads to an under-representation of certain typological features, which may in turn cause zero-shot inference models to fail on NLP tasks for certain languages.
We look at the WALS data (Dryer and Haspelmath, 2013), which contains typological features for 2679 languages. There are a total of 192 typological features, with an average of 5.93 categories per feature. We take the languages in classes 0, 1, 2, all of which have limited or no data resources as compared to 3, 4, 5 and look at how many categories, across all features, exist in classes 0, 1, 2 but not 3, 4, 5. This comes to a total of 549 out of 1139 unique categories, with an average of 2.86 categories per feature being ignored. Typological features with the most and least 'ignored' categories are shown in Table 2.
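The counting step described above can be sketched as a set difference per feature. This is only an illustrative sketch: the dictionary layout, the function name `ignored_categories`, and the toy feature and language names are assumptions made for this example and are not taken from WALS or from the paper's own code.

```python
from collections import defaultdict


def ignored_categories(feature_values, language_class):
    """Return {feature: categories seen in classes 0-2 but never in classes 3-5}.

    feature_values: {language: {feature_id: category}}  (hypothetical layout)
    language_class: {language: taxonomy class in 0..5}
    """
    low = defaultdict(set)   # categories attested in classes 0, 1, 2
    high = defaultdict(set)  # categories attested in classes 3, 4, 5
    for lang, feats in feature_values.items():
        cls = language_class.get(lang)
        if cls is None:
            continue  # language not placed in the taxonomy
        bucket = low if cls <= 2 else high
        for feat, cat in feats.items():
            bucket[feat].add(cat)
    result = {}
    for feat, cats in low.items():
        only_low = cats - high[feat]
        if only_low:
            result[feat] = only_low
    return result


# Toy data with made-up languages, features, and categories.
toy_values = {
    "lang_a": {"F1": "x", "F2": "p"},
    "lang_b": {"F1": "y", "F2": "q"},
    "lang_c": {"F1": "x", "F2": "q"},
}
toy_classes = {"lang_a": 5, "lang_b": 0, "lang_c": 1}
ignored = ignored_categories(toy_values, toy_classes)
total_ignored = sum(len(cats) for cats in ignored.values())
```

Summing the sizes of the returned sets gives the total number of 'ignored' categories, and dividing by the number of features gives the per-feature average reported in the text.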
To get an idea of what these typological 'exclusions' mean in the context of modern multilingual methods, we look at the specific languages which contain these 'excluded' categories in the respective features, and compare their performances in similarity search, from the results of Artetxe and Schwenk (2019). Table 3 shows some examples of how 'ignored' features have been difficult to deal with even with joint training on all languages.

Findings
Far-reaching repercussions. The most 'ignored' feature in Table 2, 144E (Multiple Negative Constructions in SVO Languages), is a rare feature, existing in only 38 languages across the world. These languages, however, are from various regions (e.g. Wolof, Icelandic, and Kilivila). Language tools in all these areas can be adversely affected without sufficient typological representation. On the other hand, common features such as 83A (Order of Object and Verb) are well represented with definite feature values for 1321 languages, ranging from English to Mundari.
Does it run in the family? Amharic, in Table 3, is the second most spoken language in the Semitic family after Arabic (which has 300M speakers). However, it has 9 'ignored' typological features, whereas Arabic has none. This is reflected in the error rate of English to Amharic (60.71), which is significantly worse than the 7.8 for English to Arabic.

Conference-Language Inclusion
NLP conferences have a huge impact on how language resources and technologies are constructed. Exciting research in venues such as ACL, EMNLP, and LREC has the ability to turn heads in both industry and government and has the potential to attract funds to a particular technology. Has the usage of a small set of resource-rich languages in such conferences led to a disparity, pushing the less represented to the bottom of the ladder in terms of research? We analyze the involvement of various languages in NLP research conferences over the years.

Dataset
The ACL Anthology Corpus (ACL-ARC) (Bird et al., 2008) is the most extensively used dataset for analyzing trends in NLP research. This dataset contains PDFs and parsed XMLs of Anthology papers. However, the latest versioned copy of ACL-ARC extends only to 2015, which makes it insufficient for analyzing trends in the most recent years. Moreover, paper data for non-ACL conferences such as LREC and COLING are absent from this dataset. In order to create a consistent data model, we augment this dataset by using Semantic Scholar's API and scraping the ACL Anthology itself. Thus, we gather a consolidated dataset for 11 conferences which are relevant in judging global trends in NLP research. These include ACL, NAACL, EMNLP, EACL, COLING, LREC, CONLL, Workshops (WS) (all since 1990), SEMEVAL, TACL and CL Journals. We have attached the statistics of the dataset in Appendix A.

Language Occurrence Entropy
The primary step in measuring the language diversity and inclusion of a conference, and its progress, is to measure the usage of languages in that conference over multiple iterations. One way to do this is with frequency-based techniques, which measure the occurrence of languages in each iteration. However, such counts are not a unified measure that represents the nature of the language distribution with a single number. To this end, we use entropy as our metric to measure the language inclusivity of each conference. It efficiently captures the skew in the distribution of languages, thereby making the disparity in language usage clearer. The language occurrence entropy is calculated as follows: for a conference $c$ held in year $y$ having $P$ papers, there exists a binary matrix $\{M_{P \times L}\}_{c,y}$ where $M_{ij}$ is 1 if the $i$-th paper ($\in P$) mentions the $j$-th language ($\in L$). Then the entropy $\{S\}_{c,y}$ is:
$$\{S_j\}_{c,y} = \sum_{i=1}^{P} \{M_{ij}\}_{c,y}$$
$$\{S'_j\}_{c,y} = \frac{\{S_j\}_{c,y}}{\sum_{j=1}^{L} \{S_j\}_{c,y}}$$
$$\{S\}_{c,y} = -\sum_{j=1}^{L} \{S'_j\}_{c,y} \log \{S'_j\}_{c,y}$$
where $\{S_j\}_{c,y}$ is an array of length $L$ accounting for the number of papers that mention each specific language, and $\{S'_j\}_{c,y}$ is its normalization, done in order to get a probability distribution for calculating the entropy. In short, the higher the entropy, the more spread out the distribution is over the languages. The more peaked or skewed the distribution is, the lower the entropy.
In Figure 4, we can observe the entropy S plotted for each c as a function of y.
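As a minimal sketch of the language occurrence entropy for a single conference iteration, the snippet below takes the binary paper-by-language matrix described above and returns one number. The function name, the use of the natural logarithm, and the convention of skipping zero-probability languages are assumptions of this example.

```python
import numpy as np


def language_occurrence_entropy(M):
    """Entropy of the language distribution for one conference iteration.

    M: binary array of shape (P, L); M[i, j] = 1 if paper i mentions language j.
    """
    M = np.asarray(M, dtype=float)
    counts = M.sum(axis=0)           # papers mentioning each language
    total = counts.sum()
    if total == 0:
        return 0.0                   # no language mentioned at all
    probs = counts / total           # normalize to a probability distribution
    probs = probs[probs > 0]         # treat 0 * log(0) as 0
    return float(-(probs * np.log(probs)).sum())


# Toy matrices: four papers, four languages.
uniform = np.eye(4)                           # each paper mentions a different language
skewed = np.array([[1, 0, 0, 0]] * 4)         # every paper mentions the same language
entropy_uniform = language_occurrence_entropy(uniform)
entropy_skewed = language_occurrence_entropy(skewed)
```

The uniform toy case reaches the maximum entropy for four languages, while the skewed case collapses to zero, matching the interpretation that a more peaked distribution yields lower entropy.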

Class-wise Mean Reciprocal Rank
To quantify the extent of inclusion of language classes from our taxonomy in different conferences, we employ class-wise Mean Reciprocal Rank (MRR) as a metric. This helps in determining the standing of each class in a conference. If the rank of the $i$-th language ($\mathrm{rank}_i$) is determined by the frequency with which it is mentioned in papers of a particular conference, and $Q$ is the total number of queries, i.e., the number of languages in each class, then:
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$
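A small sketch of the class-wise MRR computation follows. It assumes that every language of interest appears in the mention counts (possibly with a count of zero) and that ties are broken alphabetically; neither detail is specified in the text, and the function and variable names are hypothetical.

```python
from collections import defaultdict


def classwise_mrr(mention_counts, language_class):
    """Mean reciprocal rank per taxonomy class for one conference.

    mention_counts: {language: number of papers mentioning it}
    language_class: {language: taxonomy class}
    """
    # Rank 1 is the most frequently mentioned language; ties broken by name (assumption).
    ordered = sorted(mention_counts, key=lambda lang: (-mention_counts[lang], lang))
    rank = {lang: position + 1 for position, lang in enumerate(ordered)}

    reciprocal = defaultdict(list)
    for lang, cls in language_class.items():
        if lang in rank:
            reciprocal[cls].append(1.0 / rank[lang])
    return {cls: sum(vals) / len(vals) for cls, vals in reciprocal.items()}


# Toy counts for four made-up languages in two classes.
toy_counts = {"l1": 10, "l2": 5, "l3": 1, "l4": 1}
toy_classes = {"l1": 5, "l2": 5, "l3": 0, "l4": 0}
mrr = classwise_mrr(toy_counts, toy_classes)
```

Inverting a class's MRR gives an average-rank style figure, so a class whose languages are rarely mentioned ends up with a small MRR and a large inverse.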

Findings
All-Inclusive. Looking at the combined trends, both the entropy plots and the MRR figures suggest that LREC and WS have been the most inclusive across all categories and have continued to be so over the years.
A ray of hope. With regard to the proceedings of ACL, EMNLP, NAACL, and LREC, we note a marked spike in entropy in the 2010s, which is absent in other conferences. This might be due to the increased buzz surrounding cross-lingual techniques.
The later the merrier. An interesting point to note is that conferences which started later have taken lessons from the past in matters of language inclusion, while the earlier established conferences have continued to maintain interest in a particular underlying theme of research which may or may not favour multilingual systems. This can be observed in COLING, ACL, EACL, and EMNLP (in order of their start dates).
Falling off the radar. The taxonomical hierarchy is fairly evident when looking at the MRR table (Table 4), with class 5 coming within rank 2/3 and class 0 being 'left-behind' with average ranks ranging from 600 to 1000. While the dip in ranks is more forgiving for conferences such as LREC and WS, it is more stark in CONLL, TACL, and SEMEVAL.

Entity Embedding Analysis
The measures discussed in the previous section signal variance in the acceptance of different languages at different NLP venues across time. However, there are usually multiple subtle factors which vanilla statistics fail to capture. Embeddings, on the other hand, have proven widely useful in NLP tasks as they are able to learn relevant signals directly from the data and uncover these rather complex nuances. To this end, we propose a novel approach to jointly learn the representations of conferences, authors and languages, which we collectively term entities. The proposed embedding method allows us to project these entities into the same space, enabling us to effectively reveal patterns revolving around them.

Model
We define the following model to jointly learn the embeddings of entities such that entities which have similar contextual distributions co-occur together. For example, for an author $A$ who works more extensively on language $L_i$ than $L_j$ and publishes more at conference $C_m$ than at conference $C_n$, the embedding of $A$ would be closer to $L_i$ than to $L_j$, and closer to $C_m$ than to $C_n$. Given an entity and a paper associated with the entity, the learning task of the model is to predict $K$ randomly sampled words from the title and the abstract of the paper. We select only the title and abstract, rather than the entire paper text, as they provide a concise signal with reduced noise. This model draws parallels to the Skipgram model of Word2Vec (Mikolov et al., 2013), where given an input word, the task is to predict the context around the word. The input entity and $K$ randomly sampled words in our case correspond to the input word and context in the Skipgram model. The goal of the model is to maximize the probability of predicting the random $K$ words, given the entity id as the input:
$$\frac{1}{M}\,\frac{1}{K} \sum_{j=1}^{M} \sum_{k=1}^{K} \sum_{i} p\left(w_k \mid E_{\langle i, P_j \rangle}\right)$$
where $E_{\langle i, P_j \rangle}$ is the entity $E_i$ which is associated with the $P_j$-th paper, $p$ is the probability of predicting the word $w_k$ out of the $K$ words sampled from the paper, and $M$ is the total number of papers in the dataset. To optimize for the above distribution, we define the typical SGD based learning strategy similar to Word2Vec (Mikolov et al., 2013). Figure 5 shows an outline of the model. The entity input layer has dimension equal to the total number of entities in the dataset ($E$). The hidden layer size is set to the desired embedding dimension ($N$). The output layer predicts words for the input entity and is of the same size as the vocabulary ($V$). The entities we learn are: (1) authors of the paper, (2) languages mentioned in the paper, (3) conference where the paper was accepted (e.g. ACL), and (4) the conference iteration (e.g. ACL'19). We describe the model detail and hyperparameter tuning in Appendix A.
Figure 5: Model architecture to learn entity embeddings. $W_{E \times N}$ is the weight matrix from the input layer (entity layer) to the hidden layer, and $W_{N \times V}$ is the weight matrix for the hidden layer to output layer computation. At the end of training, $W_{E \times N}$ is the matrix containing the embeddings of entities and $W_{N \times V}$ is the matrix containing the embeddings of words.
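To make the architecture concrete, here is a toy NumPy sketch of an entity-to-words model in the spirit of the description above. It is not the authors' implementation: it uses a full softmax rather than whatever sampling strategy the original training used, and the function name, default hyperparameters, and toy data are all assumptions chosen for illustration.

```python
import numpy as np


def train_entity_embeddings(pairs, n_entities, vocab_size, dim=8, k=2,
                            epochs=200, lr=0.1, seed=0):
    """pairs: list of (entity_id, word_ids drawn from a paper's title and abstract)."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.1, size=(n_entities, dim))    # entity embeddings (E x N)
    W_out = rng.normal(scale=0.1, size=(dim, vocab_size))   # word embeddings (N x V)
    for _ in range(epochs):
        for entity, words in pairs:
            # Sample K words from the paper text associated with this entity.
            targets = rng.choice(words, size=min(k, len(words)), replace=False)
            h = W_in[entity].copy()
            logits = h @ W_out
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            # Gradient of the summed negative log-likelihood w.r.t. the logits.
            grad_logits = len(targets) * probs
            np.subtract.at(grad_logits, targets, 1.0)
            grad_h = W_out @ grad_logits
            W_out -= lr * np.outer(h, grad_logits)
            W_in[entity] -= lr * grad_h
    return W_in, W_out


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy setup: entities 0 and 1 share the same words, entity 2 uses different ones.
toy_pairs = [(0, [0, 1]), (1, [0, 1]), (2, [2, 3])]
W_in, W_out = train_entity_embeddings(toy_pairs, n_entities=3, vocab_size=4)
sim_same = cosine(W_in[0], W_in[1])
sim_diff = cosine(W_in[0], W_in[2])
```

In this toy run, the two entities that co-occur with the same words should end up with more similar embeddings than the entity with a different vocabulary, which is the property the analysis relies on.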

Analysis
In order to better understand how languages are represented at different venues, we visualize the distribution of entity embeddings by projecting the generated embeddings into 2 dimensions using t-SNE (Maaten and Hinton, 2008) (as shown in Figure 6). For clarity, we only plot ACL, LREC, WS and CL among the conferences, and all languages from the taxonomy, except those in Class 0. We omit plotting Class 0 languages as their projections are noisy and scattered due to their infrequent occurrence in papers.
To understand the research contributions of individual authors or communities towards research in respective language classes, we leverage the distribution between author and language entities by computing a variation of the Mean Reciprocal Rank (MRR). We consider a language $L$, take the $K$ closest authors to $L$ using cosine distance, and then take the closest $M$ languages to each author. If $L$ is present in the closest languages of an author, then we take the rank of $L$ in that list, invert it, and average it over the $K$ authors. To compute this metric for a class of languages from the taxonomy, we take the mean of the MRR for all languages in that class. We fix $M$ to be 20, so as to understand the impact of the community when the number of languages remains unchanged. Table 5 shows the MRR of the various classes of languages. A higher value of this measure indicates a more focused community working on that particular language, rather than a diverse range of authors.
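The author-language MRR variant can be sketched as follows. The text does not say how an author is scored when the language is missing from that author's closest-language list; this sketch assumes a contribution of zero in that case. The function names and the toy 2-dimensional embeddings are hypothetical.

```python
import numpy as np


def _unit_rows(X):
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def author_language_mrr(lang_idx, lang_emb, author_emb, k=10, m=20):
    """MRR of language `lang_idx` over its k nearest authors (cosine similarity)."""
    langs = _unit_rows(lang_emb)
    authors = _unit_rows(author_emb)
    # k authors closest to the language.
    nearest_authors = np.argsort(-(authors @ langs[lang_idx]))[:k]
    reciprocal_ranks = []
    for a in nearest_authors:
        # m languages closest to this author.
        nearest_langs = np.argsort(-(langs @ authors[a]))[:m]
        hits = np.flatnonzero(nearest_langs == lang_idx)
        # Assumption: an author whose list omits the language contributes 0.
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))


# Toy embeddings: three languages and two authors in 2 dimensions.
toy_langs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_authors = np.array([[1.0, 0.1], [0.9, 0.5]])
toy_mrr = author_language_mrr(0, toy_langs, toy_authors, k=2, m=2)
```

Here the first author ranks language 0 first and the second author ranks it second, so the toy value is the mean of 1 and 1/2. Averaging this quantity over all languages of a class gives the class-level figure.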

Findings
Time waits for no conference. We can see a left to right trend in Figure 6 with ACL in 1983 on the left, and subsequent iterations laid out as we go right. We observe the same trend for EACL, NAACL, EMNLP, CONLL, TACL, and COLING. We can say that the axis represents the progression of time to a certain extent. Alternatively, it may even represent a shift in the focus of NLP research, moving from theoretical research focused on grammar and formalisms on the left to a data-driven, more ML-oriented approach on the right. This can be observed as most of the CL embeddings are positioned on the left given their theoretical research focus.
Long distance relationships? From Figure 6, we can note that the less-resourced language classes are farther away from the trend-line of ACL than the more resourced ones, with class 5 being closest, and class 1 being farthest. The visualization illustrates that languages are spreading out radially downwards from the ACL trendline, with popular classes of the taxonomy like class 5 and class 4 being closer while others spread out farther. Again, as previous analyses have shown us, LREC and WS embeddings are closer to the language embeddings as compared to the other conferences, as shown in Figure 6. In fact, the LREC cluster is right in the middle of the language clusters and so is the major part of the WS cluster, especially in recent iterations.
Not all heroes wear capes. Table 5 shows the MRR for each class of languages in the taxonomy. From Table 5, it can be seen that class 0 has the highest MRR across different K values. This shows that perhaps low resource languages have some research groups solely focused on the challenges related to them. There is a decreasing trend of MRR from class 0 to class 5, except for class 2, thereby indicating that more popular languages are addressed by more authors. We also observe that even though Japanese, Mandarin, Turkish and Hindi (MRR(10) > 0.75) are part of class 5 and class 4, their MRR is higher even compared to low resource languages in other classes, indicating that these languages have focused research communities working on them. On the other end of the spectrum, we observe a lot of low resource languages like Burmese (MRR(10) = 0.02), Javanese (MRR(10) = 0.23) and Igbo (MRR(10) = 0.13) which have millions of speakers but significantly low MRR values, potentially indicating that not a lot of attention is being given to them in the research community.

Conclusion
We set out to answer some critical questions about the state of language resource availability and research. We do so by conducting a series of quantitative analyses through the lens of a defined taxonomy. As a result, we uncover a set of interesting insights and also yield consistent findings about language disparity:
- The taxonomical hierarchy is repeatedly evident from individual resource availabilities (LDC, LRE, Wikipedia, Web), entropy calculations for conferences, and the embeddings analysis.
- LREC and Workshops (WS) have been more inclusive across different classes of languages, seen through the inverse MRR statistics, entropy plots and the embeddings projection.
- There are typological features (such as 144E), existing in languages over spread out regions, represented in many resource-poor languages but not sufficiently in resource-rich languages. This could potentially reduce the performance of language tools relying on transfer learning.
- Newer conferences have been more language-inclusive, whereas older ones have maintained interests in certain themes of research which don't necessarily favour multilingual systems.
- There is a possible indication of a time progression or even a technological shift in NLP, which can be visualized in the embeddings projection.
- There is hope for low-resource languages, with MRR figures indicating that there are focused communities working on these languages and publishing works on them, but there are still plenty of languages, such as Javanese and Igbo, which do not have any such support.
We believe these findings will play a strong role in making the community aware of the gap that needs to be filled before we can truly claim state-of-the-art technologies to be language agnostic. Pertinent questions should be posed to authors of future publications about whether their proposed language technologies extend to other languages.
There are ways to improve the inclusivity of ACL conferences. Special tracks could be initiated for low-resource, language-specific tasks, although we believe that in doing so, we risk further marginalization of those languages. Instead, a way to promote change could be the addition of D&I (Diversity and Inclusion) clauses involving language-related questions in the submission and reviewer forms: Do your methods and experiments apply (or scale) to a range of languages? Are your findings and contributions contributing to the inclusivity of various languages?
Finally, in case you're still itching to know, Language X is Dutch, and Y is Somali.

A.3 Hyperparameter Tuning
Our model has the same hyperparameters as Word2Vec. To determine the optimal hyperparameters for the model, we split the entire dataset in an 80-20 ratio and, given the embedding of a paper, use a linear regression model to predict the year in which the paper was published. We measured both the $R^2$ measure of variance in regression and the mean absolute error (MAE). $R^2$ is usually in the range of 0 to 1.00 (or 0 to 100%), where 1.00 is considered to be the best. MAE has no upper bound, but the smaller it is the better, and 0 is its ideal value. We observed that our model does not show a significant difference across any hyperparameters except for the size of the embeddings. The best dimension size for our embeddings is 75, for which we observed an $R^2$ value of 0.6 and an MAE value of 4.04.
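The evaluation protocol above can be sketched with an ordinary least-squares fit. The random 80-20 split, the intercept term, the function name, and the synthetic data are assumptions of this example; the appendix does not specify how the split was drawn or which regression library was used.

```python
import numpy as np


def evaluate_year_regression(X, years, train_frac=0.8, seed=0):
    """Fit linear regression on a train split; return (R^2, MAE) on the held-out split."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(years, dtype=float)
    order = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    train, test = order[:cut], order[cut:]
    Xb = np.hstack([X, np.ones((len(y), 1))])          # add an intercept column
    coef, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
    pred = Xb[test] @ coef
    ss_res = ((y[test] - pred) ** 2).sum()
    ss_tot = ((y[test] - y[test].mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    mae = np.abs(y[test] - pred).mean()
    return float(r2), float(mae)


# Synthetic check: years that are an exact linear function of the embedding.
data_rng = np.random.default_rng(1)
toy_X = data_rng.normal(size=(100, 5))
toy_years = 2000.0 + toy_X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0])
toy_r2, toy_mae = evaluate_year_regression(toy_X, toy_years)
```

On this synthetic data the fit is essentially perfect; on real paper embeddings one would instead compare the two scores across candidate embedding sizes and keep the best-performing dimension.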

A.4 Cosine distance between conferences and languages
From Figure 6, we can see that languages lie somewhat below the conferences and are closer to some conferences while distant from others. To quantify this analysis, we compute the cosine distance between the conference vector and the mean of the vectors of each category of the taxonomy. Table 7 shows the cosine distance between the conferences and each category of languages, and we see a very similar trend: while ACL is at an average distance of 0.291 from category 5 languages, it is roughly twice as far away from category 2. There is also a very steep rise in the distance of the ACL vector from category 4 to category 3. In fact, similar trends are visible for other ACL related conferences including EACL, NAACL, EMNLP and TACL. We can also see in Table 7 that WS and LREC are closest from category 2 to category 5, whereas almost all conferences are somewhat at the same distance from category, except the CL journal. The trend for category 0 languages seems somewhat different from the usual trend in this table, probably because of the large number of languages in this category as well as the sparsity in papers.
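A minimal sketch of the distance computation, assuming cosine distance is defined as one minus cosine similarity and that each category is summarized by the mean of its language vectors. The function names and the toy 2-dimensional vectors are made up for illustration.

```python
import numpy as np


def cosine_distance(u, v):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def conference_category_distance(conference_vec, language_vecs):
    """Cosine distance between a conference vector and the mean language vector of a category."""
    centroid = np.mean(np.asarray(language_vecs, dtype=float), axis=0)
    return cosine_distance(conference_vec, centroid)


# Toy vectors: one conference and two categories of two languages each.
toy_conference = [1.0, 0.0]
near_category = [[1.0, 0.0], [1.0, 0.2]]
far_category = [[0.0, 1.0], [0.2, 1.0]]
d_near = conference_category_distance(toy_conference, near_category)
d_far = conference_category_distance(toy_conference, far_category)
```

Repeating this for every conference and category pair would fill a table of the same shape as the one discussed above.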