Women’s Syntactic Resilience and Men’s Grammatical Luck: Gender-Bias in Part-of-Speech Tagging and Dependency Parsing

Several linguistic studies have shown the prevalence of various lexical and grammatical patterns in texts authored by a person of a particular gender, but models for part-of-speech tagging and dependency parsing have still not adapted to account for these differences. To address this, we annotate the Wall Street Journal part of the Penn Treebank with the gender information of the articles’ authors, and build taggers and parsers trained on this data that show performance differences in text written by men and women. Further analyses reveal numerous part-of-speech tags and syntactic relations whose prediction performance benefits from the prevalence of a specific gender in the training data. The results underscore the importance of accounting for gendered differences in syntactic tasks, and outline future avenues for developing more accurate taggers and parsers. We release our data to the research community.


Introduction
Sociolinguistic studies have shown that people use grammatical features to signal their membership in a demographic group, with a focus on gender (Vigliocco and Franck, 1999; Mondorf, 2002; Eckert and McConnell-Ginet, 2013). Mondorf (2002) shows systematic differences between men and women in the usage of various types of clauses and their positions, stating that women have a higher usage of adverbial (accordingly, consequently), causal (since, because), conditional (if, when) and purpose (so, in order that) clauses, while men tend to use more concessive clauses (but, although, whereas). Similar results hold across various languages in Johannsen et al. (2015).
This correlation between grammatical features and gender has important ramifications for statistical models of syntax: if the training sample is unbalanced, these differences inadvertently introduce a strong gender bias into the training data. Such demographic imbalances are amplified by the model (Zhao et al., 2017), which in turn can be detrimental to members of the underrepresented demographic groups (Jørgensen et al., 2015; Hovy and Spruit, 2016). Since several works use syntactic analysis to improve tasks ranging from data-driven dependency parsing (Gadde et al., 2010) to sentiment classification (Moilanen and Pulman, 2007; Socher et al., 2013), underlying model biases end up affecting the performance of a wide range of applications. While data bias can be overcome by accounting for demographics, and doing so can even improve classification performance (Volkova et al., 2013; Hovy, 2015; Bolukbasi et al., 2016; Benton et al., 2017; Zhao et al., 2017; Lynn et al., 2017), there is still little understanding of the amount and sources of bias in most training sets.
In order to address gender bias in part-of-speech (POS) tagging and dependency parsing, we first require an adequately sized data set labeled for a) syntax along with b) gender information of the authors. However, existing data sets fail to meet both criteria: data sets with gender information are either too small to train on, lack syntactic information, or are restricted to social media; sufficiently large syntactic data sets are not labeled with gender information and rely (at least in part) on news genre corpora such as the Wall Street Journal (WSJ). To address this problem, we augment the WSJ subset of the Penn Treebank corpus with gender, based on author first name. To our knowledge, this is the first work that explores syntactic tagging while accounting for gender.
Contributions. The main contributions of this paper are as follows:
• We annotate a standard POS-tagging and dependency parsing data set with gender information.
• We conduct experiments and show the role played by gender information in POS-tagging and syntactic parsing.
• We analyze POS and syntactic differences related to author gender.

Annotating PTB for Gender
The Penn Treebank (Marcus et al., 1993) is the de facto data set used to train many of the POS taggers (Brill, 1994; Ratnaparkhi, 1996; Toutanova and Manning, 2000; Toutanova et al., 2003) and syntactic parsers (Nivre and Scholz, 2004; Chen and Manning, 2014). It contains articles published in the WSJ in 1989, as well as a small sample of ATIS-3 material, totalling over one million tokens, and is manually annotated with POS tags and syntactic parse trees. We supplement the WSJ articles with metadata from the ProQuest Historical Newspapers database, which indexes, among others, WSJ articles released between 1923 and 2000, and provides fields such as author names. Out of the original 2,499 WSJ articles, 1,814 are found in ProQuest and their metadata is retrieved. We remove 556 articles with an empty Author field, resulting in 1,258 WSJ articles with author information. Using a combination of regular expressions and manual verification, we extract author names for 1,006 articles (the remaining 252 articles do not have actual author names).
We isolate the first names using regular expressions, and follow Prabhakaran and Rambow (2017) to automatically assign gender and compute a gender ambiguity score taking into consideration: (1) the list of first names obtained based on Facebook profiles by Tang et al. (2011); and (2) the Social Security Administration's (SSA) baby names data set. The Facebook list has male and female assignment scores for each name, while the SSA maintains a data set of counts for baby names and gender for each year since the 1880s. If both databases agree in their gender assignment, we use that as the final label (987 articles). For the remaining 19, we manually identify the author gender by cross-referencing the names online. Five of these only had a first name initial, and thus could not be resolved and were discarded. The gender mapping results in 1,001 gender-tagged WSJ articles. Discarding 115 articles with joint authorship and considering only articles with both POS tags and parse trees results in a final set of 804 articles from the Treebank.
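The agreement rule between the two name resources can be illustrated with a minimal sketch. The lookup tables below are tiny, made-up stand-ins for the Facebook and SSA lists, the function names are hypothetical, and the gender ambiguity score of Prabhakaran and Rambow (2017) is not reproduced here; the sketch only shows the "accept when both sources agree, otherwise defer to manual review" logic.

```python
from typing import Dict, Optional

# Hypothetical toy tables with invented values; the real resources are
# far larger and are not reproduced here.
FACEBOOK_SCORES: Dict[str, Dict[str, float]] = {
    "mary": {"F": 0.99, "M": 0.01},
    "john": {"F": 0.01, "M": 0.99},
    "pat": {"F": 0.55, "M": 0.45},
}
SSA_COUNTS: Dict[str, Dict[str, int]] = {
    "mary": {"F": 4000, "M": 15},
    "john": {"F": 20, "M": 5000},
    "pat": {"F": 40, "M": 60},
}


def majority_label(scores: Dict[str, float]) -> Optional[str]:
    """Return the label with the highest score, or None for an unknown name."""
    if not scores:
        return None
    return max(scores, key=scores.get)


def assign_gender(first_name: str) -> Optional[str]:
    """Return 'F' or 'M' when both sources agree, else None (manual review)."""
    name = first_name.strip().lower().rstrip(".")
    if len(name) <= 1:
        return None  # a bare initial cannot be resolved automatically
    fb = majority_label(FACEBOOK_SCORES.get(name, {}))
    ssa = majority_label(SSA_COUNTS.get(name, {}))
    if fb is not None and fb == ssa:
        return fb
    return None  # disagreement or unknown name


print(assign_gender("Mary"), assign_gender("Pat"), assign_gender("J."))
```

Names that come back as None correspond to the cases that are resolved by hand or discarded.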
The final set of articles includes 379 unique authors, with a heavy gender imbalance of 1 to 3 (96 female and 283 male). The total number of sentences in female-authored articles is 7,282, with a mean of 21.17 tokens per sentence (σ = 10.03), while the male-authored articles consist of 19,400 sentences, with a mean of 20.99 tokens per sentence (σ = 10.52). This is similar to the findings of Cornett (2014), who also notes a longer mean utterance length for women than for men (her study focuses on adolescents).
We use the Universal Dependencies (UD) v1.4 (Nivre et al., 2016) annotation guidelines for parse trees and POS tags, and accordingly convert the constituency trees from the Penn Treebank (PTB) format to the CoNLL format. We then map the POS tags to the universal POS tag set.

The Effect of Gender in POS Tagging and Dependency Parsing
To assess whether author gender affects parsing performance, we train the state-of-the-art transition-based neural network model SyntaxNet (Andor et al., 2016) on the data (with default parameters), and test whether stratified training can alleviate these effects. We evaluate performance for individual POS tags and dependency relations, as well as over all the tags and relations.
Stratifying the Training Data. Since the WSJ data has a heavy gender imbalance (1:3 female to male articles), we stratify the data by discarding male examples so that the number of female and male sentences and tokens do not differ by more than 15%:
(1) We sort the female and male WSJ sentences in descending order of number of tokens.
(2) For each female sentence F_i with f_i tokens, we select a male sentence M_j such that its number of tokens m_j ∈ [0.75 f_i, 1.25 f_i].
(3) If we run out of male sentences that satisfy this condition, we choose the next male sentence in descending order with number of tokens m_j ∈ [5, 30].
Table 1 shows the number of sentences and tokens in the WSJ data before and after balancing for gender. We train the model in three scenarios: (1) on female data, (2) on male data, and (3) on generic data containing an equal number of male and female sentences. All three data sets have an equal number of sentences.
Evaluation. We report standard evaluation metrics: accuracy (ACC), the percentage of tokens that are assigned the correct part-of-speech (for part-of-speech tagging); and labeled attachment score (LAS), the percentage of tokens that are assigned both the correct head and the correct dependency relation (for dependency parsing).
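The length-matching procedure in steps (1) to (3) of the stratification can be sketched as follows. This is only an illustration under stated assumptions: sentences are represented by their token counts, the function name `balance_by_length` is hypothetical, and when several male sentences fall inside the window the sketch picks the longest unused one, a tie-breaking choice the text does not specify.

```python
from typing import List


def balance_by_length(female_lens: List[int], male_lens: List[int]) -> List[int]:
    """Pick one male sentence (by index into male_lens) per female sentence.

    Step 1: both sides are processed in descending order of token count.
    Step 2: prefer a male sentence with length in [0.75 * f, 1.25 * f].
    Step 3: otherwise fall back to the next unused male sentence whose
            length lies in [5, 30].
    """
    female_sorted = sorted(female_lens, reverse=True)
    # (length, original index) pairs, longest first
    pool = sorted(((m, j) for j, m in enumerate(male_lens)), reverse=True)
    used = set()
    chosen: List[int] = []
    for f in female_sorted:
        lo, hi = 0.75 * f, 1.25 * f
        pick = next((j for m, j in pool if j not in used and lo <= m <= hi), None)
        if pick is None:  # assumed fallback, mirroring step (3)
            pick = next((j for m, j in pool if j not in used and 5 <= m <= 30), None)
        if pick is not None:
            used.add(pick)
            chosen.append(pick)
    return chosen


# Toy token counts (invented for illustration).
print(balance_by_length([20, 10], [40, 22, 9, 12, 3]))
```

The linear scan per female sentence keeps the sketch short; for the full corpus an index over sentence lengths would be a more efficient choice.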
In each training setting, we generate five random training-test splits at a 90:10 ratio on the WSJ data set. In order to derive parameters for SyntaxNet, each train split is further randomly split into five folds. When creating the folds, we ensure that sentences written by the same author are not shared across splits, so that the models do not overfit to the writing styles of individual authors instead of learning the underlying gender-based differences as they pertain to syntax. In each training scenario, we evaluate the models on: (1) female-only data, (2) male-only data, and (3) generic data containing an equal number of male and female sentences (364 sentences from each gender), such that all test settings share the same number of sentences (10% of 7,282 = 728). Since we have 5 test folds, and each fold in turn has 5 validation folds (for parameter tuning), we report results averaged over the 25 total runs to ensure robustness.
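One way to keep authors from straddling a split is to assign whole authors, rather than individual sentences, to the test side. The sketch below is a hypothetical illustration of that idea, not the exact splitting code used for the experiments; the function name, the `(author_id, sentence)` input format, and the greedy filling rule are all assumptions.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Pair = Tuple[str, str]  # (author_id, sentence)


def author_disjoint_split(
    sentences: List[Pair], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Split so that no author contributes to both train and test."""
    by_author = defaultdict(list)
    for author, sent in sentences:
        by_author[author].append(sent)

    authors = sorted(by_author)            # deterministic order before shuffling
    random.Random(seed).shuffle(authors)

    target = test_fraction * len(sentences)
    train: List[Pair] = []
    test: List[Pair] = []
    for author in authors:
        # Fill the test side with whole authors until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend((author, s) for s in by_author[author])
    return train, test


# Toy data: 10 hypothetical authors with 10 sentences each.
data = [(f"author{a}", f"sentence {a}-{i}") for a in range(10) for i in range(10)]
train, test = author_disjoint_split(data)
print(len(train), len(test))
```

Because authors are assigned as units, the test set can slightly overshoot the requested fraction when an author has many sentences.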

Results and Discussion
Table 2 (top) shows the POS-tagging accuracies for labeling the WSJ test data. We should note that while accuracy differences may be relatively small, they are within the margins of recent state-of-the-art improvements (Andor et al., 2016) in a task that achieves extremely high accuracy and where further improvement can only be incremental. Considering performance across the three different training scenarios, the female test data sees a slight benefit from a mixed training set, achieving its highest accuracy of 95.96%, while male test data only achieves the highest performance (96.08%) when training on male-only data, representing a relative error rate reduction of 13.46% when compared to the generic model.
The setting closest to current POS tagging setups is embodied by training on the generic model. In this case, the female test data achieves its highest accuracy (95.96%), but the male test data achieves only a second best performance (95.47%). This difference suggests an area of possible improvement in performance for off-the-shelf POS taggers.
We see a similar pattern in dependency parsing (Table 2, bottom), where the female test set achieves its highest LAS on the mixed training set (83.46%). The male test set obtains its highest LAS when the training is performed on male-only data, with a relative error reduction of 3.89% as compared to training on generic data.
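The relative error rate reduction quoted for the male test data can be checked from the two reported accuracies (95.47% with generic training, 96.08% with male-only training). The helper below is a minimal sketch with a hypothetical function name; it assumes the usual definition, the drop in error rate divided by the baseline error rate.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Relative error rate reduction in percent, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# Male test data: generic training (95.47%) versus male-only training (96.08%).
rer = relative_error_reduction(95.47, 96.08)
print(f"{rer:.2f}%")
```

The result is about 13.5%, in line with the reported 13.46%; a small gap is expected because the published accuracies are rounded to two decimals.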
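For readers less familiar with the two metrics, the following sketch computes ACC and LAS exactly as they are defined in the evaluation setup: a tag counts as correct when it matches the gold tag, and an arc counts as correct only when both its head and its relation label match. The function names and the toy sentence are illustrative assumptions.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def pos_accuracy(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Percentage of tokens whose predicted POS tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return 100.0 * correct / len(gold_tags)


def labeled_attachment_score(gold: List[Arc], pred: List[Arc]) -> float:
    """Percentage of tokens with both the correct head and the correct relation."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)


# Toy three-token sentence (invented): the last token gets the wrong relation.
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "dobj")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "iobj")]
print(labeled_attachment_score(gold_arcs, pred_arcs))
print(pos_accuracy(["PRON", "VERB", "NOUN"], ["PRON", "VERB", "NOUN"]))
```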
It seems that female writings are more diverse, with a complexity that can best be approximated with mixed-gender training samples. This setting improves performance by relative error reductions of (1.46%, 1.72%) (ACC, LAS) when compared to training on female-only data, and (10.82%, 2.01%) (ACC, LAS) when compared to training on male-only data. The male test sentences appear to display less variability, and therefore cannot benefit to the same extent from the spectrum displayed by female training data; in fact, whenever female-authored sentences are present in the training set (whether as all-female data or generic data), performance drops for male test data.
When comparing male-only and female-only training sets and their ability to generalize to the opposite gender, we notice that male training data is more malleable and lends itself better to testing on female samples, but not the reverse.
We note that the WSJ exemplifies a highly formal and scripted newswire genre, where gender differences are likely less pronounced, yet they still surface. We will likely observe even stronger language differences in a large, informal data set comprising both gender and syntactic information. These differences can be leveraged to achieve a better performance for core NLP tasks.
Table 3: Tag-wise results for part-of-speech tagging on WSJ test data; Accuracies (Acc) and relative error reduction rates (Err) versus generic models are reported.
We also observe clear gender-based performance improvements at the tag level (Table 3). For instance, models trained on male-only data better predict nouns, determiners, numerals, pronouns and proper nouns for male test data, compared to models trained on mixed data (with a relative error rate reduction between 2.75% and 21.41%). Similarly, female-trained models better predict pronouns, auxiliaries, adjectives, and proper nouns for female test data, compared to models trained on mixed data (with a relative error rate reduction between 5.76% and 18.99%). For 8 out of the 16 POS tags, mixed training achieves best results for either female or male test data.
In dependency parsing (Table 4), models trained on female data better predict amod, cop, appos, and cc:preconj labels for female test sets (with a relative error rate reduction between 3.11% and 22.96% compared to generic models). Similarly, male-trained models are able to outperform mixed models on male test data for csubj, iobj, acl, compound, xcomp, dobj, conj and nummod with a relative error rate reduction between 2.11% and 14.61%. In dependency parsing, mixed training never achieves the best per-tag results for either male or female test sets.
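Tag-level results of this kind come from grouping tokens by their gold label before computing accuracy. The sketch below shows one plausible way to do this; the function name and the toy tag sequences are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List


def per_tag_accuracy(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Accuracy in percent for each gold label, computed over its own tokens."""
    total: Counter = Counter()
    correct: Counter = Counter()
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {tag: 100.0 * correct[tag] / total[tag] for tag in total}


# Toy example (invented): one of the two gold nouns is mistagged as PROPN.
gold_tags = ["NOUN", "NOUN", "VERB", "DET"]
pred_tags = ["NOUN", "PROPN", "VERB", "DET"]
print(per_tag_accuracy(gold_tags, pred_tags))
```

The same function applies to dependency relations if each label is replaced by a (head, relation) pair grouped by its gold relation.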
This suggests that leveraging the idiosyncrasies for specific tags displayed by each gender could help create gender-agnostic models that draw on the syntactic strengths of each gender, and improve prediction accuracy for both. It is to be noted that there is a heavy topic overlap between the male and female WSJ articles, with a Pearson correlation of 0.85 between the male and female topic distributions, indicating that the differences in performance between male and female models on various evaluation sets stem not from topical shifts, but from syntactic variations.
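The topic-overlap check amounts to a Pearson correlation between two vectors of topic proportions. How the topic distributions themselves were derived is not specified in this section, so the sketch below only illustrates the correlation step, using invented toy proportions and a hypothetical function name.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Invented topic proportions for female- and male-authored articles.
female_topics = [0.40, 0.30, 0.20, 0.10]
male_topics = [0.35, 0.35, 0.15, 0.15]
r = pearson(female_topics, male_topics)
print(round(r, 2))
```

A value close to 1 indicates that the two groups of articles cover topics in similar proportions.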

Conclusion
Our experiments show that women's syntax displays resilience: POS taggers and dependency parsers trained on any data perform well when tested on female writings. Male syntax, on the other hand, is parsed or tagged best when sufficient male-authored data is available in the training set. This suggests that men "lucked out" with respect to the gender imbalance in the WSJ training data: a more balanced or more female-heavy data set could have caused significant drops in the performance of automatic syntax analysis for male writers. The gender-annotated WSJ data provides a starting point for leveraging gendered grammatical differences and the development of better and fairer models and tools for syntax annotation, as well as for the many NLP downstream tasks that use syntax in their models.
The WSJ author gender information is publicly available from http://lit.eecs.umich.edu/downloads.html.