Cross-lingual syntactic variation over age and gender

Most computational sociolinguistics studies have focused on phonological and lexical variation. We present the first large-scale study of syntactic variation among demographic groups (age and gender) across several languages. We harvest data from online user-review sites and parse it with universal dependencies. We show that several age- and gender-specific variations hold across languages, for example that women are more likely to use VP conjunctions.


Introduction
Language varies between demographic groups. To detect this variation, sociolinguistic studies require both a representative corpus of text and meta-information about the speakers. Traditionally, this data was collected from a combination of interview transcriptions and questionnaires. Both methods are time-consuming, so population sizes have been small, sometimes including fewer than five subjects (Rickford and Price, 2013). While these resources enable detailed qualitative analyses, small sample sizes may lead to false research findings (Button et al., 2013). Sociolinguistic studies, in other words, often lack the statistical power to establish relationships between language use and socio-economic variables.
Obtaining large enough data sets becomes even more challenging the more complex the target variables are. So while syntactic variation has been identified as an important factor of variation (Cheshire, 2005), it has not been approached, due to its high complexity. This paper addresses the issue systematically on a large scale. In contrast to previous work in both sociolinguistics and NLP, we consider syntactic variation across groups at the level of treelets, as defined by dependency structures, and make use of a large corpus that includes demographic information on both age and gender.
The impact of such findings goes beyond sociolinguistic insights: knowledge about systematic differences among demographic groups can help us build better and fairer NLP tools. Volkova et al. (2013), Jørgensen et al. (2015), and Hovy (2015) have shown the impact of demographic factors on NLP performance. Recently, the company Textio introduced a tool to help phrase job advertisements in a gender-neutral way. While their tool addresses lexical variation, our results indicate that linguistic differences extend to the syntactic level.
Previous work on demographic variation in both sociolinguistics and NLP has begun to rely on corpora from social media, most prominently Twitter. Twitter offers a sufficiently large data source with broad coverage (albeit limited to users with access to social media). Indeed, results show that this resource reflects the phonological and morpholexical variation of spoken language (Eisenstein, 2013b; Eisenstein, 2013a; Doyle, 2014).
However, Twitter is not well-suited for the study of syntactic variation for two reasons. First, the limited length of the posts compels the users to adopt a terse style that leaves out many grammatical markers. As a consequence, the performance of syntactic parsers in this domain is too low for linguistic analysis. Second, Twitter provides little meta-information about the users, except for regional origin and time of posting. Existing work has thus been restricted to these demographic variables. One line of research has focused on predictive models for age and gender (Alowibdi et al., 2013; Ciot et al., 2013) to add meta-data on Twitter, but again, error rates are too high for use in sociolinguistic hypothesis testing.
We use a new source of data, namely the user review site Trustpilot. The meta-information on Trustpilot is both more prevalent and more reliable, and textual data is not restricted in length (see Table 2). We use state-of-the-art dependency parsers trained on universal treebanks (McDonald et al., 2013) to obtain comparable syntactic analyses across several different languages and demographics.
Contributions. We present the first study of morpho-syntactic variation with respect to demographic variables across several languages at a large scale. We collect syntactic features within demographic groups and analyze them to retrieve the most significant differences. For the analysis we use a method that preserves statistical power, even when the number of possible syntactic features is very large. Our results show that demographic differences extend beyond lexical choice.

Data collection
The TRUSTPILOT CORPUS consists of user reviews from the Trustpilot website. On Trustpilot, users can review company websites and leave a one- to five-star rating as well as a written review. The data is available for 24 countries, using 13 different languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish). In our study, we are limited by the availability of comparable syntactically annotated corpora (McDonald et al., 2013) to six languages used in eleven countries, i.e., English (Australia, Canada, UK, and US), French (Belgium and France), German (Switzerland and Germany), Italian, Spanish, and Swedish. We treat the different variants of these languages separately in the experiments below. (While this might miss some dialectal idiosyncrasies, it is based on standard NLP practice, e.g., when using WSJ-trained parsers in translation of (British) Europarl.)
Many users opt to provide a public profile. There are no mandatory fields other than name, but many also supply their birth year, gender, and location. We crawl the publicly available information on the web site for users and reviews, each with different fields. Table 1 lists the fields that are available for each type of entity. For more information on the data as a source for demographic information, see .

Table 1: Meta-information in TRUSTPILOT data
Users: Name, ID, profile text, location (city and country), gender, year of birth
Reviews: Title, text, rating (1-5), User ID, Company ID, date and time of review

We enhance the data set for our analysis by adding gender information based on first names. In order to add missing gender information, we measure the distribution over genders for each name. If a name occurs with sufficient frequency and is found predominantly in one gender, we propagate this gender to all occurrences of the name that lack gender information. In our experiments, we used a gender-purity factor of 0.95 (name occurs with one gender 95% of the time) and a minimum frequency of 3 (name appears at least 3 times in the data). Since names are language-specific (Angel is male in Spanish, but female in English), we run this step separately on each language. On average, this measure doubled the amount of gender information for a language.
Note that the domain (reviews) potentially introduces a bias, but since our analysis is largely at the syntactic level, we expect the effect to be limited. While there is certainly a domain effect at the lexical level, we assume that the syntactic findings generalize better to other domains.
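The name-based gender propagation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name, the dictionary layout of a user profile, and the toy data are hypothetical, and the sketch assumes that the minimum frequency is counted over profiles that state a gender. It would be run once per language.

```python
from collections import Counter, defaultdict


def propagate_gender(users, purity=0.95, min_freq=3):
    """Fill in missing genders from first names.

    users: list of dicts with keys "name" and "gender" (None when missing).
    Returns a new list; stated genders are never overwritten.
    """
    # Count stated genders per first name.
    by_name = defaultdict(Counter)
    for user in users:
        if user["gender"] is not None:
            by_name[user["name"]][user["gender"]] += 1

    # Keep names that are frequent enough and dominated by one gender.
    reliable = {}
    for name, counts in by_name.items():
        total = sum(counts.values())
        gender, top = counts.most_common(1)[0]
        if total >= min_freq and top / total >= purity:
            reliable[name] = gender

    filled = []
    for user in users:
        gender = user["gender"]
        if gender is None:
            gender = reliable.get(user["name"])  # stays None if name is unreliable
        filled.append({**user, "gender": gender})
    return filled


# Hypothetical toy profiles for a single language.
users = (
    [{"name": "Anna", "gender": "F"}] * 3
    + [{"name": "Kim", "gender": "F"}] * 2
    + [{"name": "Kim", "gender": "M"}]
    + [{"name": "Anna", "gender": None}, {"name": "Kim", "gender": None}]
)
filled = propagate_gender(users)
```

In the toy data, the unlabeled "Anna" receives a gender because the name is frequent and pure, while the unlabeled "Kim" does not, since the name is split between genders.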

Methodology
For each language, we train a state-of-the-art dependency parser (Martins et al., 2013) on a treebank annotated with the Stanford dependency labels (McDonald et al., 2013) and the universal POS tag set (Petrov et al., 2011). This gives us syntactic analyses across all languages that describe the same syntactic phenomena in the same way. Figure 1 shows two corpus sentences annotated with this harmonized representation. The style of the reviews is much more canonical than that of social web data such as Twitter. Expected parse performance can be estimated from the SANCL 2012 shared task on dependency parsing of web data (Petrov and McDonald, 2012).
From the parses, we extract all subtrees of up to three tokens (treelets). We do not distinguish between right- and left-branching relations: the representation is basically a "bag of relations". The purpose of this is to increase comparability across languages with different word orderings (Naseem et al., 2012). A one-token treelet is simply the POS tag of the token, e.g. NOUN or VERB. A two-token treelet is a typed relation between head and dependent.
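To make the treelet definitions above concrete, here is a small sketch of order-insensitive treelet extraction from a single dependency parse. The token format, the string encoding of treelets, and the example sentence are hypothetical choices for illustration. The sketch also assumes that three-token treelets come in two shapes, a chain (head, dependent, grandchild) and a fork (head with two dependents), which are the two connected subtrees possible with three tokens.

```python
from collections import Counter
from itertools import combinations


def extract_treelets(tokens):
    """Count treelets of up to three tokens in one parsed sentence.

    tokens: list of (pos, head, rel), where head is the index of the
    head token, or -1 for the root.
    """
    children = {i: [] for i in range(len(tokens))}
    for i, (_, head, _) in enumerate(tokens):
        if head >= 0:
            children[head].append(i)

    def edge(i):
        pos, _, rel = tokens[i]
        return f"{rel}>{pos}"

    treelets = Counter()
    for h, (pos, _, _) in enumerate(tokens):
        treelets[pos] += 1  # one-token treelet: the POS tag
        for d in children[h]:
            treelets[f"{pos}>{edge(d)}"] += 1  # two-token: typed relation
            for g in children[d]:
                treelets[f"{pos}>{edge(d)}>{edge(g)}"] += 1  # three-token chain
        # Three-token fork; sorting removes any left/right ordering.
        for d1, d2 in combinations(children[h], 2):
            a, b = sorted([edge(d1), edge(d2)])
            treelets[f"{pos}>{{{a},{b}}}"] += 1
    return treelets


# Hypothetical parse of "I love this shop".
sentence = [
    ("PRON", 1, "nsubj"),
    ("VERB", -1, "root"),
    ("DET", 3, "det"),
    ("NOUN", 1, "dobj"),
]
treelets = extract_treelets(sentence)
```

Because word positions never enter the treelet strings, the same structure yields the same treelet in languages with different word orders, which is the point of the "bag of relations" representation.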

Treelet reduction
We extract between 500,000 and a million distinct treelets for each language. In principle, we could directly check for significant differences in the demographic groups and use Bonferroni correction to control the family-wise error (i.e., the probability of obtaining at least one false positive). However, given the large number of treelets, the correction for multiple comparisons would underpower our analyses and potentially cause us to miss many significant differences. We therefore reduce the number of treelets by two methods.
First, we set the minimum number of occurrences of a feature in each language to 50. We apply this heuristic both to ensure statistical power and to focus our analyses on prevalent rather than rare syntactic phenomena.
Second, we perform feature selection using L1-regularized randomized logistic regression models, with age or gender as target variable, and the treelets as input features. However, direct feature selection with L1-regularized models (Ng, 2004) is problematic when variables are highly correlated (as in our treelets, where e.g. three-token structures can subsume smaller ones). As a result, small and inessential variations in the dataset can determine which of the variables are selected to represent the group, so we end up with random within-group feature selection.
We therefore use stability selection (Meinshausen and Bühlmann, 2010). Stability selection mitigates the correlation problem by fitting the logistic regression model hundreds of times with perturbed data (75% subsampling and feature-wise regularization scaling). Features that receive non-zero weights across many runs can be assumed to be highly indicative. Stability selection thus gives all features a chance to be selected. It controls the false positive rate, which is less conservative than controlling the family-wise error. We use the default parameters of a publicly available stability selection implementation, run it on the whole data set, and discard features selected less than 50% of the time.
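The following sketch illustrates the idea of stability selection on toy data. It is not the publicly available implementation mentioned above: the small proximal-gradient solver, the function names, and the penalty and scaling values are assumptions made for illustration, and the toy example uses fewer runs than the hundreds described in the text.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logreg(X, y, penalties, lr=0.5, n_iter=100):
    """L1-penalized logistic regression via proximal gradient descent.

    penalties[j] is the L1 penalty on feature j; the intercept is unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [soft_threshold(wj - lr * gj / n, lr * pen)
             for wj, gj, pen in zip(w, grad_w, penalties)]
    return w


def stability_selection(X, y, n_runs=100, sample_frac=0.75,
                        penalty=0.05, scaling=0.5, seed=0):
    """Return, per feature, the fraction of runs with a non-zero weight."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    hits = [0] * d
    for _ in range(n_runs):
        rows = rng.sample(range(n), int(sample_frac * n))
        # Feature-wise scaling: each penalty is randomly kept or inflated.
        pens = [penalty / (scaling if rng.random() < 0.5 else 1.0)
                for _ in range(d)]
        w = fit_l1_logreg([X[i] for i in rows], [y[i] for i in rows], pens)
        for j, wj in enumerate(w):
            if wj != 0.0:
                hits[j] += 1
    return [h / n_runs for h in hits]


# Toy data: feature 0 mirrors the label, feature 1 is unrelated to it.
y = [1] * 20 + [0] * 20
X = [[float(label), float(i % 2)] for i, label in enumerate(y)]
freqs = stability_selection(X, y)
kept = [j for j, f in enumerate(freqs) if f >= 0.5]  # 50% cutoff from the text
```

The selection frequency, rather than any single fitted weight, is what decides whether a feature is kept, so a feature that is only picked up in a few lucky subsamples is discarded.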
With the reduced feature set, we check for usage differences in demographic groups (age and gender) using a χ² test. We distinguish two age groups: speakers that are younger than 35, and speakers older than 45. These thresholds were chosen to balance the size of both groups. At this stage we set the desired significance level at 0.02 and apply Bonferroni correction, effectively dividing the p-value threshold by the number of remaining treelets. Note, finally, that the average number of words written by a reviewer differs between the demographic groups (younger users tend to write more than older ones, women more than men). To counteract this effect, the expected counts in our null hypothesis use the proportion of words written by people in a group, rather than the proportion of people in the group (which would skew the results towards the groups with longer reviews).
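As a worked illustration of the test described above, the sketch below computes a χ² statistic for one treelet and two groups, with expected counts proportional to the number of words each group wrote, and compares the p-value against a Bonferroni-corrected threshold. The function name and all counts are hypothetical, and treating the test as a goodness-of-fit test with one degree of freedom is an assumption about the exact setup.

```python
import math


def chi2_usage_test(counts, words, alpha=0.02, n_tests=1):
    """Test whether a treelet is used at different rates by two groups.

    counts: occurrences of the treelet per group (two groups).
    words: total number of words written by each group.
    n_tests: number of treelets tested, for Bonferroni correction.
    """
    total_count = sum(counts)
    total_words = sum(words)
    # Null hypothesis: usage is proportional to the words written per group.
    expected = [total_count * w / total_words for w in words]
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, p_value < alpha / n_tests


# Same raw counts, different amounts of text per group (hypothetical numbers).
equal_text = chi2_usage_test([600, 400], [50000, 50000], n_tests=1000)
more_text = chi2_usage_test([600, 400], [60000, 40000], n_tests=1000)
```

In the first call the groups wrote equally much, so a 600 to 400 split is a real difference in rate. In the second call the first group simply wrote more text, the expected counts match the observed ones, and the difference disappears.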

Results
We are interested in robust syntactic variation across languages; that is, patterns that hold across most or all of the languages considered here. We therefore score each of the identified treelets by the number of languages with a significant difference in occurrence between the groups of the given demographic variable. Again, we use a rather conservative non-parametric hypothesis test, with Bonferroni correction. Tables 3 and 4 show the results for age and gender, respectively. The first column shows the number of languages in which the treelet (third column) is significant. The fourth and fifth columns indicate for which age or gender subgroup the feature is indicative, and how much larger the rate of occurrence is there in percent. The indices in the last column represent containment relationships, i.e., when a treelet is strictly contained in another treelet (indexed by the rank given in the second column).
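The cross-lingual scoring step can be sketched in a few lines. The input format (a mapping from language to its set of significant treelets), the function name, and the toy values are hypothetical.

```python
from collections import Counter


def rank_by_language_support(significant):
    """Rank treelets by the number of languages in which they are significant.

    significant: dict mapping a language code to a set of treelets.
    Ties are broken alphabetically to keep the ranking deterministic.
    """
    support = Counter(t for treelets in significant.values() for t in treelets)
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical significant treelets for three languages.
significant = {
    "UK": {"NOUN", "PRON"},
    "US": {"NOUN"},
    "SE": {"NOUN", "PRON", "VERB"},
}
ranking = rank_by_language_support(significant)
```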
In the case of gender, three atomic treelets (parts of speech) correlate significantly across all 11 languages. Two treelets correlate significantly across 10 languages. For age, five treelets correlate significantly across 10 languages.
In sum, men seem to use numerals and nouns more than women across languages, whereas women use pronouns and verbs more often. Men use nominal compounds more often than women in nine out of eleven languages. Women, on the other hand, use VP coordinations more in eight out of eleven languages.
For age, some of the more striking patterns involve prepositional phrases, which see higher use in the older age group. In atomic treelets, noun use is slightly higher in the older group, while pronouns are more often used by younger reviewers.
Our results address a central question in variational linguistics, namely whether syntax plays a role in language variation among groups. While this has been long suspected, it was never empirically researched due to the perceived complexity. Our findings are the first to corroborate the hypothesis that language variation goes beyond the lexical level.
We also present the pairwise overlap in significant treelets between (a subset of the) languages; see Table 5. The diagonal values give the number of significant treelets for that language. Percentages in the pairwise comparisons are normalized by the smallest of the pair. For instance, the 49 % overlap between Sweden (SE) and United Kingdom (UK) in Table 5 means that 49 % of the 182 SE treelets were also significant in UK.
We observe that English variants (UK and US) share many features. The Romance languages also share many features with each other, but Italian and Spanish also share many features with English. In Section 5, we analyze our results in more depth.
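The pairwise overlap described above, normalized by the smaller set of each pair, can be computed as in the following sketch. The function name and the toy treelet sets are hypothetical and do not reproduce the values in Table 5.

```python
def overlap_table(significant):
    """Pairwise overlap of significant treelets between languages.

    significant: dict mapping a language code to a set of treelets.
    Diagonal cells hold the number of significant treelets; off-diagonal
    cells hold the shared treelets as a percentage of the smaller set.
    """
    table = {}
    for a, set_a in significant.items():
        for b, set_b in significant.items():
            if a == b:
                table[(a, b)] = len(set_a)
                continue
            smaller = min(len(set_a), len(set_b))
            shared = len(set_a & set_b)
            table[(a, b)] = 100.0 * shared / smaller if smaller else 0.0
    return table


# Hypothetical treelet sets for two languages.
toy = {
    "SE": {"a", "b", "c", "d"},
    "UK": {"a", "b", "w", "x", "y", "z"},
}
table = overlap_table(toy)
```

Normalizing by the smaller set makes the table symmetric, so a language with few significant treelets is not penalized when compared with one that has many.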

Analysis of syntactic variation
Due to space constraints, we restrict our analysis to a few select treelets with good coverage and interpretable results.

Gender differences
The top features for gender differences are mostly atomic (pre-terminals), indicating that we observe the same effect as mentioned previously in the literature (Schler et al., 2006), namely that certain parts-of-speech are prevalent in one gender.
Treelets 1, 2, and 3. For all languages, the use of numerals and nouns is significantly correlated with men, while pronouns and verbs are more indicative of women. When looking at the types of pronouns used by men and women, we see very similar distributions, but men tend to use impersonal pronouns (it, what) more than women do. Nouns and numbers are associated with the alleged "information emphasis" of male language use (Schler et al., 2006). Numbers typically indicate prices or model numbers, while nouns are usually company names.
The robustness of POS features could to some extent be explained by the different company categories reviewed by each gender: in COMPUTER & ACCESSORIES and CAR LIGHTS the reviews are predominantly by men, while the reviews in the PETS and CLOTHES & FASHION categories are mainly posted by women. Using numerals and nouns is more likely when talking about computers and car lights than when talking about pets and clothing, for example.
Treelet 4. In English, this treelet is instantiated by examples such as:
(1) is/was/are great/quick/easy and is/was/arrived
In German, the corresponding examples would be:
(2) bin/war zufrieden und werde/würde wieder bestellen (am/was satisfied and will/would order again)

Age differences
For age, features vary a lot more than for gender, i.e., there is less support for each than there was for the gender features. A few patterns still stand out.
(1) at price
(2) with service
In German, it is mostly used to express comparisons:
(1) in Ordnung (alright)
(2) am Tag (on the day)

Semantic variation within syntactic categories
Given that a number of the indicative features are single-token treelets (POS tags), we wondered whether there are certain semantic categories that fill these slots. Since we work across several languages, we are looking for semantically equivalent classes. We collect the most significant adjectives and adverbs for each gender for each language and map the words to all of their possible lexical groups in BabelNet (Navigli and Ponzetto, 2010). This creates lexical equivalence classes. We purposefully exclude nouns and verbs here, as there is too much variation to detect any patterns. The number of languages that share lexical items from the same BabelNet class is typically smaller than the number of languages that share a treelet. Nevertheless, we observe certain patterns.
The results for gender are presented in Table 6. For adverbs, the division seems to be about intensity: men use more downtoners (approximately; almost; still), while women use more intensifiers (actually; really; truly; quite; lots). This finding is new, in that it directly contradicts the received wisdom of female language as being more restrained and hedging.
In their use of adjectives, on the other hand, men highlight "factual" properties of the subject, such as price (inexpensive) and quality (cheap; best; professional), whereas women use more qualitative adjectives that express the speaker's opinion about the subject (fantastic; amazing; pretty) or their own state (happy), although we also find the "factual" assessment simple.
Table 7 shows the results for age. There are not many adjectives that group together, and they do not show a clear pattern. Most of the adverbs are indicative of the younger group, although there is overlap with the older group (this is due to different sets of words mapping to the same class). We did not find any evidence for pervasive age effects across languages.

Related Work
Sociolinguistic studies investigate the relation between a speaker's linguistic choices and socio-economic variables. This includes regional origin (Schmidt and Herrgen, 2001; Nerbonne, 2003; Wieling et al., 2011), age (Barke, 2000; Barbieri, 2008; Rickford and Price, 2013), gender (Holmes, 1997; Rickford and Price, 2013), social class (Labov, 1964; Milroy and Milroy, 1992; Macaulay, 2001; Macaulay, 2002), and ethnicity (Carter, 2013; Rickford and Price, 2013). We focus on age and gender in this work.
Corpus-based studies of variation have largely been conducted either by testing for the presence or absence of a set of pre-defined words (Pennebaker et al., 2001; Pennebaker et al., 2003), or by analysis of the unigram distribution (Barbieri, 2008). This approach restricts the findings to the phenomena defined in the hypothesis, in this case the word list used. In contrast, our approach works beyond the lexical level, is data-driven and thus unconstrained by prior hypotheses.

Table 7: Age: Lexical equivalences in BabelNet

Eisenstein et al. (2011) use regression to predict demographic attributes from term frequencies, and vice versa. Using sparsity-inducing priors, they identify key lexical variations between linguistic communities. While they mention syntactic variation as possible future work, their method has not yet been applied to syntactically parsed data. Our method is simpler than theirs, yet goes beyond words. We learn demographic attributes from raw counts of syntactic treelets rather than term frequencies, and test for group differences between the most predictive treelets and the demographic variables. We also use a sparsity-inducing regularizer.
Kendall et al. (2011) study dative alternations on a 250k-word corpus of transcribed spoken Afro-American Vernacular English. They use logistic regression to correlate syntactic features and dialect, similar to Eisenstein et al. (2011), but their study differs from ours in using manually annotated data, studying only one dialect and demographic variable, and using much less data. Stewart (2014) uses POS tags to study morphosyntactic features of Afro-American Vernacular English on Twitter, such as copula deletion, habitual be, null genitive marking, etc. Our study is different from his in using full syntactic analyses, studying variation across age and gender rather than ethnicity, and in studying syntactic variation across several languages.

Conclusion
Syntax has been identified as an important factor in language variation among groups, but has not been addressed. Previous work has been limited by data size or availability of demographic meta-data. Existing studies on variation have thus mostly focused on lexical and phonological variation.
In contrast, we study the effect of age and gender on syntactic variation across several languages. We use a large-scale data source (international user-review websites) and parse the data, using the same formalisms to maximize comparability. We find several highly significant age- and gender-specific syntactic patterns.
As NLP applications for social media become more widespread, we need to address their performance issues. Our findings suggest that including extra-linguistic factors (which become more and more available) could help improve performance of these systems. This requires a discussion of approaches to corpora construction and the development of new models.