Returning the N to NLP: Towards Contextually Personalized Classification Models

Most NLP models today treat language as universal, even though socio- and psycholinguistic research shows that the communicated message is influenced by the characteristics of the speaker as well as the target audience. This paper surveys the landscape of personalization in natural language processing and related fields, and offers a path forward to mitigate the decades of deviation of NLP tools from sociolinguistic findings, allowing us to flexibly process the “natural” language of each user rather than enforcing a uniform NLP treatment. It outlines a possible direction to incorporate these aspects into neural NLP models by means of socially contextual personalization, and proposes to shift the focus of our evaluation strategies accordingly.


Introduction
Our language is influenced by our individual characteristics as well as by our affinity to various sociodemographic groups (Bucholtz and Hall, 2005; McPherson et al., 2001; Eckert and McConnell-Ginet, 2013). Yet the majority of NLP models today treat language as universal, acknowledging that words have different meanings in different semantic contexts, but typically assuming that this context has the same meaning for everyone. In this paper, I propose that we shift our focus towards interpreting language together with its user-dependent, contextual personal and social aspects, in order to truly process the "natural" language of a user. I outline a possible direction to incorporate these aspects into neural NLP models, and suggest adjusting our evaluation strategies accordingly.
The paper is structured with the following aims in mind: Sec. 2 provides historical context, seeking evidence of personalization needs. Sec. 3 reviews existing personalization work, as the personalization efforts and success stories are scattered across contributions to various applied tasks. Sec. 4 considers how NLP personalization could be adopted as a process of several stages. Sec. 5 outlines an implementation proposal for contextually personalized classification models, building upon flexible, socially conditioned user representations. Sec. 6 proposes novel evaluation approaches reflecting the benefit of personalized models. Finally, Sec. 7 opens the discussion on ethical aspects, non-personalizable NLP tasks, and the role of industry in personal data collection and protection.

Historical context
Since the 1990s, with the rise of so-called empirical or statistical NLP (Manning et al., 1999; Brill and Mooney, 1997), the focus on frequently appearing phenomena in large textual data sets unavoidably led to NLP tools supporting "standard English" for the generic needs of an anonymous user. An NLP tool, whether a POS tagger, dependency parser, machine translation model or topic classifier, was typically provided as one trained model for one language (Toutanova et al., 2003; Klein and Manning, 2003; Morton et al., 2005), or, later on, for major underperforming domains, such as Twitter (Gimpel et al., 2011). However, enforcing artificial domain boundaries is suboptimal (Eisenstein, 2013). Neglecting the variety of users and use cases doesn't make the tools universally applicable with the same performance; it only makes our community blind to the built-in bias towards the specifics of user profiles in training data (Hovy, 2015; Tatman, 2017).
Meanwhile, in the information retrieval area, personalization has been incorporated from the early days: it is a long-accepted paradigm that different users with different information needs might search using the same query (Verhoeff et al., 1961) and that individual information needs evolve (Taylor, 1968). With the rising popularity of search engines in the 1990s, the need for personalization in the interpretation of the query became obvious (Wilson, 1999). Exploiting logs of user search interactions allowed personalization at scale (Carbonell and Goldstein, 1998; Sanderson and Croft, 2012). In the 2000s, it became acceptable to personalize search results using implicit information about the user's interests and activities, e.g. leveraging browsing history or even e-mail conversations (Teevan et al., 2005; Dou et al., 2007; Matthijs and Radlinski, 2011). Today, hardly any of us can imagine that searching e.g. for a pizzeria from our cell phone would return the same list of results for everyone regardless of location.
The area of recommendation systems has followed the IR trends, with more emphasis on the social than the personal component. The early GroupLens Usenet experiments (Miller et al., 1997; Resnick et al., 1994) already showed the effectiveness of personalized article recommendations via collaborative filtering. Acknowledging the potential of personalizing via similar or related users, the focus moved towards exploiting information from users' social networks (Guy et al., 2010; De Francisci Morales et al., 2012; Guy et al., 2009).
Similar developments are emerging for example in the area of personalized language models (Ji et al., 2019; Wen et al., 2012; Yoon et al., 2017; McMahan et al., 2017), which are largely used e.g. in predictive writing, and in natural language generation (Oraby et al., 2018; Harrison et al., 2019), aiming e.g. at selecting and preserving a consistent personality and style within a discourse.
Drawing inspiration from these areas, I argue it is natural for users to expect personalized approaches when an NLP system attempts to interpret their language, i.e., attempts to assign any label to a provided text segment, whether it is, e.g., the sentiment of their sentence, the part of speech of a word they used, a sense definition from a knowledge base, or even a translation. As I discuss in the following section, even basic personal information has been shown to be relevant for system accuracy.
While many experiments have been conducted using discrete variables for demographics and personality, real-valued continuous representations are preferable (Lynn et al., 2017). Numerous researchers have pointed out that it would be more meaningful to create models building on recent developments in sociolinguistics, i.e. treating demographic variables as fluid and social, e.g. modeling what influences speakers to show more or less of their identity through language, or jointly modeling variation between and within speakers (Eckert and McConnell-Ginet, 2013; Nguyen et al., 2014; Bamman et al., 2014; Eisenstein, 2013).
Improving NLP tasks with user traits. Actively accounting for sociodemographic factors in text classification models leads to improved performance across NLP applications. So far, such studies have been conducted most prominently for the English language, using age and gender variables, with the most focus on sentiment analysis tasks (Volkova et al., 2013; Hovy, 2015; Lynn et al., 2017; Yang and Eisenstein, 2017). Other explored tasks include topic detection, part-of-speech tagging (Hovy, 2015), prepositional phrase attachment, sarcasm detection (Lynn et al., 2017), fake news detection (Long et al., 2017; Potthast et al., 2018), or detection of mental health issues (Benton et al., 2016). Apart from demographic variables, personality traits play a role as well, e.g. in stance detection (Lynn et al., 2017), sarcasm detection, opinion change prediction (Lukin et al., 2017), and prediction of regional life satisfaction or mortality rate (Zamani et al., 2018). NLP models can also improve by exploiting the user's past context and prior beliefs, e.g. for sarcasm (Bamman and Smith, 2015), stance prediction (Sasaki et al., 2018), persuasion (Durmus and Cardie, 2018) or conversation re-entry (Zeng et al., 2019). Methods used to incorporate the social and psychological variables into models are discussed in Sec. 5.
Improving NLP tasks with social graphs. An emerging line of research makes use of social interactions to derive information about the user, representing each user as a node in a social graph and creating low-dimensional user embeddings induced by a neural architecture (Grover and Leskovec, 2016; Qiu et al., 2018). Including network information improves performance on profiling tasks such as predicting user gender (Farnadi et al., 2018) or occupation (Pan et al., 2019), as well as on detecting online behavior such as cyberbullying (Mathur et al., 2018) or abusive language use (Qian et al., 2018; Mishra et al., 2018).

NLP personalization as a process
From the user experience perspective, personalization of NLP tools could be divided into three steps.
Explicit input. In the first step, the user is allowed to provide personal information for the NLP components explicitly. The depth of information provided can vary from specifying one's age to taking personality questionnaires. This user behavior is somewhat similar to subscribing to topics of interest for personalized newsletters: the user has full control over the level of customization. However, the results of increasing the burden on the user can be inferior to implicit inference (Teevan et al., 2005).
Implicit inference. More conveniently, personal information about the user can be inferred implicitly by the system, as demonstrated e.g. by the models discussed in Sec. 3. The result of such inference can be either a set of explicit labels, or a latent user representation capturing similar information in a larger number of data-driven dimensions. For the user, such personalization might currently feel intrusive in the context of an NLP system; however, in many related research areas user expectations have already shifted (cf. Sec. 2).
Contextualized implicit inference. In the third step, personalization also includes intra-user modeling of different individual contexts based on the user's communication goals. This reflects the social science argument that an identity is the product rather than the source of linguistic and other semiotic practices, and that identities are relationally constructed through several, often overlapping, aspects of the relationship between self and other, including similarity/difference, genuineness/artifice and authority/delegitimacy (Bucholtz and Hall, 2005). This approach is also aligned with NLP findings on social power in dialogue (Bracewell et al., 2012; Bramsen et al., 2011; Prabhakaran et al., 2012). Such a solution can be perceived as less invasive by the users, as the contextual adaptation may diminish the otherwise built-in stereotypes of language use (e.g. some users may prefer to use more emotionally charged words in private social contexts, but not necessarily in professional conversations).

Methods of incorporating psychosocial profiles into NLP models
Early experiments used basic demographic variables directly as input features in the model (Volkova et al., 2013). Hovy (2015) uses age and gender as modifying factors for the input word embeddings. In a similar manner, Lynn et al. (2017) use a multiplicative compositional function to combine continuous user trait scores, inferred via factor analysis, with the original feature values, augmenting the feature set so that each feature exists with and without the trait information integrated. Benton et al. (2017) use age and gender as auxiliary tasks in a multitask learning setup for psychological labeling of users. Zamani and Schwartz (2017) apply a residualized control approach for their task, training a language model over the prediction errors of a model trained on sociodemographic variables only. Later they combine it with the factor analysis approach (Zamani et al., 2018). Benton et al. (2016) learn user representations by encoding each user's social network as a vector, where users with similar social networks have similar vector representations.
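To make the multiplicative composition concrete, the following is a minimal sketch under stated assumptions, not the implementation of Lynn et al. (2017): it assumes a dense document-feature matrix and a matrix of continuous trait scores for each document's author, and the function name `augment_with_traits` is hypothetical. Each original feature is kept, and one trait-scaled copy of it is appended per trait.

```python
import numpy as np


def augment_with_traits(features: np.ndarray, traits: np.ndarray) -> np.ndarray:
    """Append trait-scaled copies of every feature (illustrative sketch).

    features: (n_docs, n_features) original feature values.
    traits:   (n_docs, n_traits) continuous trait scores of each document's author.
    Returns an array of shape (n_docs, n_features * (1 + n_traits)).
    """
    if features.shape[0] != traits.shape[0]:
        raise ValueError("features and traits must describe the same documents")
    # Broadcast to (n_docs, n_traits, n_features): every feature times every trait.
    interactions = traits[:, :, None] * features[:, None, :]
    interactions = interactions.reshape(features.shape[0], -1)
    # Keep the original columns so each feature exists with and without traits.
    return np.hstack([features, interactions])


# Toy usage: two documents, two features, one trait score per author.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
T = np.array([[0.5], [-1.0]])
X_aug = augment_with_traits(X, T)
```

Keeping the unscaled columns lets a downstream classifier fall back on the trait-independent signal when the trait score is uninformative.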
A commonly used technique is to define the "context" for each node, for example by random walks, and train a predictive model to perform context prediction. Similar network-based learning is employed in node2vec (Grover and Leskovec, 2016). Yang and Eisenstein (2017) propose to use neural attention mechanisms in a social graph over followers, mentions and retweets, to leverage linguistic homophily.
However, the user modeling approaches discussed so far focus on finding one representation for one user. A modern, personalized NLP system should be able to capture not only the inherent semantic aspects of the analyzed discourse together with the latent vectorial representations of user characteristics, but also contextual user profiles based on the identity sought in their current social microenvironment. Strengthened industry-academia cooperation is crucial in such data collection (more on this in Sec. 7). Assuming access to a larger online history of each user, we could draw a parallel to the design of contextual word embeddings (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019), which train neural networks as language models, then use the context vectors provided for each word token as pretrained word vectors. With an increasing number of online corpora containing user metadata, we can use recurrent or attentive neural networks to create large-scale social representations of users in a similar manner, allowing multiple pretrained "senses" of each user identity (vector representations of user conversational styles, opinions, interests, etc.) and treating those representations as dynamically changing in different social contexts. These representations can then be matched to new users based on the sparse linguistic, sociodemographic, psychological, and network information available, and fine-tuned on the context of a given task in a given social microenvironment, e.g. based on the stable part of the personal vectorial representation of the other users present in the conversation.
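The random-walk definition of node "context" mentioned at the start of this section can be sketched as follows. This is an illustration under simplifying assumptions: the walks are uniform, whereas node2vec uses biased second-order walks, and the graph, the user names, and the helpers `random_walk` and `context_pairs` are all hypothetical. The resulting (user, context user) pairs are what a context-prediction model would be trained on.

```python
import random
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]  # user -> list of connected users


def random_walk(graph: Graph, start: str, length: int, rng: random.Random) -> List[str]:
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1], [])
        if not neighbours:  # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk


def context_pairs(walk: List[str], window: int) -> List[Tuple[str, str]]:
    """(user, context user) pairs for all users within `window` steps on the walk."""
    pairs = []
    for i, user in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((user, walk[j]))
    return pairs


# Toy social graph (hypothetical users); edges could be follows or mentions.
graph: Graph = {"ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob"]}
walk = random_walk(graph, "ann", length=5, rng=random.Random(0))
pairs = context_pairs(walk, window=1)
```

In a skip-gram style setup, these pairs would serve as positive training examples for predicting a user's neighbours from that user's embedding.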

Evaluation
Currently, most of the NLP ground truth exists in a vacuum, "for everyone". Our systems typically use labels obtained as an average or majority vote provided by a number of anonymous annotators, even for tasks where they highly disagree (Waseem, 2016; Stab and Gurevych, 2014). As pointed out in Bender and Friedman (2018), we rarely get to know anything about the annotators other than whether they were "expert" (read: undergrad students vs. lab colleagues). If we truly aim at personalizing NLP systems, the first step is understanding who the recipients of our system decisions are. In contrast to IR, where the user of the interpreted result is normally the author of the query, in NLP the use cases vary. For example, rather than merely labeling a piece of text as "sarcasm", we should ask (A) Did the author mean this statement as sarcasm? (B) Was this understood by others as sarcasm? What kind of users interpret this statement as sarcasm?
In the tasks of type A, it is sensible to ask the authors themselves about the intended label (e.g. Are we correct this was a joke / positive review / supportive argument?). We should further assess the value of the system personalization. E.g. a user may prefer a model that correctly interprets her sarcasm even when most annotators typically don't recognize it. We can take inspiration from subjective measures used in evaluating spoken dialogue systems, such as A/B testing (Kohavi et al., 2014), customer satisfaction (Kelly et al., 2009; Kiseleva et al., 2016) or interestingness (Harrison et al., 2019; Oraby et al., 2018).
Yet most of the tasks are of type B, where we implicitly try to label how a piece of text is perceived by others (e.g. hate speech, assertiveness, persuasiveness, hyperpartisan argumentation). Given that these "others" vary in their judgments (Kenny and Albright, 1987) and this variation is informative for NLP models (Plank et al., 2014; Chklovski and Mihalcea, 2003), I suggest we start caring in NLP explicitly about who these "others" are, and evaluate our models with respect to labels assigned by defined target groups of users (e.g. with regard to sociodemographics, personality, expertise in the task) rather than one objective truth. Initial exploration of this area has begun e.g. for perceived demographics (Volkova and Bachrach, 2016; Carpenter et al., 2017) and natural language inference (Pavlick and Kwiatkowski, 2019).
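One possible way to operationalize evaluation against defined target groups is sketched below. It is only an illustration, not a proposed standard metric: the group names, the labels, and the helpers `group_gold_labels` and `accuracy_per_group` are hypothetical. Instead of collapsing all annotations into a single majority vote, it builds a separate reference label per annotator group and reports accuracy against each.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Annotation = Tuple[str, str, str]  # (item_id, annotator_group, label)


def group_gold_labels(annotations: List[Annotation]) -> Dict[str, Dict[str, str]]:
    """Majority label per (group, item); ties go to the label seen first."""
    votes: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for item_id, group, label in annotations:
        votes[(group, item_id)][label] += 1
    gold: Dict[str, Dict[str, str]] = defaultdict(dict)
    for (group, item_id), counts in votes.items():
        gold[group][item_id] = counts.most_common(1)[0][0]
    return dict(gold)


def accuracy_per_group(predictions: Dict[str, str],
                       annotations: List[Annotation]) -> Dict[str, float]:
    """Accuracy of the predictions against each group's own reference labels."""
    scores = {}
    for group, labels in group_gold_labels(annotations).items():
        scored = [item for item in labels if item in predictions]
        if scored:
            correct = sum(predictions[item] == labels[item] for item in scored)
            scores[group] = correct / len(scored)
    return scores


# Toy data: two (hypothetical) annotator groups disagree on item t1.
annotations = [
    ("t1", "younger", "sarcasm"), ("t1", "younger", "sarcasm"),
    ("t1", "older", "literal"),
    ("t2", "younger", "literal"), ("t2", "older", "literal"),
]
predictions = {"t1": "sarcasm", "t2": "literal"}
scores = accuracy_per_group(predictions, annotations)
```

A personalized model could additionally emit one prediction per target group, in which case each group's predictions would be scored against that group's reference labels only.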

Ethical considerations
The ability to automatically approximate personal characteristics of online users in order to improve language understanding algorithms requires us to consider a range of ethical concerns.
Unfair use prevention. It is almost impossible to prevent abuse of a technology once it is released, even when it was developed with good intentions (Jonas, 1983). Hence it may be more constructive to strive for an informed public, addressing the dual use danger with a preemptive disclosure (Rogaway, 2015; Hovy and Spruit, 2016): letting potential abusers know that certain illegal and unethical purposes of using personalized models are not supported, and letting potential users know about the risk. For example, the European Ethics Guidelines for Trustworthy AI foresee that "Digital records of human behaviour may allow AI systems to infer not only individuals' preferences, but also their sexual orientation, age, gender, religious or political views." and claim that "it must be ensured that data collected about them will not be used to unlawfully or unfairly discriminate against them."
Incorrect and stereotypical profiling. Sociodemographic classification efforts risk invoking stereotyping and essentialism. Such stereotypes can cause harm even if they accurately reflect average differences (Rudman and Glick, 2012). These can be emphasized by the semblance of objectivity created by the use of a computer algorithm (Koolen and van Cranenburgh, 2017). It is important that we control for variables in the corpus as well as for our own interpretation biases.
Privacy protection. Use of any data for personalization should be transparent. Even public social media data should be used with consent and in an aggregated manner; no individual posts should be republished (Hewson and Buchanan, 2013). Regarding explicit consent, research should take account of users' expectations (Williams et al., 2017; Shilton and Sayles, 2016; Townsend and Wallace, 2016). A similar issue is discussed by Smiley et al. (2017) regarding NLG ethics, as NLG systems can incorporate the background and context of a user to increase the communication effectiveness of the text, but as a result may be missing alternative views. They suggest addressing this limitation by making users aware of the use of personalization, similar to addressing provenance.
Role of industry and academia in user data collection. Privacy and controllability are auxiliary tasks to personalization and adaptation (Torre, 2009). Strictly protecting user privacy when collecting user data for model personalization is of utmost importance for preserving user trust, which is why, perhaps counter-intuitively, I encourage stronger industry-academia collaborations to facilitate a less intrusive data treatment. Inspiration can be taken from the concept of differential privacy (Dwork, 2008), applied e.g. in differentially private language models (McMahan et al., 2017), which allow customization for the user without incorporating her private vocabulary information into the public cloud model. Similarly, doing academic research on personalized NLP classification tasks directly within industry applications such as mobile apps, with explicit user consent, would enable transparent experiments at scale, being potentially more secure than gathering and manipulating one-time academic data collections offline. It may also contribute to better generalizability of the conclusions than strictly academic case studies, which are typically limited in scale.

Personalization as a harmful ambiguity layer
Given the field's bias towards reporting personalization results only when successful, no "unpersonalizable" tasks have been defined so far. With that, one question remains open: can we benefit from personalization everywhere across NLP, or are there cases where subjective treatment of language is not desired, or even harmful? E.g., a legal text should remain unambiguous in its interpretation. On the other hand, the ability to understand it is subjective, and some users may appreciate lexical simplification (Xu et al., 2015). Are there objective NLP tasks as such, or can we segment all of them into an objective and a subjective part of the application?

Conclusion
Building upon Eisenstein (2013), Lynn et al. (2017), and Hovy (2018), I argue that, following the historical development in areas related to NLP, users are also ready for the personalization of text classification models, enabling more flexible adaptation to truly processing their "natural" language rather than enforcing a uniform NLP treatment for everyone. Reflecting the current possibilities with available web and mobile data, I propose to expand the existing user modeling approaches in deep learning models with contextual personalization, mirroring different facets of one user in dynamic, socially conditioned vector representations. Modeling demographic and personal variables as dynamic and social will allow us to reflect the variety of ways individuals construct their identity through language, and to conduct novel sociolinguistic experiments to better understand the development of online communities. I also suggest shifting the focus of our evaluation strategies towards the individual aims and characteristics of the end users of our labeling models, rather than aggregating all variations into objective truths, which will allow us to pay more attention to the social biases present in our models.