Would you Rather? A New Benchmark for Learning Machine Alignment with Cultural Values and Social Preferences

Understanding human preferences, along with cultural and social nuances, lives at the heart of natural language understanding. Concretely, we present a new task and corpus for learning alignments between machine and human preferences. Our newly introduced problem is concerned with predicting the preferable options from two sentences describing scenarios that may involve social and cultural situations. Our problem is framed as a natural language inference task with crowd-sourced preference votes by human players, obtained from a gamified voting platform. We benchmark several state-of-the-art neural models, along with BERT and friends on this task. Our experimental results show that current state-of-the-art NLP models still leave much room for improvement.


Introduction
The ability to understanding social nuances and human preferences is central to natural language understanding. This also enables better alignment of machine learning models with human values, eventually leading to better human-compatible AI applications Leslie, 2019;Rosenfeld and Kraus, 2018;Amodei et al., 2016;Russell and Norvig, 2016).
There exist a plethora of work on studying optimal decision-making under a variety of situations (Edwards, 1954;Bottom, 2004;Plonsky et al., 2019;. On the other hand, cognitive models of human decision-making are usually based on small datasets . Furthermore, these studies tend to only consider individuals in isolation. In contrast, we investigate the influence of cultural and social nuances for choice prediction at scale. In other words, we study the social preference as a whole, * First two authors contributed equally not those of an individual in isolation, which is arguably more challenging and largely unexplored. In this work, we propose a new benchmark dataset with a large number of 200k data points, Machine Alignment with Cultural values and Social preferences (MACS), for learning AI alignment with humans. Our dataset is based on a popular gamified voting platform, namely the game of 'would you rather?'. In this game, participants are given two choices and vote for the more preferable option. Examples from our dataset can be found at Table 1. To the best of our knowledge, our work is the first work to incorporate voting-based language games as a language understanding benchmark.
In many ways, our benchmark dataset is reminiscent of the natural language inference problem (MacCartney, 2009;Bowman et al., 2015), social commonsense reasoning (Sap et al., 2019) or other natural language understanding problems (Wang et al., 2018;Zellers et al., 2018). To this end, our problem is framed in a way that enables convenient benchmarking of existing state-of-the-art NLU models such as BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019).
That said, unlike many NLU datasets that rely on few annotators, the key differentiator lies in the fact that our dataset aggregates across hundreds or thousands and beyond for each data point. Options are also crowd-sourced and gamified which may encourage less monotonic samples, ie., encouraging players to come up with questionss that are difficult for other players. Additionally, our dataset comprises of country-level statistics, which enable us to perform cultural-level prediction of preferences.
Our Contributions All in all, the prime contribution of this work is as follows: • We propose a new NLU benchmark based on an online gamified voting platform.
• We propose several ways to formulate the problem, including absolute and relative preference prediction. We also introduce a cultural-level NLU problem formulation.
• We investigate state-of-the-art NLU models such as BERT (Devlin et al., 2018), RobERTA (Liu et al., 2019) and XLNET (Yang et al., 2019) on this dataset. Empirical results suggests that our benchmark is reasonably difficult and there is a huge room for improvement.

Learning Alignment with Human Preferences
This section describes the proposed dataset and problem formulation.

Dataset
We look to crowdsourcing platforms to construct our dataset. Our dataset is constructed from https://www.rrrather.com/, an online platform 1 for gamified voting. The platform is modeled after the famous internet game -would you rather?, which pits two supposedly comparable choices together. Whenever a player votes, their vote is recorded in the system. Players generally vote to see how well their vote aligns with the majority and consensus with everyone else. We provide samples of the problem space in Table 1. We crawled data from the said platform and filtered away posts with less than 500 total votes. In total, we amassed 194,525 data points, which we split into train/dev/test splits in an 80/10/10 fashion. Dataset statistics are provided in Table 2.

Why is this interesting?
This section outlines the benefits of our proposed dataset as a language understanding benchmark.
(1) Understanding before Interaction. In our dataset and problem formulation, complex understanding of each option text is often required first before modeling the relative preference between two options. This is unlike NLI or questionanswering based NLU benchmarks, where matching signals can be used to predict the outcome easily. In our dataset and task, it is imperative that any form of word overlap can be hardly used to determine the outcome.
(2) A good coverage of social preferences. Upon closer inspection of our proposed benchmark, we find there is a good representation of samples which cover social and cultural themes.
Social preferences (such as the preference of brands) are captured in samples such as example (6).
(3) Completely natural. Our MACS dataset completely exists in the wild naturally. This is unlike datasets that have to be annotated by mechanical turkers or paid raters. In general, there is a lack of incentives for turkers to provide highquality ratings, which often results in problems such as annotation artifacts. Unlike these datasets, our MACS dataset completely exists in the wild naturally. The choices are often created by other human players. Hence, in the spirit of competitiveness, this means that the data is meant to be deliberately challenging. Moreover, there are at least 500 annotators for each sample, which makes the assigned label less susceptible to noisy raters.

Problem Formulation
Given Q (prompt), two sentences S1 and S2 and V (.) which computes the absolute votes to each option, we explore different sub-tasks (or variant problem formulation).
Predicting Preference This task is primarily concerned with predicting if V (S1) > V (S2) or otherwise. Intuitively, if a model is able to solve this task (perform equivalent to a human player), we consider it to have some fundamental understanding of human values and social preferences. We frame this task in two ways. The first is a straightforward binary classification problem, i.e., V (S1) > V (S2). The second task is a three-way classification problem with a third class predicting if the difference |V (S1) − V (S2)| is less than 5% of the total votes. In short, this means that two options are almost in a draw.

Prompt
Option A Option B (1) Would you rather fit into any group but never be popular only fit into the popular group (2) Would you rather have no one attend your funeral wedding (3) Would you rather have free starbucks for an entire year free itunes forever (4) Would you rather Look unhealthy and unattractive, but be in perfect health.
Be absolutely beautiful and look healthy, but be in extremely bad health. (5) Would you rather Win the lottery Live twice as long (6) Would you rather have a Mac a PC (7) Would you rather spend the day Surfing on the ocean Surfing the Internet   (Devlin et al., 2018), XLNEt (Yang et al., 2019) and RoBERTa (Liu et al., 2019) on MACS dataset.

Predicting Cultural Preferences
We consider a variant of the preference prediction problem. Our MACS dataset has culture-level preference votes which are the voting scores with respect to a particular cultural demographic. We extend the same setting as Task 1 by requiring the model to produce culture-level predictions. In order to do this, we prepend the input sentence with a culture embedding token. The dataset is augmented at the culture level and the same example is duplicated for each culture, e.g., we duplicate the sample for countries 'USA' and 'Europe'. We consider only culturelevel votes with a threshold above 25 votes in the dataset for train/dev/test sets.

Predicting Relative Preference
The third variant is a fine-grained regression task where we want to identify if our model is able to learn the extent of preference given by human players. This problem is framed as a regression problem that is normalized from [0, 1] with respect to the total number of votes in the data point

Experiments
This section outlines our experimental setup and results.

Experimental Setup
We implement and run several models on this dataset. is a robustly optimized improvement over the vanilla BERT model. All models were run using the finetune methodology using the standard Pytorch Huggingface 2 repository. We train (finetune) all models for 3 epochs using the default hyperparameters..

Metrics
The evaluation metrics for classification tasks is the standard accuracy score. For regression tasks, we use the correlation, Pearson, and Spearman metrics. Table 3 reports our results on binary and three-way classification on the MACS dataset. In general, we find that RoBERTa performs the best. However, in most cases, the performance of all three models still leaves a lot to be desired. An accuracy of 60%+ shows that state-of-the-art models still struggle at this task. On the other hand, results on regression task are also similarly lacklustre, and   show that models like BERT and RoBERTa are unable to perform well on this task. On a whole, it is good to note that RoBERTa performs the best out of the three compared models.

Experimental Results
Overall, this encourages further research on cultural and social commonsense reasoning in the current state-of-the-art in natural language understanding. All in all, we hope our benchmark serves as a useful tool for understanding the social capabilities of these models. Table 5 reports some sample of our model outputs, shedding light on examples in which our model does well and otherwise. We observe that the model often gets the answer wrong even when the ground truth is overwhelmingly swayed towards one side. On the other hand, occasionally, we also observe that the model can get questionable questions such as (4) and (5) correctly even despite the tight draw between human voters.

Conclusion
We propose MACS (Machine Alignment with Cultural and Social Preferences), a new benchmark dataset for learning machine alignment with human cultural and social preferences. MACS encompasses and requires social and cultural reasoning to solve and an overall holistic understanding of humanity. It is designed to be challenging where state-of-the-art NLP models still struggle at ≈ 60%.

Broader Impact
In this paper, we are not promoting the use of https://www.rrrather.com/ as the training source, but rather the study of the alignment of machine learning models with social preference of a large population. Unfortunately, there might be some issues of bias, fairness and representation due to the curation of the training data from Internet, which might lead models to give prejudiced or stereotyped outputs. Evaluating bias, fairness and representation in language models and the training data is an important research area (Nadeem et al., 2020;Huang et al., 2019). As for future works, it is important to characterize and intervene biases when designing such tasks.