Toward Stance-based Personas for Opinionated Dialogues

In chit-chat dialogue, endowing systems with a persona profile has been shown to be important for producing more coherent and meaningful conversations. Still, the representation of such personas has thus far been limited to a fact-based representation (e.g. “I have two cats.”). We argue that these representations remain superficial w.r.t. the complexity of human personality. In this work, we take a step forward and investigate stance-based personas, trying to grasp more profound characteristics, such as opinions, values, and beliefs, to drive language generation. To this end, we introduce a novel dataset that allows us to explore different stance-based persona representations and their impact on claim generation, showing that they are able to grasp abstract and profound aspects of the author persona.


Introduction
While chit-chat neural models have obtained impressive improvements in recent years, they are known to suffer from key limitations: they tend to lack specificity and to lose coherence as the conversation unfolds, becoming less captivating. One explanation is that they do not have a consistent personality; for this reason, some approaches proposed to explicitly encode the persona via a small set of claims describing the characteristics of the agent, such as "My dad has a car dealership" or "I have two cats" (Zhang et al., 2018a). Such representations provide a fact-based background context useful to drive and ground the relevance of the conversational acts for the dialogue at hand, but with little generalization capability. Pushing this approach a step further, we thus investigate the construction of stance-based personas, in order to grasp profound and intimate characteristics such as opinions, values, and beliefs. This could allow agents to sustain personal points of view both within the same conversation and across different discussions.
In this paper, we make a first attempt at representing persona with different approaches and levels of abstraction. We build a new conversational dataset from a social platform dedicated to argumentative interaction, and report experiments for stance-based personas with varying degrees of abstraction (e.g. implicit and explicit stance representation). Our experiments show that stance-based personas enable the agents to intervene, consistently with their representation, across topics unseen at training time.

Related Work
Dialogue datasets and approaches Open-domain dialogue or chit-chat scenarios were considered intractable problems until recently. The research community has made significant progress thanks to two factors: (i) large datasets and (ii) end-to-end neural approaches based on pre-trained language models. In particular, the idea of using large pre-trained language models fine-tuned on dialogue tasks has proved very effective (Zhang et al., 2019b; Wolf et al., 2019b). TransferTransfo (Wolf et al., 2019b) used the GPT-2 language model (Radford et al., 2019) with further pre-training over the BooksCorpus dataset (Zhu et al., 2015) and fine-tuning over dialog examples to win the ConvAI2 2018 competition (Dinan et al., 2020).
The advantage of pre-trained, transformer-based language models is that they can capture long-term dependencies and generate texts that are fluent, varied, and rich in content, mitigating many of the limitations of previous neural dialogue models, such as content inconsistency (Zhang et al., 2019a; Gao et al., 2018), lack of long-term contextual coherence (Serban et al., 2017), and blandness (Zhang et al., 2018b; Qin et al., 2019).
Figure 1: Example of a Kialo discussion (https://www.kialo.com/artificial-intelligence-ai-limiting-an-ais-freedom-of-thought-is-unethical-15943). On the top, the thesis claim; below, the pro and con arguments.
Persona approaches were recently developed with the introduction of end-to-end dialog systems based on Memory Networks, which allow the persona profile to be encoded as a simple list of statements. One of the first datasets specifically developed for persona-based dialogues was released by Zhang et al. (2018a). Another approach consists in modeling a system persona in terms of interaction style (e.g. formal vs. informal register), as used in goal-oriented settings by Joshi et al. (2017) and Luo et al. (2019) to provide personalized interactions. Further, Guerini et al. (2018) showed how injecting these specific persona-related aspects into a conversation can positively affect the interaction in goal-oriented scenarios, both in terms of quality of service and overall perceived quality.
Argumentation and persuasion The relation between argumentation and the language employed has been extensively studied in social sciences and psychology (Miller et al., 1976; Chaiken, 1979, 1980). In Natural Language Processing, Computational Argumentation is an emerging discipline (Reed, 2016; Lippi and Torroni, 2016), wherein various sub-tasks, such as argument detection (Ein-Dor et al., 2019) and stance detection (Bar-Haim et al., 2017), have been explored. Tan et al. (2016) and Habernal and Gurevych (2016) developed computational methods to determine the linguistic characteristics used to emphasize arguments and study the quality of arguments (Gretz et al., 2019). Durmus et al. (2019b) proposed a dataset to investigate the effect of the pragmatic and discourse context when determining argument quality. Durmus et al. (2019a) studied more complex argumentative structures, without limiting themselves to a single claim.

The Kialo Dataset
The construction of a stance-based persona requires a deeper look into the opinions, beliefs, and stances of an author, expressed through textual claims possibly across different topics. To this end, turning to transactional crowd-sourcing approaches is in our opinion not ideal: asking crowd-workers to publish private opinions is ethically questionable, while inducing them to engage meaningfully across several topics poses challenges from a design perspective. Last but not least, collecting a significantly sized dataset would require a substantial budget that can easily amount to hundreds of thousands of dollars. For these reasons we turned our attention to Kialo, a public discussion platform letting its users debate in a constructive and rational way with peers. The discussions in Kialo cover a wide range of topics, from economic or political issues to philosophy, religion or even science fiction. All these elements make it the ideal resource for our goals. In Kialo, the users can easily inspect every aspect and claim of a discussion through a tree-shaped structured visualization and decide where to intervene. In this tree, the top node is defined as the thesis claim and each claim in the tree supports or opposes its parent claim, i.e. it is labeled pro or con. An example discussion is shown in Figure 1.
We have collected 1,580 English discussions and 241,882 unique claims in these discussions. 3 The number of unique claims in the collected discussions varies widely (µ = 153.08, σ = 269.58), as does their depth (µ = 6.31, σ = 4.79). Considering the structure of the discussions in Kialo, each sample in the dataset we collected is composed of author id, claim id, claim, stance label, parent id, and parent claim. In this respect, the instances in the dataset are similar to single-turn dialogues.
For our experiments below, we sampled 5% of discussions for the test and 5% for the validation sets, resulting in 79 discussions for validation, 79 for test, and 1,422 for training. The sampling has been conducted in a stratified fashion according to the number of the claims in each discussion.
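As an illustration of a split stratified by discussion size, the following is a minimal sketch and not the procedure actually used: it assumes that discussions are sorted by their number of claims, cut into consecutive groups of 20, and that one discussion per group is drawn for validation and one for test (5% each). The function name `stratified_split` and the grouping scheme are hypothetical.

```python
import random

def stratified_split(claim_counts, group_size=20, seed=0):
    """Split discussion ids into train/valid/test, stratified by size.

    claim_counts maps a discussion id to its number of claims.
    Assumption: strata are consecutive groups of `group_size` discussions
    after sorting by claim count; each full group gives one discussion to
    validation and one to test (5% each when group_size is 20).
    """
    rng = random.Random(seed)
    ordered = sorted(claim_counts, key=lambda d: (claim_counts[d], d))
    train, valid, test = [], [], []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        if len(group) < group_size:
            # An incomplete last group stays entirely in training.
            train.extend(group)
            continue
        held_out = rng.sample(group, 2)
        valid.append(held_out[0])
        test.append(held_out[1])
        train.extend(d for d in group if d not in held_out)
    return train, valid, test
```

With 1,580 discussions this grouping yields 79 validation, 79 test, and 1,422 training discussions, matching the sizes reported above.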

Persona Statistics
To build persona representations, we started from each author id and the claim(s) they wrote. During the design phase, we quantified the activity of the authors. In total, 18,255 authors have contributed to the discussions with various numbers of claims, ranging from a single claim to a maximum of 6,123 claims. The distribution of contributions is, as could be expected, rather skewed: in the training set, 8,569 authors have only 1 claim, making it difficult to effectively construct a persona representation; conversely, 3,776 authors have 5 or more unique claims in the training set.
3 The data was collected on March 10, 2020.
We conducted an instance-level persona analysis on the dataset, and observed that the majority of the instances have been written by the authors with 5 or more claims (90% in training, 82% for validation, and 74% for testing). On the other hand, 4% of training, 11% of validation, and 14% of the test instances have been written by authors who have no other claims. Consequently, we propose treating the persona with different sizes as separate conditions. While it is inevitable to segregate the instances written by the authors without any claim in the training set (No Persona) from the rest, we also define a threshold T to distinguish authors with few (< T ) claims (Small Persona) from those with many (>= T ) claims (Big Persona). In this work, we set T = 5. This provides us with the possibility of analyzing the impact of the persona size.
To avoid leakage, the persona of an author is built exclusively from their claims in the training set. The number of instances in each set grouped by the persona sizes is reported in Table 1.
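The three persona-size conditions can be sketched as follows. This is a minimal illustration under the definitions above (T = 5, personas counted on training claims only); the function names and the input format, a list of author ids with one entry per training claim, are assumptions made for the example.

```python
from collections import Counter

def count_training_claims(training_author_ids):
    """Count how many training claims each author wrote."""
    return Counter(training_author_ids)

def persona_condition(n_training_claims, threshold=5):
    """Map an author's number of training claims to a persona-size condition."""
    if n_training_claims == 0:
        return "No Persona"
    if n_training_claims < threshold:
        return "Small Persona"
    return "Big Persona"
```

For an instance of the validation or test set, the condition is looked up from the training counts of its author, so an author unseen at training time falls into No Persona.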

Persona Representations
Further, we designed two persona representations with respect to the claims and the theses.
Explicit persona (P exp ) The persona for a Kialo author can be explicitly constructed using a set of claims written by the same author in the training set. With this representation, we can grasp the opinions of an author in a fine-grained manner. The explicit persona representation is in line with the approach of Zhang et al. (2018a), encoding the persona with multiple sentences (5) of textual description. The No Persona, Small Persona, and Big Persona distinction has been applied to the explicit persona.
Implicit persona (P imp ) We hypothesize that a persona can be represented at a more abstract level, propagating the stance of an author up to a thesis claim, starting from the pro or con labels of their claims in the corresponding discussion. In practice, we consider that the con child of a pro claim of a thesis would be opposing that thesis as well. Since propagating pro and con labels of these deeper claims from the same author might end up in different stances for the thesis claim, we represent the implicit persona of an author as the thesis claim with the counts of their pro and con claims.
parent claim: There is historical evidence that Jesus Christ existed, thus there is historical evidence that supports the existence of God.
random explicit persona (P exp,random ): There is no evidence to support the assertions of Islam. [SEP] Civil strife refers to people 's reaction to the results , not how orderly the process was .
dynamic explicit persona (P exp,dynamic ): Even if there was a historical person named Jesus of Nazareth , that does not support the idea that he was a god of some kind . [SEP] There is no evidence to support the assertions of Christianity .
negative explicit persona (P exp,negative ): The electoral college victories under Bush and Trump have caused tumult and disorder . [SEP] The first amendment does not apply to public land as has been decided time and time again .
implicit persona (P imp ): pro: 1 - con: 0 - text: Military conscription should apply to men and women equally. [SEP] pro: 14 - con: 3 - text: Conscientious objection to abortion should be banned [SEP] pro: 37 - con: 18 - text: Judaism [SEP] pro: 1 - con: 4 - text: Capital punishment should be abolished in the United States.
Table 2: Different persona representations for the same parent claim and author id. For the sake of conciseness we report only the first two claims for each explicit persona representation.
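The stance propagation behind the implicit persona can be sketched as below. This is a minimal illustration under two assumptions that the text does not spell out: the stance of a claim toward the thesis is obtained by multiplying signs along the path to the root (so a con reply to a con claim counts as pro), and the serialized format mirrors the one shown in Table 2. The dictionary layout and the function names are hypothetical.

```python
from collections import defaultdict

def thesis_stance(claim_id, claims):
    """Return (stance toward the thesis, thesis text) for a claim.

    claims maps a claim id to a dict with keys author_id, parent_id,
    stance ("pro" or "con") and text; the thesis has parent_id None.
    Assumption: pro = +1, con = -1, multiplied along the path to the thesis.
    """
    sign = 1
    node = claims[claim_id]
    while node["parent_id"] is not None:
        sign *= 1 if node["stance"] == "pro" else -1
        node = claims[node["parent_id"]]
    return ("pro" if sign > 0 else "con"), node["text"]

def implicit_persona(author_id, claims):
    """Serialize per-thesis pro/con counts for one author."""
    counts = defaultdict(lambda: {"pro": 0, "con": 0})
    for claim_id, claim in claims.items():
        # Skip other authors and the thesis claims themselves.
        if claim["author_id"] != author_id or claim["parent_id"] is None:
            continue
        stance, thesis = thesis_stance(claim_id, claims)
        counts[thesis][stance] += 1
    return " [SEP] ".join(
        f"pro: {n['pro']} - con: {n['con']} - text: {thesis}"
        for thesis, n in counts.items()
    )
```

For example, an author who wrote a con reply to a pro claim, and then a con reply to that con reply, ends up with one con and one pro count for the corresponding thesis.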

Model
We frame our problem as a text generation task, where the probability to generate a sequence Y composed of N tokens, $y_0, \ldots, y_N$, is given by:
$$p(Y \mid C, P; \Theta) = \prod_{t=0}^{N} p(y_t \mid y_{<t}, C, P; \Theta) \qquad (1)$$
where Θ are the learnable parameters, C the parent claim and P the persona. Following previous works on conditional text generation, we use a sequence to sequence model, which is composed of an encoder and a decoder. In particular, we used a transformer architecture (Vaswani et al., 2017) pretrained on a large corpus (Radford et al., 2019; Raffel et al., 2019), as detailed in Section 4.3. To encode multiple inputs (i.e. P and C), we follow Dong et al. (2019) and Raffel et al. (2019) and represent the input as the concatenation of the persona P and the parent claim C, separated by a special token [SEP], rather than representing the persona in a separate memory.
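The input construction can be illustrated with a short sketch. The exact serialization is an assumption: persona claims are joined with the [SEP] token, as in Table 2, and the parent claim is appended after a final [SEP]; `build_input` is a hypothetical helper name.

```python
SEP = " [SEP] "

def build_input(persona_claims, parent_claim):
    """Concatenate the persona P and the parent claim C into one input string.

    Assumption: persona first, parent claim last, all parts separated by [SEP].
    With an empty persona (baseline / No Persona), only the parent claim is kept.
    """
    parts = list(persona_claims) + [parent_claim]
    return SEP.join(parts)
```

The resulting string is what the encoder would read, while the decoder is trained to produce the reply claim (or the stance token in the classification setting).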

Explicit Persona Selection
For some authors, the explicit persona P exp can contain over a thousand claims (see Section 3).
The concatenation of all these claims would be too long to be encoded within a transformer, given that the computational cost of its attention mechanism is quadratic w.r.t. the length of the sequence. For this reason, we limit the number of claims per persona to a maximum of 5. For personas containing more than 5 claims, we propose three different selection strategies:
• Random (P exp,random ): among the total claims of an author, we randomly select 5.
• Dynamic (P exp,dynamic ): inspired by the Information Retrieval literature, we used BM25 (Robertson and Jones, 1976), considering all the author claims as the corpus and the parent claim as the query. We then retrieve the 5 persona claims most similar to the input.
• Negative (P exp,negative ): we follow the same procedure as for Dynamic above, but consider the 5 least similar persona claims. This allows us to measure whether broader correlations emerge across distant topics.
In Table 2 we present an example of various persona representations built starting from a unique parent claim and author id combination.
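The three selection strategies can be sketched with a small, self-contained BM25 scorer. This is an illustration only: it uses whitespace tokenization, a common Okapi BM25 variant, and default parameters k1 = 1.5 and b = 0.75, all of which are assumptions rather than the settings of the BM25 implementation used in our experiments. The function names are hypothetical.

```python
import math
import random
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document of `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    if not docs:
        return []
    avg_len = (sum(len(d) for d in docs) / len(docs)) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(query.lower().split())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def select_persona(parent_claim, author_claims, k=5, strategy="dynamic", seed=0):
    """Pick at most k persona claims with the random, dynamic or negative strategy."""
    if len(author_claims) <= k:
        return list(author_claims)
    if strategy == "random":
        return random.Random(seed).sample(list(author_claims), k)
    scores = bm25_scores(parent_claim, author_claims)
    # Dynamic keeps the most similar claims, negative the least similar ones.
    order = sorted(range(len(author_claims)), key=lambda i: scores[i],
                   reverse=(strategy == "dynamic"))
    return [author_claims[i] for i in order[:k]]
```

Under this sketch, the Hybrid setting discussed later corresponds to calling `select_persona` with the random strategy at training time and the dynamic strategy at inference time.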

Decoding method
While usually not learned (Negrinho et al., 2018), the decoding strategy is known to be critical and to largely affect the produced outputs. The most common approaches are beam search (Reddy et al., 1977) and sampling. Beam search is used to find the output that maximises the model probability, while sampling offers more diversity. However, the latter is very likely to sample from the tail of the distribution, making this method less reliable. To mitigate this limitation, top-k filtering and, more recently, nucleus sampling (Holtzman et al., 2020) have been proposed. Nucleus sampling is an adaptive method to filter the tail of the distribution: it keeps only the most probable tokens whose cumulative probability mass reaches a threshold p. To the best of our knowledge, this decoding method yields the most realistic generation outputs; therefore we used it for all our experiments.
Table 3: F1 scores obtained on the stance classification task. The baseline model has only access to the claim, while P exp,random has also access to the author persona. All indicates results over the entire test set, followed by results on the three subsets described in Section 3.
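A single step of nucleus sampling can be sketched as follows. This is a minimal illustration over a plain list of next-token probabilities; the threshold value 0.9 is an assumption chosen for the example, and the function names are hypothetical.

```python
import random

def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of most probable tokens whose mass reaches top_p.

    probs is a list of next-token probabilities indexed by token id.
    Returns a dict {token id: renormalized probability}.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        mass += p
        if mass >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample one token id from the filtered, renormalized distribution."""
    filtered = nucleus_filter(probs, top_p)
    token_ids = list(filtered)
    weights = list(filtered.values())
    return rng.choices(token_ids, weights=weights, k=1)[0]
```

Unlike top-k filtering, the number of tokens kept varies with the shape of the distribution: a peaked distribution keeps few candidates, a flat one keeps many.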

Implementation details
All the experiments were conducted with T5-small (60 million parameters). T5-small is a smaller version of T5, a text generation model with state-of-the-art results on challenging Language Understanding tasks. For our experiments, we used the Hugging Face implementation of T5 (Wolf et al., 2019a), and for BM25 the implementation of Trotman et al. (2014).

Preliminary Study: Stance Classification
Given a parent claim, the answer eventually provided by an author can be either pro or con, but their stance cannot be inferred without knowing something about the author who wrote it. Thus, if the stance-based persona allows us to grasp at least the generic position of an author about a topic, it should be predictive of the stance taken by them in the reply claim. We tested this hypothesis in a preliminary experiment, where the task is to learn a function that, given only a parent claim and a persona representation, is able to predict the pro or con label for the provided answer. Following the T5 paradigm (Raffel et al., 2019), we consider this classification problem as a text to text task: given Eq. 1, the model learns to predict the category Y , corresponding to the token pro or con in the vocabulary.
First, we trained a baseline model, given only the parent claim. We expect it to perform poorly, e.g. by learning the most probable label if there is a clear majority of stances about a certain topic (e.g. if the Kialo community is mainly against the death penalty). Then, we trained a second model, P exp,random , which can access, in addition to the parent claim, the random author persona.
The results reported in Table 3 show a clear benefit from adding persona information. We observe how, even on the "No Persona" subset of the test samples, the persona information ingested at training time allows P exp,random to perform significantly better than the baseline model.
Moreover, from the ablations on No/Small/Big persona subsets of the test samples, we see that the relative improvements obtained by P exp,random are proportional to the persona size, a fact that further supports our working hypothesis.

Metrics
By far the most used metrics for text generation tasks are BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), both based on n-gram similarity. BLEU stands for BiLingual Evaluation Understudy and is precision oriented, since it was designed to evaluate automatic translation systems. Conversely, ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and was designed to evaluate summarization systems. These metrics have been widely used for other text generation tasks such as generating captions (Vinyals et al., 2015b), questions (Du et al., 2017; Scialom and Staiano, 2019) or poems (Zhang and Lapata, 2014).
However, it is well known that these metrics have important limitations (Wang et al., 2016; Paulus et al., 2017; Scialom et al., 2020): while only one or a few ground truth references are available, many outputs are actually plausible; BLEU metrics do not reflect meaning preservation (Sulem et al., 2018) and do not map well to human judgements (Novikova et al., 2017). In order to measure other aspects of the generation, complementary metrics are frequently used (See et al., 2017). Following their recommendation, we also report: length, the number of tokens in the output; repetition, the percentage of repeated n-grams in the output; and abstractiveness, the percentage of tokens in the output that were not present in the input text. These measures account for important dimensions not captured by ROUGE or BLEU. For instance, the copy mechanism (Vinyals et al., 2015a) makes abstractive models overly extractive (See et al., 2017), while still yielding state-of-the-art ROUGE.
Table 4: Results for the different models on the Kialo Dataset.
Figure 2: Cumulative BLEU-4 gain of P exp,hybrid vs. P exp,random . Note that these are the exact same trained models, but with a different selection strategy at inference time: the explicit persona is randomly selected for P exp,random , while it is switched to the dynamic one for P exp,hybrid .
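The three complementary measures can be sketched as below. This is a minimal illustration with whitespace tokenization; the n-gram order used for repetition (n = 3 by default here) and the lowercasing applied for abstractiveness are assumptions, since the text does not fix them.

```python
def length(output):
    """Number of whitespace-separated tokens in the output."""
    return len(output.split())

def repetition(output, n=3):
    """Percentage of n-grams in the output that repeat an earlier n-gram."""
    tokens = output.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (len(ngrams) - len(set(ngrams))) / len(ngrams)

def abstractiveness(output, source):
    """Percentage of output tokens that do not appear in the input text."""
    out_tokens = output.lower().split()
    source_vocab = set(source.lower().split())
    if not out_tokens:
        return 0.0
    novel = sum(token not in source_vocab for token in out_tokens)
    return 100.0 * novel / len(out_tokens)
```

Note that the result of `abstractiveness` depends on which text is passed as `source`: the parent claim alone, or the parent claim together with the persona.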

Quantitative Results
We trained the different models and report the main results in Table 4. The baseline model is the only one with no persona fed in the input. It is also the one performing the worst in terms of BLEU, ROUGE and Length.
Adding the implicit persona P imp to the input slightly improves over the baseline results. This is particularly interesting since P imp does not contain any text written by the author, as opposed to the explicit persona. Hence, the improvement cannot be related to the writing style of the author, but rather to the stance-content relations, taking advantage of previous topics of interest and the author's opinions. We observe larger BLEU and ROUGE gains with the explicit persona, increasing gradually from the negative to the random and the dynamic persona. As expected, the more the persona is related to the topic, the more it benefits the model, confirming the interest of a dynamic strategy. We also see that the dynamic strategy achieves the highest abstractiveness w.r.t. the parent claim. However, from a manual analysis, we note that the dynamic model often copies claims from its own persona. Nonetheless, this might still be an efficient strategy, as people might tend to repeat arguments across similar topics.

Hybrid Model
We conducted an additional evaluation for the model trained on random persona, by replacing at inference time the random persona with the dynamic one; we refer to this as the Hybrid model, P exp,hybrid . Surprisingly, we see that not only does it perform better than the random persona, but it also outperforms P exp,dynamic on Length, BLEU, and ROUGE metrics. We hypothesise that this model tended to copy less from the claims during training, and was forced to learn a more complex strategy, which seems to generalise better and to benefit from the dynamic context at inference.
In Figure 2 we report the cumulative gain in BLEU-4 obtained simply by switching the persona at inference time on the model trained with a random persona. We observe that the largest improvements come for persona sizes greater than 5: those are the most impacted by the selection strategy, since we limited the persona to a maximum of 5 claims, as explained in Section 4.1.
Zipf distribution While the baseline looks more abstractive in Table 4, this does not necessarily mean that the vocabulary used is more diverse. As a complementary analysis, we thus consider the Zipf distribution shown in Figure 3. We observe that the baseline distribution is the farthest from the human one, followed by P imp . Consistently with the ROUGE and BLEU metrics, P exp,hybrid achieves the best performance thanks to a more diverse vocabulary.
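A rank-frequency curve of the kind plotted for a Zipf analysis can be computed as sketched below. This is only an illustration with whitespace tokenization; how the curves of Figure 3 were actually obtained and compared is not specified here, and the function names are hypothetical.

```python
import math
from collections import Counter

def rank_frequencies(texts):
    """Token frequencies over a set of texts, sorted from most to least frequent."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return sorted(counts.values(), reverse=True)

def zipf_curve(texts):
    """(log rank, log frequency) pairs, the usual axes of a Zipf plot."""
    return [(math.log(rank), math.log(freq))
            for rank, freq in enumerate(rank_frequencies(texts), start=1)]
```

Computing one curve for the human-written claims and one per model, then plotting them on the same axes, shows which generated vocabulary distribution stays closest to the human one.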

Human Evaluation
To get a deeper understanding of our models, we also ran a human assessment of the outputs generated by the following model configurations, compared to the ground truth: i) the baseline model that has only access to the parent claim, without any persona; ii) P exp,dynamic , trained with parent claims and the explicit, dynamically selected persona; and iii) P exp,hybrid , trained with parent claims and the explicit, randomly selected persona, but fed at inference time with a dynamic selection of the persona (corresponding to the last row in Table 4).
Evaluation Protocol To evaluate each generated output w.r.t. the author persona, it is important to choose a neutral representation of this persona, so as to avoid favoring any model and biasing the human evaluation. We decided to use P imp , the implicit persona, which we believe is the most neutral amongst the 4 models we evaluate.
We randomly sampled 50 claims from the test set, under the constraint that the corresponding authors had provided at least 10 claims to the training set. The pool of eligible claims under this criterion amounts to 10,995 (out of the 11,689 in the test set) from 1,251 different authors. This ensures that a large persona representation can be built for all the selected samples. We asked three professional English speakers to score their relevance towards the implicit persona and the parent claim, on a Likert scale ranging from 1 to 5.
To assess relevance, the annotators were presented only with the sample to evaluate, paired with either the corresponding parent claim or the associated implicit persona.

Results
We report the results in Table 5. Consistently with the automatic evaluation, P exp,hybrid performs the best, while the baseline scores poorly for relevance toward both the persona and the parent claim. We also observe that P exp,dynamic achieves results similar to P exp,hybrid for the Persona score, while underperforming it w.r.t. the Parent Claim. This confirms our hypothesis (see Section 5.2.2) that while both models benefit from the dynamic representation of the persona at inference, P exp,dynamic learns during training to focus too frequently on the persona, a behavior which P exp,hybrid exhibits less.
Persona perception We asked the human evaluators to verbalize their interpretation of the implicit persona representation (P imp ) for a few examples, to see if it is actually perceived as meaningful by humans. Results are rather clear: the implicit representation is (i) perceived as meaningful by all annotators, and (ii) used to infer the possible position of the persona given a claim, even if not directly related to the claims in the persona representation. In Table 6 we report an example of the feedback provided by one evaluator.
Switching the persona We also conducted a qualitative experiment to observe the impact of the persona on the output. For a few claims, we manually modified the implicit persona and the stance label to see the effect of manual intervention. In Table 7 we report different outputs answering the same parent claim about Universal Basic Income (UBI). All personas successfully generated arguments on the topic, supporting or opposing it consistently with their profile. The 'artist' (P1) links creativity and financial needs, while the 'doctor' (P2) seems to connect the long time required to become a doctor with the need for a Universal Basic Income. Finally, the 'liberal' persona (P3) generates an argument opposed to UBI, in which they seem to connect the absence of free choice with the tendency of beneficiaries to stop working under UBI.
Implicit Persona (P i ): pro: 2 - con: 0 - text: Humans should stop eating animal meat. [SEP] pro: 1 - con: 6 - text: The US should not try to force North Korea to abandon its nuclear program. [SEP] pro: 1 - con: 3 - text: Private property should exist in outer space.
Annotator Feedback: "This persona seems to me a kind of vegan/anti-nuclear/hippy [...] to sum up something like a Californian democratic geek".
claim: On the Historicity of Jesus : Why We Might Have Reason for Doubt by Richard Carrier provides evidence that Jesus Christ did not exist.
Annotator Feedback: "I think this is relevant because we can expect our 'Californian geek' to be atheist but with a intellectual justification to the topic."
Table 6: How an implicit Persona is interpreted/perceived by annotators. The subsequent claim can receive a high score only if an inference is applied from the implicit persona. The annotator feedback suggests this is the case.
Table 7:
P1 "the artist"
Generated claim: Financial dimension is really deeply impacting their crash creative endeavors.
P2 "the doctor" (PRO)
Persona: Everyone should have access to medical care. [SEP] It takes time to become a doctor but it is a necessary condition so one is able to properly practice.
Generated claim: It takes time to become a doctor but it is a necessary condition so one is able to properly practice. Maintaining a Universal Basic Income is important.
P3 "the liberal" (CON)
Persona: Without liberalism, more crises would have occurred. [SEP] Liberalism and freedom have made the USA the most powerful and wealthy country in the world. Regulation and tax would damage this situation.
Generated claim: Without free choices that become illegal to not be held responsible, beneficiary chooses not to work.

Conclusions
Endowing dialogue agents with persona profiles is important to produce more coherent and meaningful conversations. In particular, we argue for using stance-based personas to drive language generation consistently with profound characteristics such as opinions, values, and beliefs. To this end, we introduced a novel dataset and explored diverse stance-based persona representations and their impact on claim generation.
In future work, we plan to enrich the persona representation with additional information available in Kialo (e.g. authors' votes on others' claims), to encode more complex profiles; further, we will extend the presented approach to multi-turn interactions, as enabled by the structure of Kialo discussions.