Profile Consistency Identification for Open-domain Dialogue Agents

Maintaining a consistent attribute profile is crucial for dialogue agents to naturally converse with humans. Existing studies on improving attribute consistency mainly explored how to incorporate attribute information in the responses, but few efforts have been made to identify the consistency relations between response and attribute profile. To facilitate the study of profile consistency identification, we create a large-scale human-annotated dataset with over 110K single-turn conversations and their key-value attribute profiles. The explicit relation between each response and its profile is manually labeled. We also propose a key-value structure information enriched BERT model to identify profile consistency, which gains improvements over strong baselines. Further evaluations on downstream tasks demonstrate that the profile consistency identification model is conducive to improving dialogue consistency.


Introduction
Despite the recent advancements in assigning attribute profiles to dialogue agents (Qian et al., 2018), maintaining a consistent profile is still challenging for an open-domain dialogue agent. Existing works mainly emphasize the incorporation of attribute information in the generated responses (Wolf et al., 2019; Song et al., 2019a; Zheng et al., 2020). Although these models have improved response consistency by explicitly modeling the profiles, they still face the consistency issue (Welleck et al., 2019). One important reason is that they cannot identify the consistency relations between response and profile.
As shown in Figure 1, the attribute word Beijing is incorporated in the first two responses, but only R1 is semantically consistent with the speaker's profile. For example, R2 "I also hope to visit Beijing one day." implies that the speaker has never been to Beijing, which contradicts the speaker's profile. On the other hand, although R3 does not contain the attribute word Beijing, we can still infer from the words Tsinghua University that the speaker's current location is consistent with the profile. Existing studies (Qian et al., 2018; Zheng et al., 2019) train dialogue agents to produce plausible responses that contain attribute information, but still cannot teach agents to understand the differences between the consistency relations of these responses.

Figure 1: Left: the key-value attribute profiles of the dialogue agent. Right: a dialogue query with different responses that might be related to the attribute profiles. Among these responses, R1 entails the current location profile, while R2 contradicts the profile. Although R3 does not contain the attribute word Beijing, we could still understand R3 entails the current location.

Welleck et al. (2019) made an early step towards reducing dialogue consistency identification to natural language inference (NLI) (Bowman et al., 2015), where they learn a mapping from two dialogue utterances to an entailment category. All utterances in Welleck et al. (2019) are natural sentences from the PersonaChat dataset (Zhang et al., 2018). However, structured attribute profiles, such as key-value pairs, are ubiquitous in real-world dialogue systems (Shum et al., 2018). Compared with natural sentences, structured profiles have fixed attribute keys from different domains and specific attribute values from limited candidates. The structure information is also essential to a better understanding of the profile.
To endow agents with the ability to identify structured profile consistency, we need a new dataset with fine-grained labels between response and profile, as well as a model that can leverage the structure information in the profile.
In this work, we introduce a human-annotated dataset, named Key-value Profile Identification (KvPI), with over 110K single-turn conversations and corresponding attribute profiles. Three representative domains, gender, location, and constellation, are involved in the human annotation. We hire an annotation team to (1) label the relation (entailed, contradicted, or irrelevant) between each conversation and structured profile, and (2) find out the detailed attribute information in each response.
With the annotated KvPI dataset, we set up different baseline models, and propose a key-value structure information enriched BERT (KvBERT) model, which leverages dependency structures in profiles to enrich the contextual representations. Experimental results show that KvBERT obtains significant improvements over strong baselines. We further test the KvBERT model on two downstream tasks, including a reranking task (Welleck et al., 2019) and a consistency prediction task (Dziri et al., 2019). Evaluation results show that (1) the KvBERT reranking improves response consistency, and (2) the KvBERT consistency prediction has a good agreement with human annotation.
Our contributions are summarized as follows:
• A KvPI dataset is introduced, which has over 110K fine-grained consistency annotations between responses and their key-value profiles.
• A KvBERT model is proposed for consistency identification, which gained significant improvements over strong baselines.
• Evaluations on downstream tasks show that the profile consistency identification model could be complementary to dialogue models.

Dataset Preparation
In this section, we describe the collection and annotation process of the KvPI dataset: (1) how we collect high-quality conversations and profiles; (2) how we define the consistency relations between responses and profiles; and (3) how we annotate consistency relations for the collected data.

Data Collection
To study the profile consistency identification problem, we use data from Weibo (footnote 1), a popular and plentiful Chinese social media platform, in which people routinely respond to different posts and have publicly available profiles, such as gender and location. We follow the protocol of the previous profile-based dialogue datasets (Qian et al., 2018; Zheng et al., 2019) to collect Weibo post-response pairs, together with users' available profiles. Here we filter out overly long or short pairs and finally obtain a tuple pool that contains about 30 million tuples, which are in a {profile, post, response} format. Each profile includes three popular attributes, gender, location, and constellation, and is organized in a key-value format, for instance, {gender: female, location: Beijing, constellation: Aquarius}. This format is widely applied in real-world dialogue systems, such as Bowden et al. (2017), Shum et al. (2018), and Pichl et al. (2018). Since our goal is to identify explicit consistency relations between response and profile, we filter out the tuples whose response has no profile-related information by employing a pre-trained classifier and heuristic rules. Finally, we obtain about 150K profile-related tuples after filtering.
Footnote 1: https://en.wikipedia.org/wiki/Sina_Weibo

Consistency Relations
We define three types of consistency relation between the response and the profile under the open-domain dialogue setting, which differ from the entailment categories in natural language inference (Bowman et al., 2015; Welleck et al., 2019):
Entailed The response is exactly talking about the dialogue agent's attribute information, and the attribute is consistent with its key-value profile.
Contradicted Although the response is talking about the dialogue agent's attribute information, it contradicts at least one of the given key-value pairs. For example, given the profile "{location: Beijing}", "I am in Seattle" contradicts the profile, while "She lives in Seattle" does not, because the latter is not talking about the dialogue agent's attribute.
Irrelevant The response contains profile-related information, but the information does not reveal the dialogue agent's own attributes. As exemplified above, "She lives in Seattle" is irrelevant, rather than contradicted, to the dialogue agent's profile "{location: Beijing}". Another example is "I'm interested in the history of Beijing". Although it contains the attribute word "Beijing", this response still does not reveal the dialogue agent's location.
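To make the three relations concrete, the following minimal sketch encodes them as a small data structure. The class and field names (Relation, LabeledResponse, EXAMPLES) are illustrative assumptions and are not part of the released dataset format; the entailed example "I am in Beijing" is invented here as a counterpart to the examples in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """The three consistency relations defined in this section."""
    ENTAILED = "entailed"
    CONTRADICTED = "contradicted"
    IRRELEVANT = "irrelevant"


@dataclass(frozen=True)
class LabeledResponse:
    """Hypothetical container for one (profile, response, relation) item."""
    profile: dict
    response: str
    relation: Relation


PROFILE = {"location": "Beijing"}

EXAMPLES = [
    # Invented example: the agent states its own, matching location.
    LabeledResponse(PROFILE, "I am in Beijing", Relation.ENTAILED),
    # From the text: the agent states its own location, which mismatches.
    LabeledResponse(PROFILE, "I am in Seattle", Relation.CONTRADICTED),
    # From the text: a location is mentioned, but not the agent's own.
    LabeledResponse(PROFILE, "She lives in Seattle", Relation.IRRELEVANT),
    # From the text: the attribute word appears without revealing a location.
    LabeledResponse(PROFILE, "I'm interested in the history of Beijing",
                    Relation.IRRELEVANT),
]

for example in EXAMPLES:
    print(f"{example.relation.value:>12}: {example.response}")
```

Note that the presence of the attribute word alone does not decide the label: the last example contains "Beijing" and is still irrelevant.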

Human Annotation
The definitions in Sec 2.2 are also applied in the human annotation process. We hire an annotation team to (1) review whether the response is profile-related, and (2) annotate the fine-grained information, including consistency labels, domains, and detailed attributes in each response. To ensure quality, each tuple is annotated by three people, and the annotation process lasts nearly four months.
In the annotation process, about 10K tuples are filtered out due to no profile-related information in their responses, and we obtain 140K valid tuples with explicit annotations of consistency relation.

Quality Control
To control the quality of the annotated dataset, we introduce different verification methods. First, during the annotation process, we review 200 randomly sampled tuples every 10,000 annotations. We assign a "gold" label to each sampled tuple and then decide whether the whole annotation batch should be accepted or re-annotated according to the disagreement rate. To tolerate different understandings of a dialogue response, we set an empirical acceptance threshold of 10% on the disagreement rate. For the majority of annotated batches, the disagreement rate varies from 3% to 7%.
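The batch-level check described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and treating a disagreement rate of exactly 10% as acceptable is a guess, since the text does not say whether the threshold is inclusive.

```python
def disagreement_rate(annotated_labels, gold_labels):
    """Fraction of sampled tuples whose annotation differs from the gold label."""
    if not gold_labels or len(annotated_labels) != len(gold_labels):
        raise ValueError("need two non-empty label lists of equal length")
    mismatches = sum(a != g for a, g in zip(annotated_labels, gold_labels))
    return mismatches / len(gold_labels)


def accept_batch(annotated_labels, gold_labels, threshold=0.10):
    """Accept the batch if the sampled disagreement rate is within the threshold.

    Assumption: the comparison is inclusive (<=).
    """
    return disagreement_rate(annotated_labels, gold_labels) <= threshold


# 200 reviewed samples, 10 of which disagree with the gold label (5%).
gold = ["entailed"] * 200
annotated = ["contradicted"] * 10 + ["entailed"] * 190
print(disagreement_rate(annotated, gold), accept_batch(annotated, gold))
```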
The second verification is conducted by paid annotators. Each consistency label is verified by two annotators. The tuples with a low inter-annotator agreement in their labels are directly discarded from the final dataset. Finally, we obtain 118,540 tuples in the KvPI dataset.
From the final dataset, we randomly sampled 2,000 profile-response pairs and assigned them to two new annotators. These pairs are also annotated as entailed, contradicted, or irrelevant, as in the completed annotation process. Following Bowman et al. (2015), we calculated Fleiss' Kappa among the previous labels and the two new labels and obtained a kappa of 0.857, which indicates almost perfect agreement (Landis and Koch, 1977). This result shows that the completed annotation is of good quality.
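For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Fleiss' Kappa for a fixed number of raters per item. It is not the script used to obtain the 0.857 figure; the function name and the toy ratings are illustrative assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists, e.g. [["e", "e", "c"], ...] with one label per rater.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total_ratings = n_items * n_raters
    expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # all ratings fall in one category; treated as full agreement
    return (observed - expected) / (1.0 - expected)


# Toy example with three raters (e.g. the original label plus two new ones).
toy = [["e", "e", "e"], ["e", "e", "c"], ["c", "c", "c"], ["c", "c", "e"]]
print(fleiss_kappa(toy))
```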

The KvPI Dataset
We present some examples of the final KvPI dataset in Table 1. The dataset, together with trained models, will be open-sourced for public usage.

Dataset Organization
The KvPI dataset consists of single-turn conversations and profiles, labeled as entailed, contradicted, or irrelevant. Attributes in the dataset profiles come from three domains, including gender, location, and constellation. The profile is organized in a key-value format, for example, {gender: female, location: Beijing, constellation: Leo}.
Gender This domain includes responses that have evidence indicating they are from men or women. Both explicit gender evidence, such as "I am a girl", and implicit gender evidence, such as "I'm hanging out with my boyfriend", are included.
Location This domain includes responses talking about locations. Besides the accurate matching of locations, data in this domain also needs common sense reasoning, such as whether a city belongs to a province, as shown in the third example in Table 1.
Constellation This domain includes different responses that talk about the constellation. A good number of the responses contain more than one constellation word.
Both entailed and irrelevant cases in the KvPI dataset are directly obtained from the annotation results. To balance the number of cases in each relation, we collect the contradicted cases from two sources: (1) the annotated contradicted tuples, and (2) the rewritten entailed tuples. Possible reasons for the originally contradicted cases are that users may forget to update their profiles, or that they intend to present different information about themselves. Data from the first source accounts for about two-thirds of the total contradicted cases. The other part comes from entailed cases. Their profiles have been rewritten to different attributes, with a minimal edit-distance principle, so that they turn into contradicted cases. Cases from this source are treated as new data in the annotation process. Unqualified rewritten data is discarded.
Table 2 summarizes the main statistics of the KvPI dataset. The first and third groups in Table 2 count the number of unique tuples in the dataset. Here a tuple refers to a group of data consisting of a key-value profile, a post, a dialogue response, as well as the corresponding domain, the annotated attribute, and the label of consistency relation. Tuple examples can be seen in Table 1. The second group only reports the average number of tokens in the dialogue responses.

Problem Definition
To equip dialogue agents with the ability to identify consistency, we need to build a profile consistency identification model. This model learns to identify the relation in {entailed, contradicted, irrelevant} for a (profile, response) pair. Formally, our goal is to learn a mapping function $F$ with $F(P, R) \in \{e, c, i\}$, where $P = \{k_1: v_1, \ldots, k_n: v_n\}$ and $R = w_1, w_2, \ldots, w_m$. Here $P$ denotes the key-value profile with $n$ key-value pairs, and $R$ denotes the response with $m$ words; $e$, $c$, and $i$ denote the entailed, contradicted, and irrelevant relations, respectively.

Motivation
The main challenge of identifying profile consistency lies in how to model the key-value profiles effectively. Such structured profiles have a common dependency structure, which differs from that of natural sentences. For example, from the profile {gender: female, location: Beijing, constellation: Leo}, we can clearly see three dependency relations: female → gender, Beijing → location, and Leo → constellation. Moreover, the keys gender, location, and constellation together define the information in the key-value profile. Here we can see a hierarchical structure of the key-value profiles, as illustrated in Figure 2. More importantly, no matter how the values change, this structure stays unchanged.
Although large pre-trained models such as BERT implicitly capture dependency information to some extent (Clark et al., 2019), we argue that such implicit syntactic information may not be enough to support a powerful contextual representation for reasoning on the highly structured key-value profiles, as suggested by the meaningless dependency parsing results that BERT produces on the structured profiles.
These observations motivate us to incorporate the explicit structure of profiles directly. To this end, we design the KvBERT, which integrates both language representation from BERT and structure representation from tree-LSTM (Zhu et al., 2015).

Model Brief
Figure 2 shows the overall framework of the KvBERT model. On the BERT side, we linearize the key-value pairs into a sequence and treat the response as another sequence (footnote 2). The input embedding is the sum of four embeddings, including an additional type embedding (Chen et al., 2020) to inform the model of different key-value pairs, as shown in Figure 2. Here we omit the well-known formulations of BERT (Devlin et al., 2019) for brevity. We can get a contextual representation for the linearized sequence through the BERT model.
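As a rough illustration of the BERT-side input in this model, the sketch below linearizes a key-value profile and a response into one token sequence with segment ids and type ids. The exact id scheme is not specified in the text, so everything here is an assumption: the function name, the use of one type id per key-value pair (with 0 for all other tokens), and word-level tokens standing in for the real BERT tokenization.

```python
def linearize(profile, response_tokens):
    """Build a (tokens, segment_ids, type_ids) triple for a profile-response pair.

    Assumed scheme: the profile is segment 0 and the response is segment 1;
    the i-th key-value pair gets type id i (starting at 1), everything else 0.
    """
    tokens, segment_ids, type_ids = ["[CLS]"], [0], [0]

    for pair_index, (key, value) in enumerate(profile.items(), start=1):
        for token in (key, value):
            tokens.append(token)
            segment_ids.append(0)
            type_ids.append(pair_index)
    tokens.append("[SEP]")
    segment_ids.append(0)
    type_ids.append(0)

    for token in response_tokens:
        tokens.append(token)
        segment_ids.append(1)
        type_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(1)
    type_ids.append(0)

    return tokens, segment_ids, type_ids


tokens, segment_ids, type_ids = linearize(
    {"gender": "female", "location": "Beijing"},
    ["I", "am", "in", "Beijing"],
)
for row in zip(tokens, segment_ids, type_ids):
    print(row)
```

In a real model, each of these id sequences would index its own embedding table, and the resulting embeddings would be summed with the token and position embeddings.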
On the tree-LSTM side, the profiles are parsed into a predefined structure, as discussed in Sec 4.2. An example of this structure can be seen in the red part of Figure 2. In parallel, the responses are passed to a trained parser to obtain the dependency structure. Then the tree-LSTM encodes the two structures into corresponding embeddings. Three operations are performed to aggregate information from the two embeddings: element-wise multiplication, element-wise difference, and concatenation. The aggregated embedding is followed by a linear layer to form the final structure representation.
Finally, the sentence representation and the structure representation are concatenated to form the joint representation for the final linear output layer.
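The aggregation and concatenation steps can be sketched in plain Python as below. This is a shape-level illustration only: the weights are random, the vectors are placeholders rather than real BERT or tree-LSTM outputs, and the order in which the four parts are concatenated is an assumption. The dimensions 768 and 50 are the BERT hidden size and tree-LSTM output dimension reported in the experiment settings.

```python
import random

BERT_DIM = 768    # BERT-Base hidden size
STRUCT_DIM = 50   # tree-LSTM output dimension


def aggregate(profile_vec, response_vec):
    """Concatenate both embeddings with their element-wise product and difference."""
    product = [p * r for p, r in zip(profile_vec, response_vec)]
    difference = [p - r for p, r in zip(profile_vec, response_vec)]
    return list(profile_vec) + list(response_vec) + product + difference


def linear(vec, weights, bias):
    """A plain linear layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def joint_representation(sentence_vec, profile_vec, response_vec, weights, bias):
    """Concatenate the sentence vector with the projected structure vector."""
    structure_vec = linear(aggregate(profile_vec, response_vec), weights, bias)
    return list(sentence_vec) + structure_vec


rng = random.Random(0)
sentence_vec = [rng.uniform(-1, 1) for _ in range(BERT_DIM)]
profile_vec = [rng.uniform(-1, 1) for _ in range(STRUCT_DIM)]
response_vec = [rng.uniform(-1, 1) for _ in range(STRUCT_DIM)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(4 * STRUCT_DIM)]
           for _ in range(STRUCT_DIM)]
bias = [0.0] * STRUCT_DIM

joint = joint_representation(sentence_vec, profile_vec, response_vec, weights, bias)
print(len(joint))  # 768 + 50 = 818
```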

The Dependency Structures
In our model, the dependency structure for profiles is predefined, and for the response, it is obtained from a trained parser. To complete the structure in the profile, we add a special [KV] token on top of the dependency structure of the profile. As a result, the [KV] token aggregates information from its child key-value nodes. In contrast, there is no universal dependency structure in the responses. To obtain the structures in the responses, we trained a parser on CDT5.0 (Chinese Dependency Treebank), achieving unlabeled and labeled attachment scores of 90.72% and 88.38%, respectively. All structure predictions are made in the data preprocessing stage.
Footnote 2: Our data collection scheme ensures that all responses contain profile information, which frees us from modeling the post.
A tree-LSTM unit encodes multiple child units or multiple descendant units in a recursive process. Due to the length limit, we refer readers to Zhu et al. (2015) for details. For both the predefined structures and the parsed structures, we apply the same depth-first encoding strategy, from every leaf node to the root node, to aggregate the structure information.
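The leaf-to-root order in which the predefined profile structure would be visited can be sketched as follows. This shows only the traversal, not the gated tree-LSTM unit itself; the Node class and function names are illustrative assumptions, and the tree shape (values under keys, keys under [KV]) follows the dependency relations described in Sec 4.2.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_profile_tree(profile):
    """Predefined structure: each value hangs under its key, each key under [KV]."""
    root = Node("[KV]")
    for key, value in profile.items():
        root.children.append(Node(key, [Node(value)]))
    return root


def encoding_order(node):
    """Depth-first post-order: every child is visited before its parent."""
    order = []
    for child in node.children:
        order.extend(encoding_order(child))
    order.append(node.label)
    return order


tree = build_profile_tree(
    {"gender": "female", "location": "Beijing", "constellation": "Leo"}
)
print(encoding_order(tree))
```

A recursive encoder would compute each node's hidden state at the point where its label is appended, so the [KV] root is encoded last and sees all key-value pairs.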

Experiments
In this section, we first evaluate the performance of the proposed KvBERT model on identifying profile consistency. After that, we test the trained KvBERT model on two downstream tasks, including a reranking task and a consistency prediction task, to analyze how well the proposed approach performs under practical applications.

Experiment Settings
In our experiments, we train the KvBERT based on the 12-layer BERT-Base-Chinese model, with an embedding and hidden dimension of 768. For the tree-LSTM, we set the embedding size to 300 and the output dimension to 50. The dimension of the final representation is therefore 818 (768 + 50). The tree-LSTM is first pre-trained on the KvPI dataset for 13 epochs and then jointly fine-tuned with BERT representations for 3 epochs. The KvBERT model is implemented in PyTorch. More setting details are in the appendix.

Identifying Profile Consistency
We compare the performance of a variety of baseline models on identifying profile consistency:
Feature-based classifier Our goal in setting this baseline was to better understand the difficulty of identifying profile consistency, rather than to provide a state-of-the-art model. Here we choose SVM as the classifier, with unigram and bigram features, i.e., SVM+uni+bi. Additionally, the overlaps between profile values and responses are extracted as another feature, which gives SVM+uni+bi+overlap.
RNN-based NLI model ESIM (Chen et al., 2017) is a powerful natural language inference model, which enhances the interactions in the LSTM. This model was applied in Welleck et al. (2019) and achieved the best results. Therefore, we set ESIM as the RNN baseline for our experiments.
Pretrained models Large pre-trained transformers have been shown effective for natural language understanding tasks. We choose the Generative Pre-trained Transformer, i.e. GPT (Radford et al., 2018), and Bidirectional Encoder Representations from Transformers, i.e. BERT (Devlin et al., 2019) as our pre-trained baselines. Chen et al. (2020) proposed a TableBERT model, which models structured table information within the BERT framework. We take this model as another pre-trained baseline. We did not explore other pre-trained models in this work, due to the expensive computational costs in preparing their Chinese models. We leave the exploration as future work.
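To give a sense of the simplest baseline, the sketch below extracts unigram, bigram, and overlap features of the kind used by SVM+uni+bi+overlap. The exact feature definitions are not given in the text, so this is an assumption throughout: n-grams are taken over response tokens only, and the overlap feature counts how many profile values appear as tokens in the response.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def extract_features(profile, response_tokens):
    """Sparse feature dictionary for a (profile, response) pair."""
    features = Counter()
    for gram, count in ngrams(response_tokens, 1).items():
        features["uni=" + " ".join(gram)] = count
    for gram, count in ngrams(response_tokens, 2).items():
        features["bi=" + " ".join(gram)] = count
    # Assumed overlap definition: number of profile values found in the response.
    features["overlap"] = sum(
        1 for value in profile.values() if value in response_tokens
    )
    return features


features = extract_features({"location": "Beijing"}, ["I", "am", "in", "Beijing"])
print(dict(features))
```

Such a dictionary could then be vectorized and passed to any linear classifier; the overlap feature alone cannot separate entailed from irrelevant cases, which is part of why this baseline is weak.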
Considering that the previous works are designed for natural sentences, for the sake of a fair and thorough comparison, we use templates to convert the key-value profiles into natural sentences. The methods evaluated on the converted dataset are marked by the suffix "-template". The comparative experiments on the original KvPI dataset are marked by "-kv", which linearizes the original key-value profiles in the same way as in Sec 4.3. Other models are directly evaluated on the original KvPI dataset.
For evaluation, besides the whole dataset that includes all three domains, we are also interested in the model's performance on each individual domain (footnote 3). We use accuracy (acc), which has been widely applied in natural language inference tasks, to measure the overall performance on each domain. To take a closer look at the model's ability to identify different consistency relations, we also calculate the F1-score of the three relations under the same domain, i.e., entail-f1, contr-f1, and irrelv-f1. The accuracy and F1-score are calculated using toolkits from sklearn.
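The two metrics can be written out in a few lines of plain Python, which may help clarify what entail-f1, contr-f1, and irrelv-f1 measure. This is a dependency-free sketch for illustration and not the sklearn calls used in the experiments; the label strings and toy predictions are made up.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f1_for_label(gold, predicted, label):
    """F1-score of a single relation, treating it as the positive class."""
    true_pos = sum(g == label and p == label for g, p in zip(gold, predicted))
    false_pos = sum(g != label and p == label for g, p in zip(gold, predicted))
    false_neg = sum(g == label and p != label for g, p in zip(gold, predicted))
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


gold = ["entail", "entail", "contr", "irrelv", "irrelv", "contr"]
predicted = ["entail", "contr", "contr", "irrelv", "entail", "contr"]

print("acc", accuracy(gold, predicted))
for label in ("entail", "contr", "irrelv"):
    print(label + "-f1", f1_for_label(gold, predicted, label))
```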
We report the averaged best results of three different runs on each domain in Table 3. With the explicit modeling of profile structures, our KvBERT achieves the best performance on all metrics across all domains. More importantly, KvBERT is the only model whose metrics are all over 90% on the KvPI test set, even compared with strong pre-trained baselines. Moreover, we also obtain a 3.1% absolute improvement in overall accuracy over the latest TableBERT model (Chen et al., 2020).
We noticed an interesting phenomenon between BERT-kv and BERT-template: the performance of BERT-template on all three individual domains is better than that of BERT-kv. Nevertheless, on the overall test set, their performances are entirely reversed. One possible reason is that the converted profile loses the structure information. Even for the powerful BERT model, this kind of information still affects the overall performance.

Testing on Downstream Tasks
Now that the KvBERT achieves good performance on the KvPI dataset, we want to test the abilities of the proposed approach further. Similar to the evaluations of pre-trained language models, we evaluate the abilities of our trained KvBERT model on two downstream tasks, with the assistance of human annotation.
Here we consider two types of dialogue models, i.e., retrieval models and generation models. We test the KvBERT on two tasks: (1) reranking the top 20 responses from a retrieval model, to see whether the profile consistency is improved (Welleck et al., 2019); and (2) given the responses from state-of-the-art generative dialogue models, checking how well the KvBERT's consistency prediction agrees with the human annotation (Dziri et al., 2019). To build the testbeds of different dialogue models, we use the Chinese PersonalDialog (Zheng et al., 2019) dataset, which consists of over 20 million dialogues from Weibo, together with diversified profile traits and interest tags of the users.
Footnote 3: Models on each domain are trained separately.
Further, we manually create 100 test samples for each domain, and we abbreviate the test set in this section as Gen (gender), Loc (location), and Con (constellation). Thus there are 300 test samples in total. Each test sample consists of a (profile, post) pair, where the attribute keys are the same as in the KvPI dataset. Moreover, we confirm that these posts will lead to domain-specific responses.

Task I: Reranking Retrieved Responses
We build the retrieval model using pylucene. To retrieve responses, we index both profiles and responses in the PersonalDialog dataset, with weights 0.15 and 0.85 for the profile and response, respectively. We retrieve the top 20 candidate responses for each test sample, and then these responses are reranked by the trained KvBERT model, according to the order Entailed > Irrelevant > Contradicted. Within the same category, the model confidence determines the order. Among the 20 responses for one test sample, the top 5 responses, both before and after reranking, are annotated by three people as entailed (Entail), contradicted (Contr), or irrelevant (Irrelv).
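The reranking rule can be sketched as a two-level sort. The candidate format, the label strings, and the function name are illustrative assumptions, and so is the choice to sort by descending confidence inside every category, since the text only states that confidence determines the order.

```python
# Lower value means higher priority: Entailed > Irrelevant > Contradicted.
PRIORITY = {"entailed": 0, "irrelevant": 1, "contradicted": 2}


def rerank(candidates):
    """Rerank (response, predicted_label, confidence) triples.

    Primary key: relation priority. Secondary key: confidence, descending
    (an assumption about how ties within a category are broken).
    """
    return sorted(candidates, key=lambda c: (PRIORITY[c[1]], -c[2]))


retrieved = [
    ("I also hope to visit Beijing one day.", "contradicted", 0.90),
    ("I live in Beijing.", "entailed", 0.60),
    ("She lives in Seattle.", "irrelevant", 0.80),
    ("I study at Tsinghua University.", "entailed", 0.95),
]

for response, label, confidence in rerank(retrieved):
    print(f"{label:>12} {confidence:.2f}  {response}")
```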
We report the statistics of annotation results in Table 4 and show some reranking examples in the appendix. Besides the entailed responses, the irrelevant ones are more acceptable than the contradicted ones. As we can see, the KvBERT reranking improves profile consistency, either by increasing the rate of entailment or by decreasing the rate of contradiction. The annotation results also concur with our intuition: selecting a proper response with the right location is difficult for the retrieval models.

Task II: Consistency Prediction
In this task, we want to test how well the KvBERT's consistency prediction agrees with the human annotation on generated responses. We implement two state-of-the-art profile-based dialogue generation models as the testbeds for this task, namely TransferTransfo (Wolf et al., 2019) (TT) and AttentionRouting (Zheng et al., 2020) (AR). Both models are based on pre-trained transformers. First, we pre-train the two models on 4G Chinese news data and fine-tune them on the PersonalDialog dataset. Then we use the trained models to generate responses on the test data Gen, Con, and Loc, respectively. The collected responses are annotated as entailed, contradicted, or irrelevant by three annotators. The annotation instructions are the same as in Sec 2.2. In parallel, the KvBERT also predicts the relations between each profile and response.
We first report the f1-score of model prediction against the human annotation in Table 5. We also report Cohen's Kappa (Cohen, 1960) between human annotations and model prediction to measure their agreements directly. All metrics are calculated by sklearn. From the f1-scores, we can see that the model predictions are similar to the human annotations in most cases. And the κ coefficients show the good agreements more directly, where κ between 0.6 and 0.8 indicates substantial agreement, and over 0.8 indicates almost perfect agreement (Landis and Koch, 1977).
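For reference, Cohen's Kappa between two label sequences can be computed as in the sketch below. It is a plain-Python illustration of the statistic rather than the sklearn call used in the experiments, and the toy labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


human = ["entailed", "entailed", "contradicted", "contradicted"]
model = ["entailed", "contradicted", "contradicted", "contradicted"]
print(cohen_kappa(human, model))
```

Unlike raw accuracy, the statistic discounts the agreement that two labelers would reach by chance given how often each label is used.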
Responses from the generative models are in a different distribution from the training data, due to the model learning process. Still, the KvBERT obtains good agreements with humans. It shows the good generalization ability of the proposed method.

Effects of the Structure Information
Another important question is whether the structure information is always helpful. To analyze this, we sampled 9 tree-LSTM checkpoints, with accuracy on the KvPI test set ranging from 13.4% to 83.4%. The accuracy could be an indicator of how well the structure information has been captured. Then we trained 9 different KvBERT models with initialization from the 9 tree-LSTMs and obtained their final accuracies on the KvPI test set. We depict the tree-LSTM accuracy and KvBERT accuracy, as well as a seventh-degree polynomial curve fitting the 9 data points, in Figure 3. The dashed horizontal line marks a performance baseline that has no structure information.

Figure 3: The red dashed line in the horizontal direction is the TableBERT accuracy, which has no structural information. The depicted curve is fitted by a seventh-degree polynomial.

As we can see, not all structural information contributes to the final performance. When the tree-LSTM is at a low accuracy, the performance of the KvBERT model is inferior to that of the baseline model. In particular, when the accuracy of the tree-LSTM is lower than 30%, the final performance even gets worse as the accuracy of the tree-LSTM grows. Only when the accuracy of the tree-LSTM is higher than about 80% can the final performance be improved, as illustrated in Figure 3.

Related Work
This work is closely related to research on natural language inference (Bowman et al., 2015). NLI aims to determine whether a natural language hypothesis can be inferred from a natural language premise (Bowman et al., 2015; Williams et al., 2018; Khot et al., 2018; Welleck et al., 2019). Besides natural language evidence, Suhr et al. (2017) and Suhr et al. (2019) proposed to use images as the evidence for statement verification under the multi-modal setting. A more recent related work is Chen et al. (2020), who proposed to use semi-structured Wikipedia tables as evidence. The difference between our work and Chen et al. (2020) is noticeable: open-domain dialogues have unique language patterns, and the key-value profiles are highly structured, as analyzed in Sec 4.2. To the best of our knowledge, this is the first work that explores the identification of consistency between dialogue responses and structured profiles.
Another line of research related to this work is the personalized dialogue generation task (Zhang et al., 2018; Qian et al., 2018; Zheng et al., 2019; Song et al., 2019b, 2020). This task seeks to improve personality consistency by incorporating persona information in the generated responses. For this purpose, several personalized dialogue datasets have been introduced in recent years, such as PersonaChat (Zhang et al., 2018) and PersonalDialog (Zheng et al., 2019). These datasets successfully inform models of how to incorporate attribute-related information in the responses, but still cannot teach models how to identify the consistency relations between their response and profile.

Conclusion and Discussion
In this work, we introduce a large-scale annotated dataset to facilitate the study of profile consistency identification in open-domain dialogues. We leverage the structure information in profiles to enrich the BERT representations and obtain significant performance improvements over strong baselines. We further test the proposed method on two downstream tasks. Evaluation results show the effectiveness of the proposed approach.
We believe KvPI will be a useful resource for research on open-domain dialogue consistency. Although there have been many dialogue generation models in this field, most of them still cannot understand the consistency relationship in the generation process. One of the major bottlenecks is the lack of data. Because the KvPI dataset has paired key-value profiles and dialogues, it can also be a high-quality resource for personalized dialogue generation tasks. Furthermore, because we have fine-grained consistency labels, this dataset also provides an opportunity to leverage natural language understanding models to assist dialogue generation models. We hope that the data will aid in training dialogue agents to be more consistent.