Personal Information Leakage Detection in Conversations

The global market size of conversational assistants (chatbots) is expected to grow to USD 9.4 billion by 2024, according to MarketsandMarkets. Despite the wide use of chatbots, leakage of personal information through chatbots poses serious privacy concerns for their users. In this work, we propose to protect personal information by warning users of detected suspicious sentences generated by conversational assistants. The detection task is formulated as an alignment optimization problem, and a new dataset, PERSONA-LEAKAGE, is collected for evaluation. In this paper, we propose two novel constrained alignment models, which consistently outperform baseline methods on PERSONA-LEAKAGE. Moreover, we conduct an analysis of the behavior of recently proposed personalized chit-chat dialogue systems. The empirical results show that those systems suffer more from personal information disclosure than the widely used Seq2Seq model and the language model. In those cases, a significant number of information-leaking utterances can be detected by our models with high precision.


Introduction
According to Opus Research, 4.5 billion dollars will be invested in conversational assistants (chatbots) by 2021. Among diverse types of chatbots, Google Duplex, first introduced at Google I/O 2018, represents the kind of AI personal assistants (PAs) that act on behalf of people to perform simple tasks, such as making reservations at restaurants and hair salons. In order to successfully complete those tasks, PAs are granted access to personal information (PI) of their owners, such as number of children, working hours, home address, and vacation plans. Thus, these PAs pose privacy concerns when they communicate with real-life people, or other bots, in natural language.

[Figure 1: Given utterances (U) and personal information descriptions (P) from a conversational assistant (a), the PILD module (b) detects risky utterances with corresponding personal information and sends a warning (red arrow) to an authorized user (c). The authorized user manually approves or rejects the utterances. Then, only the approved utterances (green arrow) are sent to interlocutors (d), who could be authorized or malicious.]
Another major source of personal information leakage is online social networks, which store a huge amount of possibly sensitive information on users and their interactions (Zhang et al., 2010). However, a recent study shows that none of the popular social network platforms (Facebook, WeChat, Google+, etc.) has developed a perfectly non-leaky privacy protection mechanism (Yu et al., 2018). In addition, internet users (including a vast number of children and teenagers) often exhibit a phenomenon called the privacy paradox: even users with a high level of privacy concern do not always take appropriate protective actions, although those measures are fairly easy to perform (Norberg et al., 2007). As an unfortunate example, children's privacy is often unconsciously compromised by their parents' online behavior, such as online posting and messaging (Minkus et al., 2015).
An ideal privacy protection solution is not to stop using PAs or discourage online socialization, but to have the ability to control the dissemination of personal information (Yu et al., 2018). Personal information can be dispersed through various types of media. In this work, we focus on natural language utterances in conversations articulated by PAs or humans. The ways of controlling such textual information vary significantly w.r.t. platforms, PAs, user preferences, and social circles. Since there is no universally applicable control strategy, we take the first step towards privacy protection by designing a Personal Information Leakage Detection module (PILD) that warns users or alerts PAs whenever an utterance is associated with personal information, as illustrated in Figure 1. The warning module gives authorized users the capability to control information leakage from the start. Then, it is up to users and the design of PAs to decide how they deal with utterances leaking personal information. PAs will communicate with other interlocutors using secure or approved utterances.
We formulate detection of utterances causing personal information leakage as a text alignment problem, which aims to link information-leaking utterances to the corresponding textual descriptions of personal information. We consider personal information provided in text, because i) user profiles on popular social network platforms include a significant proportion of textual descriptions, and ii) it is natural for users to share their information with PAs in natural language. Figure 2 demonstrates an example of aligning utterances in a dialogue with a set of personal information descriptions. The red lines depict the ground-truth alignments between utterances and personal information descriptions. The true alignments are sparse, as not all utterances leak personal information, e.g., U1, U3 and U6. Meanwhile, an utterance may be associated with more than one description of personal information, e.g., U2 and U4, and vice versa.
In the absence of direct supervision signals, we explore low-annotation-cost solutions to this text alignment problem by considering a weakly supervised setting. In this setting, during training we only know who speaks what and what the PI descriptions of each interlocutor are, without knowing the true alignments. Additional challenges are imposed by the complex relationships between utterances and descriptions of PI: alignments can be sparse, and the mapping can be one-to-one, one-to-many, many-to-one, or many-to-many.
To address the aforementioned challenges, we propose two models, SHARP-MAX and SPARSE-MAX, by formulating the text alignment problem as constrained optimization problems. The training procedure takes the form of contrastive learning (Mnih and Kavukcuoglu, 2013; Dai and Lin, 2017). Herein, we encourage aligning an utterance with the descriptions of its interlocutor subject to sparsity constraints, while penalizing its alignments with those of other speakers. Thus, sentence-level alignments are not employed during training.
The main contributions are the following:

• We propose to protect privacy in conversation using PILD. Due to the lack of datasets for the new task, we construct a testing dataset, PERSONA-LEAKAGE, by extending the test set of the personalized dialogue corpus PERSONA (Zhang et al., 2018) with alignment annotations through crowdsourcing.
• Under the weakly supervised setting, we propose two novel alignment models, SHARP-MAX and SPARSE-MAX, which leverage coarse-grained alignment signals to deliver sparse solutions. Our experiments on PERSONA-LEAKAGE show that our models outperform competitive baselines.
• We empirically evaluated four representative dialogue models as PAs on PERSONA-LEAKAGE by letting them act as one of the interlocutors in a dialogue. We found that more advanced dialogue models are prone to leak a higher proportion of the personal information of the interlocutors they represent. Our PILD module works well on recently proposed dialogue agents.

Alignment Models
In this section, we formally define the problem of PI leakage detection as text alignment between utterances and descriptions of PI in the weakly supervised setting, followed by presenting the architecture shared by the two proposed alignment models SPARSE-MAX and SHARP-MAX. The two models differ in the sparsity regularization for alignments during training. We then detail the training algorithms as well as how to derive the regularizers.

Problem Statement
A dialogue between two interlocutors A and B is composed of two sets of utterances $U_A$ and $U_B$. The corresponding persona profiles $P_A$ and $P_B$ are two sets of PI descriptions. A personalized dialogue dataset $\mathcal{D} = \{\langle U_i, P_i \rangle \mid i = 1, 2, \cdots, N\}$ consists of pairs $\langle U_i, P_i \rangle$ associated with the same interlocutor $i$ in a conversation, where $U_i = \{u_{i,j} \mid j = 1, 2, \cdots, n_i\}$ and $P_i = \{p_{i,k} \mid k = 1, 2, \cdots, m_i\}$.
In the weakly supervised setting, a pair $\langle U_i, P_i \rangle$ from the same interlocutor provides a set-level training signal for learning an alignment between the utterance set and the PI description set. An alignment is a set of links between an utterance set and a description set; it can also be viewed as identifying the edges of a bipartite graph between the two sets of vertices $U_i$ and $P_i$. In the absence of alignment annotations during training, we relax the problem by learning the alignment strength between $u_{i,j}$ and $p_{i,k}$ as an association score $a_{i,j,k}$; these scores constitute an association matrix $A_i \in \mathbb{R}^{n_i \times m_i}$ for each $\langle U_i, P_i \rangle$. Then, it is up to the system design of a PA or the preference of an interlocutor to decide whether an association score indicates that $p_{i,k}$ is leaked through $u_{i,j}$. For example, one can check whether $a_{i,j,k}$ is above a pre-specified threshold.
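As a minimal illustration of this final decision step, the sketch below flags utterance-PI pairs whose association score exceeds a pre-specified threshold; the threshold value and function name are illustrative assumptions, not part of the proposed models.

```python
import numpy as np

def flag_leaks(A, threshold=0.5):
    """Flag candidate leaks in an association matrix A (n_i x m_i):
    return (utterance index j, description index k) pairs whose
    association score a_{i,j,k} exceeds the threshold."""
    js, ks = np.where(A > threshold)
    return list(zip(js.tolist(), ks.tolist()))

# Example: 3 utterances vs. 2 PI descriptions
A = np.array([[0.1, 0.8],
              [0.2, 0.3],
              [0.9, 0.7]])
print(flag_leaks(A))  # [(0, 1), (2, 0), (2, 1)]
```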

Model Architecture
Recent advances in pre-trained language models, such as BERT (Devlin et al., 2019), demonstrate their strength in encoding semantic information into the produced text representations. Thus we apply a pre-trained language model $f(\cdot)$ (BERT in this work) to convert each utterance and each PI description into a representation vector. As is widely accepted practice, we take the representation of the [CLS] token to represent an input text. Then, we apply a projection matrix $M$ to map those vectors into a semantic space shared by utterances and PI descriptions. The association score between an utterance $u_{i,j}$ and a PI description $p_{i,k}$ is calculated as the cosine similarity between their projected representations:

$$a_{i,j,k} = \frac{\langle M f(u_{i,j}),\, M f(p_{i,k}) \rangle}{\|M f(u_{i,j})\| \cdot \|M f(p_{i,k})\|},$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors. As we freeze the parameters of BERT in both training and testing, the only tunable parameter of this model is the matrix $M$.
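A minimal sketch of this encoder is given below, assuming the Hugging Face transformers API (an assumption; the paper does not specify its implementation). The linear layer stands in for the projection matrix M and is the only trainable component.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()  # BERT is frozen in both training and testing

d = bert.config.hidden_size
M = torch.nn.Linear(d, d, bias=False)  # projection matrix M: only tunable parameters

def encode(texts):
    """Map texts to projected [CLS] representations."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = bert(**batch).last_hidden_state[:, 0]  # [CLS] vectors
    return M(cls)

def association_matrix(utterances, descriptions):
    """Cosine similarities between all utterance/PI-description pairs."""
    U = torch.nn.functional.normalize(encode(utterances), dim=-1)
    P = torch.nn.functional.normalize(encode(descriptions), dim=-1)
    return U @ P.T  # A_i, shape (n_i, m_i)
```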

Model Training
Learning an association matrix between an utterance set and a PI description set in the weakly supervised setting imposes two challenges. First, there is no ground-truth label to guide the alignment training. Second, an utterance may indicate zero, one, or multiple PI descriptions, while a PI description may also be associated with a varying number of utterances.
Loss. To address the first challenge, we make two observations: i) a linked utterance-PI pair has high semantic relatedness; ii) the utterances in a dialogue are much more likely to correlate with the PI of their own interlocutor than with that of other interlocutors. The latter observation provides set-level alignment signals for contrastive learning. In light of this, we maximize the set-level aggregated association scores for utterance-PI pairs from the same interlocutor $\langle U_i, P_i \rangle$, while minimizing those scores for pairs from different interlocutors, $\langle U_i, \bar{P} \rangle$ and $\langle \bar{U}, P_i \rangle$. The second challenge imposes sparsity over the links in alignments. As it is difficult to force representation-based cosine similarity values to approach zero, we introduce an alignment weight $w_{i,j,k}$ for each utterance-PI pair during training. The weight matrix $W_i = \{w_{i,j,k}\}_{n_i \times m_i}$ puts a focus on the more reliable utterance-PI pairs and reduces the influence of irrelevant links. Then, the similarity between $U_i$ and $P_i$ is the weighted sum of all elements in $A_i$:
$$\mathrm{sim}(U_i, P_i) = \mathbf{1}^\top (W_i \odot A_i)\, \mathbf{1}, \qquad (3)$$

where $\odot$ denotes the Hadamard product. High weights in $W_i$ enhance the corresponding association scores during training, while low or zero weights in $W_i$ discourage the participation of the corresponding scores.
By putting the two ideas together, the loss for the $i$-th training sample is defined as a margin-based contrastive loss:

$$L(U_i, P_i) = \max\big(0,\, \alpha - \mathrm{sim}(U_i, P_i) + \mathrm{sim}(U_i, \bar{P})\big) + \max\big(0,\, \alpha - \mathrm{sim}(U_i, P_i) + \mathrm{sim}(\bar{U}, P_i)\big), \qquad (4)$$

where $\bar{U}$ and $\bar{P}$ are randomly sampled from $\mathcal{D}$, and $\alpha$ is a hyper-parameter controlling the margin of the loss. The loss on the training set is then the sum of all example losses, $L(\mathcal{D}) = \sum_{i=1}^{N} L(U_i, P_i)$.

Sparsity. The two models SHARP-MAX and SPARSE-MAX differ in the regularizers used in $\mathrm{sim}(U_i, P_i)$ for learning sparse weight matrices $W_i$. The matrices $W_i$ are expected to assign zero or low weights to irrelevant pairs, while assigning high weights to the aligned pairs. They are obtained by solving a constrained optimization problem of the following form:

$$\max_{W_i} \; \mathbf{1}^\top (W_i \odot A_i)\, \mathbf{1} + \gamma\, \Omega(W_i) \quad \text{s.t.} \; \sum_{j,k} w_{i,j,k} = 1, \; w_{i,j,k} \geq 0, \qquad (5)$$

where $\Omega(W_i)$ is a regularization term that determines the sparsity of $W_i$, and $\gamma \in \mathbb{R}^+$ adjusts the degree of regularization. If $\gamma \to 0$, the solution of the above problem assigns weight 1 to the maximal value in $A_i$. As we expect more than one link in an alignment, the regularizer should encourage more non-zero entries in $W_i$. If $\gamma \to +\infty$, the solution is a set of equal weights, which aggregates $A_i$ by averaging all association scores.
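Under the reconstruction of Eqs. (3) and (4) above, the training objective can be sketched as follows; the margin value and the way negative sets are paired are assumptions consistent with the description, not verified implementation details.

```python
import torch

def sim(W, A):
    """Eq. (3): weighted sum of association scores, 1^T (W ⊙ A) 1."""
    return (W * A).sum()

def sample_loss(A_pos, W_pos, A_up, W_up, A_un, W_un, alpha=0.2):
    """Eq. (4): margin loss for one sample <U_i, P_i>.
    (A_up, W_up) pairs U_i with a randomly sampled persona P-bar;
    (A_un, W_un) pairs a randomly sampled utterance set U-bar with P_i."""
    pos = sim(W_pos, A_pos)
    return (torch.clamp(alpha - pos + sim(W_up, A_up), min=0.0)
            + torch.clamp(alpha - pos + sim(W_un, A_un), min=0.0))
```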
SHARP-MAX utilizes entropy as the regularizer, $\Omega(W_i) = -\sum_{j,k} w_{i,j,k} \log w_{i,j,k}$, because the uniform distribution achieves the maximum of entropy. In other words, this term encourages similar entries in $W_i$.

Proposition 1. With the entropy regularizer in Eq. (5), the solution for $W_i$ is the following softmax function with temperature $\gamma$:

$$w_{i,j,k} = \frac{\exp(a_{i,j,k} / \gamma)}{\sum_{j',k'} \exp(a_{i,j',k'} / \gamma)}.$$

Proof idea: The solution is derived by solving the Lagrangian of Eq. (5). Note that, when the temperature $\gamma < 1$ is sufficiently small, the optimal $W_i$ enlarges the differences between the values in $A_i$ (SHARP-MAX). If $\gamma = 1$, we obtain the conventional softmax, which is also referred to as SOFT-MAX in our experiments.

SPARSE-MAX considers the squared norm of $W_i$ as the regularizer, $\Omega(W_i) = -\frac{1}{2}\|W_i\|_2^2$, as it controls the sparsity of the matrix while encouraging equal contributions.

Proposition 2. Let $\gamma = 1$; with the squared-norm regularizer, the solution for $W_i$ is the sparsemax of $A_i$, i.e., the Euclidean projection of $A_i$ onto the probability simplex (Martins and Astudillo, 2016).
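Both weight computations admit closed-form solutions and can be sketched as follows; SPARSE-MAX is implemented here as the simplex projection of Martins and Astudillo (2016), applied over the flattened association matrix, which is our reading of the proposition above rather than the authors' released code.

```python
import numpy as np

def sharp_max(A, gamma=0.5):
    """SHARP-MAX: softmax over all entries of A with temperature gamma.
    gamma=1 recovers the conventional SOFT-MAX; gamma<1 sharpens."""
    z = A.flatten() / gamma
    z = z - z.max()                      # numerical stability
    w = np.exp(z)
    return (w / w.sum()).reshape(A.shape)

def sparse_max(A):
    """SPARSE-MAX: Euclidean projection of the flattened A onto the
    probability simplex; irrelevant pairs receive exactly zero weight."""
    z = np.sort(A.flatten())[::-1]       # scores in descending order
    css = np.cumsum(z)
    k = np.arange(1, z.size + 1)
    support = z * k > (css - 1)          # entries kept in the solution
    k_max = k[support][-1]
    tau = (css[support][-1] - 1) / k_max # threshold subtracted from scores
    return np.maximum(A.flatten() - tau, 0.0).reshape(A.shape)
```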

PERSONA-LEAKAGE Dataset
In order to evaluate models under the weakly supervised setting, we constructed a dataset, PERSONA-LEAKAGE, as the test set by annotating the test set of the personalized dialogue corpus PERSONA (Zhang et al., 2018). In that corpus, each dialogue is held between two human interlocutors, where each interlocutor is characterized by three to five descriptions of PI. A description of PI describes one aspect of that person, e.g., 'I am a handyman'. For each dialogue, we collected link candidates by pairing each utterance of an interlocutor with each description of their PI. As a result, we constructed a set of link candidates for each interlocutor in a dialogue. For each link candidate, we asked three annotators to judge whether the utterance indicates the corresponding PI description. A candidate was considered aligned if at least two annotators agreed on that decision. In total, we annotated alignments for 968 dialogues, in which there are 6,894 aligned utterance-PI pairs out of 67,601 candidate pairs. Moreover, in order to understand user perception of the sensitivity of PI, we collected the set of all PI descriptions in the test and dev sets of PERSONA, and asked five annotators to judge whether the descriptions were sensitive or not. A PI description is considered sensitive if annotators would suggest not sharing it with strangers, given that it describes their friends. We collected 306 descriptions (31.48% of all 972 descriptions) with more than 2 sensitive annotations (Appendix B describes more details about data collection).

Baselines
We apply the scoring functions of two widely used information retrieval (IR) methods, TF-IDF and BM25 (Manning et al., 2008; Robertson and Zaragoza, 2009), and the most recent BERT-based IR (Dai and Callan, 2019), to measure the association between a PI description and an utterance.
We also consider the following competitive alignment models proposed in recent works.
• MEAN averages the contributions of the association matrix, i.e., uniform weights $\frac{1}{n_i \cdot m_i}$. We consider MEAN as the solution of a special case of our optimization problem with $\gamma \to +\infty$.
• Avg-Max uses the average of the maximum similarity scores over all PI descriptions (Avg-Max-P) or utterances (Avg-Max-U).
The weights of all alignment models are normalized to sum to one, as illustrated in the sketch below.
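For concreteness, the baseline weighting schemes described above can be sketched as follows; the exact construction of the Avg-Max weights is our reading of the description and should be treated as an assumption.

```python
import numpy as np

def mean_weights(A):
    """MEAN: uniform weights 1/(n_i * m_i), i.e., the gamma -> +inf case."""
    return np.full(A.shape, 1.0 / A.size)

def avg_max_p_weights(A):
    """Avg-Max-P: weight the highest-scoring utterance (row) for each
    PI description (column), normalized to sum to one."""
    W = np.zeros_like(A, dtype=float)
    W[A.argmax(axis=0), np.arange(A.shape[1])] = 1.0
    return W / W.sum()

def avg_max_u_weights(A):
    """Avg-Max-U: analogous, with the best PI description per utterance."""
    W = np.zeros_like(A, dtype=float)
    W[np.arange(A.shape[0]), A.argmax(axis=1)] = 1.0
    return W / W.sum()
```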

Model Setting
In order to have a fair comparison, all alignment models share the same deep learning architecture, which is composed of i) a pre-trained text representation model (BERT), ii) a learnable linear transformation layer, and iii) a weight computation module without back-propagation. We evaluate the models by testing whether the alignment links between sets are correctly retrieved from all candidate links, following (Hessel et al., 2019). Given the ground-truth alignment between two sets, we evaluate the association matrix $A_i$ using precision at K (P@K), R-Precision (Rprec), normalized discounted cumulative gain (NDCG), and mean average precision (MAP). In addition, we use the Hellinger distance (H-Dist) (Oosterhoff and van Zwet, 2012) to quantify how well the alignment weights $W_i$ match the ground-truth alignment weights $G_i = \{g_{i,j,k}\}_{n_i \times m_i}$, where $g_{i,j,k}$ is normalized over $j, k$ to sum to one.
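Since both weight matrices are normalized to sum to one, they can be treated as flat probability distributions, and the Hellinger distance reduces to a one-line computation; a minimal sketch, assuming that reading:

```python
import numpy as np

def hellinger_distance(W, G):
    """H-Dist between predicted weights W and ground-truth weights G;
    both matrices are assumed non-negative and normalized to sum to one."""
    return np.sqrt(0.5 * np.sum((np.sqrt(W) - np.sqrt(G)) ** 2))
```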

Collection of Human-Bot Dialogue
To evaluate the performance of our model on chatbots, we collect human-bot dialogues using SOTA personalized chatbots and their competitors:

• P²Bot (Liu et al., 2020) achieved SOTA performance on automatic metrics by incorporating mutual persona perception. P²Bot (w/ Persona) and P²Bot (w/o Persona) are the variants with and without personal information as input when generating responses.

• Lost-In-Conversation topped the human evaluations in ConvAI2 by fine-tuning a pre-trained language model, GPT.
• Language Model (Zhang et al., 2018) is an LSTM-based language model for dialogue.
For each chatbot, we provided interlocutor A's dialogue history as input, and the bot responded as interlocutor B. We performed 60 dialogues and collected 770 utterances for each chatbot. The responses of those chatbots are analyzed along three dimensions:

• PIE measures the extent to which a bot exposes the persona information of the interlocutor it represents.

• DPS is the proportion of disclosed PI descriptions that are sensitive, i.e., |Sensitive Disclosed PI descriptions| / |Disclosed PI descriptions|.
• Hits-at-K (Hits@K) is the percentage of leaked PI that can be retrieved in the top K = 5/10 results using the alignment models.
Perplexity (PPL) and unigram F1 are supplementary metrics that reflect the conversational performance of the bots (Liu et al., 2020). The sketch below illustrates how DPS and Hits@K can be computed.
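The function names and data representations here (Python sets and ranked lists of PI descriptions) are illustrative assumptions; the paper reports these statistics without pseudo-code.

```python
def dps(disclosed_pi, sensitive_pi):
    """DPS: |Sensitive Disclosed PI| / |Disclosed PI| (sets of descriptions)."""
    return len(disclosed_pi & sensitive_pi) / len(disclosed_pi) if disclosed_pi else 0.0

def hits_at_k(ranked_pi, leaked_pi, k=10):
    """Hits@K: fraction of leaked PI descriptions found in the top-K
    results returned by an alignment model."""
    found = sum(1 for pi in leaked_pi if pi in ranked_pi[:k])
    return found / len(leaked_pi) if leaked_pi else 0.0
```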

Empirical Results and Analysis
In this section, we analyze our experimental results. Our experiments are designed to answer the following research questions (RQs):

• RQ1: How well do our alignment models perform in comparison with the competitive baselines?
• RQ2: Why do our alignment models outperform the baselines?
• RQ3: Do the SOTA chatbots disclose PI in dialogues, and is the disclosed PI sensitive? Can we use our alignment models to capture the leakage?

Model Comparison on PERSONA-LEAKAGE
We compare our alignment models, SHARP-MAX and SPARSE-MAX, with the IR baselines and alignment baselines in Table 1. Even MEAN, the simplest trained model, outperforms the IR baselines by more than 10% on all scores, showing that the coarse-grained signal is effective for learning semantic relevance for PI leakage detection. Avg-Max, OPT and LSAP further outperform MEAN by a margin of more than 2% on most of the metrics, as they apply sparsity constraints in order to focus on aligned utterances and PI descriptions during training. Although these approaches set up competitive baselines for our task, SHARP-MAX and SPARSE-MAX achieve consistent improvements on all evaluation metrics. As SPARSE-MAX cuts off the weights of irrelevant pairs, it performs the best.

Analysis on Alignment Model
We visualize the association scores of each alignment model in Figure 3 in order to qualitatively demonstrate the strengths of our models. LSAP attempts to assign a fixed number of aligned pairs, i.e., $\min\{n_i, m_i\}$, which leads to unavoidable false positive alignments in sparse cases (U8-P5, U5-P3 and U4-P4 in Figure 3a, LSAP) and false negative alignments in dense cases (U4-P1 and U4-P2 in Figure 3b, LSAP). Avg-Max-P and Avg-Max-U have a similar drawback, as their number of aligned pairs is exactly the number of columns or rows, regardless of the case. In contrast, SPARSE-MAX and SHARP-MAX manage to adapt the number of 'aligned pairs' (deeply colored) and therefore achieve alignments closer to the ground truth. For SHARP-MAX, we can adjust the sharpness of the weight matrix using the sharpness parameter $\gamma$: a sharper model with a lower $\gamma$ alleviates the influence of pairs with relatively low similarity scores. For SPARSE-MAX, more deterministic alignments are achieved by cutting off pairs with low association scores. Although SHARP-MAX and SPARSE-MAX do not differ much in empirical performance, they are driven by different theories of regularization. The comparison between these two solutions suggests that the similarity function should be designed to find a proper degree of sparsity, which does not depend much on the particular choice of regularizer.

Analysis on Personalized Chatbots
In this section, we analyze the engagement and sensitivity of chatbots in human-bot conversations. The experiments are designed to show the risk of privacy leakage when using current chatbot models. For all generated utterances, we retrieved the top 10 relevant PI descriptions using SPARSE-MAX. Then three annotators were asked to indicate whether the utterances leak any of the retrieved PI descriptions.
The results are summarized in Table 2. Compared with the bots without PI as input, such as the Language Model and P²Bot (w/o Persona), the bots with PI as input, namely Lost-In-Conversation and P²Bot (w/ Persona), exhibit significantly higher PIE. The PIE of P²Bot even approaches that of the human interlocutors. DPS is correlated with PIE, showing that bots with higher PIE generally disclose a higher portion of sensitive PI. Although higher PIE and DPS are expected for the chatbots with PI as input, there is also a significant proportion of leakage for the bots without PI as input, e.g., P²Bot (w/o Persona). This raises serious privacy concerns for future research on PAs. Furthermore, Hits@K measures the ability of our system to detect PI leakage. As a warning module, our model SPARSE-MAX manages to detect most of the utterances leaking PI. Our system achieves around 80% Hits@10 on the responses generated by the two most recent and advanced chatbots, Lost-In-Conversation and P²Bot (w/ Persona).

Related Work
Recently, privacy and fairness have started to attract more and more attention from the NLP community. Sensitive information has been removed from latent representations via adversarial training (Elazar and Goldberg, 2018) and differential privacy (Fernandes et al., 2019), achieving fair decisions. Privacy-aware text rewriting methods generate new sentences with less sensitive information (Xu et al., 2019a; Emmery et al., 2018; Xu et al., 2019b; Strengers et al., 2020); our work serves as a component that detects the sentences requiring rewriting. Another line of research aims to identify mentions of pre-defined semantic categories indicating sensitive information in text (Microsoft; Bevendorff et al., 2019), such as bank accounts and phone numbers. In our setting, sensitive information can be expressed in any syntactic structure, including events conveyed in whole sentences; for example, "I have got less than 5 hours of sleep each night for years" is associated with the persona "I have sleep disorders for many years." The setting of our work is more general, as we focus on open-domain personal information written in natural language, which is not limited to mentions of fixed semantic categories in sentences or sensitivity labels of sentences.
Our work places an emphasis on privacy concerns in conversations (Huang et al., 2020; Ischen et al., 2019; Gao et al., 2018; Tur et al., 2018; Muthukrishnan et al., 2017). In recent research, several works have attempted to improve the engagement and diversity of chit-chat dialogue systems (Liu et al., 2020; Tigunova et al., 2019; Wolf et al., 2019; Zhang et al., 2018) and goal-oriented dialogue systems (Luo et al., 2019; Zhang et al., 2019). With the rapid development of personalized dialogue systems, the PILD module is expected to address the accompanying privacy concerns (Ischen et al., 2019). Welleck et al. (2019) improved the coherence and consistency of dialogues using Natural Language Inference (NLI) (Bowman et al., 2015). Their Dialogue-NLI dataset could be utilized to train retrieval models; however, it does not directly address privacy concerns. In contrast, our dataset i) considers all possible leakage pairs, and ii) includes sensitivity annotations for all PI descriptions.
Our problem setting was inspired by an image-sentence alignment problem, given pairs of image sets and documents (Hessel et al., 2019). Similar problems were also explored in the context of aligning image fragments with words (Jiang et al., 2015). In this paper, we considered utterances and PI descriptions from the same interlocutor as coarse-grained alignment signals, which are in the same modality.

Conclusions and Future Work
We formulate the protection of personal information in conversations as a weakly supervised alignment between personal information and dialogue utterances. To tackle this task, we proposed two new alignment models and created a dataset, PERSONA-LEAKAGE, for evaluation. Our experimental results demonstrate the effectiveness of our methods in comparison with competitive baselines on that dataset. Further analysis of human-bot dialogues demonstrated the potential privacy risks of advanced personalized dialogue techniques. This work is a first step towards fully preventing privacy leakage in text; it still requires PAs or users to select and hide sensitive information. We hope this work and the dataset will pave the way for research on privacy leakage in conversations. In the future, we will explore full-fledged solutions to address the privacy concerns of both humans and dialogue systems.

A Malicious Attack on Siri
We conducted an experiment using Siri installed on an iPad Pro with iPadOS version 13.3.1, released on January 28, 2020. An unauthorized user managed to acquire the owner's personal information by asking Siri questions. The responses by Siri are demonstrated in Figure 4. The user name and home address of the device owner are disclosed to the attacker when asked 'Where do I live?'. The name, partner, and home address of the owner's parents are unveiled when asked 'Who is my father?'. Although Siri presented the answers in the form of contact cards, we argue that such risky reactions by personal assistants could appear in natural language responses as well.

B Details for Data Collection
Starting from the test set of PERSONA, our dataset adds two types of annotations: alignment annotations on utterance-persona pairs and sensitivity annotations on all personal information statements. For both parts, we used Amazon Mechanical Turk (MTurk) for crowdsourcing. We only accepted results from qualified annotators who i) have more than a 90% HIT acceptance rate, ii) have finished more than 100 HITs, and iii) are located in the United States. For further quality control, we rejected 2.1% and 2.0% of unreliable HITs for the alignment annotation and sensitivity annotation respectively, by automatically rejecting HITs that were i) not completed or ii) inconsistent in their answers.
For the alignment annotations, annotators were instructed to "find the personal descriptions leaked in a conversation" by "select if the sentence indicates any of the provided personal descriptions or none of them"; see the task screenshot in Figure 5.
For the sensitivity annotations, annotators were instructed to "give advice to a friend who belongs to a vulnerable group"; see the task screenshot in Figure 6. Sensitive information is defined as information that "your friend rather not let strangers know".
• Sensitive: In most cases, your friend would rather not tell a stranger such information, as it would do more harm than good if the information were utilized by malicious people.
• Non-sensitive: In most cases, it is safe for your friend to share such information with strangers.

C Hyper-parameter Selection
We provide details about the hyper-parameter selection for the baseline alignment models and our models in Table 3. More details about SHARP-MAX are given in Table 4.

D Hits on human-human dialogue
We compare alignment models on Dialogue60, a subset used in our paper, and DialogueTest, the whole test set of PERSONA-LEAKAGE. Overall, SPARSE-MAX achieves the best performance.

[Table 5: Comparison of our alignment models with baselines on human-human conversations using Hits@5/10. Dialogue60 is the subset used in our paper. DialogueTest contains all dialogues in the test set of PERSONA-LEAKAGE. SPARSE-MAX results on Dialogue60 are reported in our paper.]