CHARM: Inferring Personal Attributes from Conversations

Personal knowledge about users’ professions, hobbies, favorite foods, and travel preferences, among others, is a valuable asset for individualized AI, such as recommenders or chatbots. Conversations in social media, such as Reddit, are a rich source of data for inferring personal facts. Prior work developed supervised methods to extract this knowledge, but these approaches cannot generalize beyond attribute values with ample labeled training samples. This paper overcomes this limitation by devising CHARM: a zero-shot learning method that creatively leverages keyword extraction and document retrieval in order to predict attribute values that were never seen during training. Experiments with large datasets from Reddit show the viability of CHARM for open-ended attributes, such as professions and hobbies.


Introduction
Motivation. Personal Knowledge Bases (PKBs) capture individual user traits for customizing downstream applications like chatbots or recommender systems (Balog et al., 2019). A potentially automatic way to populate a PKB is to draw personal knowledge from the user's conversations in social media and dialogues on other platforms. These interactions are a rich source of personal attributes, such as hobbies, professions, cities visited, medical conditions (experienced by the user) and many more. Each of these would consist of key-value pairs, such as cities visited:Paris or symptom:dizziness. However, the large number of potential attributes and their respective values makes this a challenging task. In particular, there is little hope of having training data for each of these key-value pairs. Moreover, the textual cues in user conversations are often implicit and thus difficult to learn.

Example. Consider the user's utterance: "I just visited London, which was a disaster. My hotel was a headache and I spent half the time in bed with a fever... So glad to be back home finishing the masts on my galleon." As humans, we can infer the following attribute-value pairs: (a) cities visited:London, (b) symptom:fever, (c) hobby:model ships. Capturing such user traits is a daunting task, however, as the signals are both explicit and implicit. We need to consider the context "spent in bed with" to infer that fever relates to a disease (as opposed to headache). To predict the user's hobby model ships, we have to pay attention to the cues 'galleon' and 'mast'. Proper inference requires both deep language understanding and background knowledge (e.g., about ships, cities, etc.).
State of the Art and its Limitations. Explicit mentions of attribute-value pairs can be captured by pattern-based methods (e.g., Li et al. (2014); Yen et al. (2019)). Such methods are able to extract London from the previous example by using the pattern "I ... visited ⟨city name⟩". Pattern-based approaches are limited, though, by their inability to consider implicit contexts, such as "finishing the masts on my galleon". Question answering methods can be used to relax rigid patterns (e.g., Levy et al. (2017)), but still rely on explicit mentions of attribute values.
In this work we aim to extract attribute values leveraging both explicit and implicit cues, such as inferring symptom:fever and hobby:model ships. Additionally, we address attributes, such as hobby, with a long-tailed set of values. In principle, deep learning is suitable for such inference (Tigunova et al., 2019; Preoţiuc-Pietro et al., 2015; Rao et al., 2010), but it critically hinges on the availability of labeled training samples for every attribute value that the model should predict. Supervised training is suitable for a pre-specified, limited-scope setting, such as learning personal interests from a fixed list of ten movie genres, but it does not work for large and open-ended sets of possible values, for which there is little hope of obtaining comprehensive training samples. Therefore, we pursue a zero-shot learning approach (Larochelle et al., 2008; Palatucci et al., 2009) that learns from labeled samples for a small subset of labels (i.e., attribute values in our setting) and generalizes to the full set of labels, including values unseen at training time.
Problem Statement. For a given attribute we consider the set of known values V, which can be drawn from lists in dictionary-like sources like Wikipedia. At training time, our method requires samples for a small subset of values S ⊂ V. Typically, the complement V \ S is much larger than S: |V \ S| ≫ |S|. For instance, S may consist solely of the popular values sports, travel, reading, music, games, whereas the complement includes hundreds of long-tail values, such as beach volleyball, model ships, brewing, etc. At inference time we need to predict values from all of V, although most of the values are unseen during training.
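As a concrete illustration of this setting, consider a toy split of hobby values (the values listed appear elsewhere in the text; the split itself is hypothetical):

```python
# Toy illustration of the zero-shot setting for the hobby attribute.
V = {"sports", "travel", "reading", "music", "games",           # popular values
     "beach volleyball", "model ships", "brewing", "quilting"}  # long-tail values
S = {"sports", "travel", "reading", "music", "games"}           # values with training samples
unseen = V - S                                                   # values never seen during training
print(sorted(unseen))  # these must still be predictable at inference time
```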

Approach and Contributions.
We present CHARM, a Conversational Hidden Attribute Retrieval Model, for inferring attribute values in a zero-shot setting. CHARM identifies cues in the user's utterances related to a target attribute, which it then uses to retrieve, from external document collections, texts indicative of different attribute values. These external documents can be gathered by simple web search. They help CHARM to link the cues in the user's utterances to the actual attribute values to predict. CHARM consists of two components: (i) a cue detector, which identifies attribute-relevant keywords in a user's utterances (e.g., galleon), and (ii) a value ranker, which matches these keywords against documents that indicate possible values of the attribute (e.g., model ships).
To evaluate our approach, we conduct experiments predicting Reddit users' professions and hobbies based on their conversational utterances. We demonstrate that CHARM performs well when inferring unseen values and performs competitively with the best-performing baselines when predicting values seen during training. CHARM can easily be extended to other attributes with long-tail values, such as favorite cuisine, preferred news topics or medication taken, by providing a list of known attribute values, training examples for a subset of these values and access to external documents (e.g., via a Web search engine).
The salient contributions of this paper are: (1) a method for inferring both seen and previously unseen (zero-shot) attribute values from a user's conversational utterances; (2) a comprehensive evaluation for the profession and hobby attributes over a large dataset of Reddit discussions; and (3) labeled data and code as resources for later research.


Related Work

User profiling from utterances. There is ample prior work on classification models to predict a user's personal traits based on hand-crafted textual features (Preoţiuc-Pietro et al., 2015; Basile et al., 2017) or with embedding-based representations (Li et al., 2016; Bayot and Gonçalves, 2018; Tigunova et al., 2019). While classification models work well for inferring demographic attributes with a small set of values, such as age, gender or occupational class (Preoţiuc-Pietro et al., 2015; Flekova et al., 2016; Basile et al., 2017), their dependence on seeing all attribute values in (sufficiently many) labeled training samples renders supervised classifiers inappropriate for open-ended attributes such as profession (Tigunova et al., 2019), hobby (Bando et al., 2019) or favorite food, which are often modeled as a binary multilabel task predicting the presence of each attribute value (Welch et al., 2019). Similar to our approach, some studies map user input to Wikipedia concepts (Abel et al., 2011; Krishnamurthy et al., 2014) to predict interests or locations. However, these methods require explicit mentions of the entities.
Pattern-based approaches alleviate the problem of the lack of labeled entities for long-tail classes by employing information extraction techniques to obtain personal attribute values from users' utterances, using sequence labeling methods (Jing et al., 2007;Li et al., 2014) or context classification (Yen et al., 2019). However, their coverage is limited because they require crisp and explicit statements, like "I am a student", which are infrequent in conversations.
Our approach is designed for handling attribute values that were never seen at training time. This is known as the zero-shot learning problem, which has been widely studied in the field of computer vision but less explored in NLP. We employ a technique similar to Ba et al. (2015) for visual classes, which builds image classifiers directly from encyclopedia articles without training images.
Most zero-shot studies for NLP (Wang et al., 2019) deal with machine translation, cross-lingual retrieval and entity/relation extraction (Levy et al., 2017; Pasupat and Liang, 2014), which are not suitable for our task, because they identify values that are explicitly mentioned rather than inferring them. Our task is similar to zero-shot text classification (Yazdani and Henderson, 2015; Zhang et al., 2019), where the class labels are represented as single-word embeddings. We consider a zero-shot BERT baseline (Devlin et al., 2018) that matches utterances with rich document representations.

Keyword extraction from conversational text. Notable applications of keyword extraction from conversational text include just-in-time information retrieval (Habibi and Popescu-Belis, 2015), with continuous monitoring of users' activities (e.g., participation in meetings), generating personalized tags for Twitter users (Wu et al., 2010), and search for relevant email attachments (Van Gysel et al., 2017). Prior work mostly pursued unsupervised approaches, e.g., TextRank (Mihalcea and Tarau, 2004) and RAKE (Rose et al., 2010), due to limited availability of training data. Exceptions use supervised learning, with feature-based classifiers (Kim and Baldwin, 2012) or neural sequence tagging models (Zhang et al., 2016).
Our neural approach lies in between, as we learn to identify salient keywords for a specific attribute (e.g., profession) without having training data of relevant keywords.

Information Retrieval in NLP. Most existing work leveraging Information Retrieval (IR) components to solve NLP tasks has focused on Question Answering (QA) (Kratzwald and Feuerriegel, 2018; Wang et al., 2018; Guu et al., 2020) or dialogue systems (Feng et al., 2019; Luo et al., 2019), where the retrieval part is responsible for ranking the most appropriate answers or responses given a question or chat session. As far as we know, we are the first to leverage a retrieval-based model for inferring attribute values without training samples.

Methodology
Overview. As illustrated in Figure 1, CHARM consists of two stages: cue detection and value ranking. As input CHARM receives a user's utterances U = u_0, ..., u_N that contain a set of terms t_0, ..., t_M, for example, U = {"I stayed late at the library yesterday", "Studied for the exam so I could have better grades than my classmates"}. In the first stage, the term scoring model assigns a score to each term in the user's utterances, yielding l_0, ..., l_M. The highest scoring terms are then selected to form a query Q = q_0, ..., q_K characterizing the user's correct attribute value, e.g., Q = "library studied exam grades classmates" for the profession attribute.
In the second stage, Q is evaluated against an external document collection D = d_0, ..., d_L; each document in D is associated with possible attribute values. Documents such as Wiki:Student and Wiki:Dean's List, which are associated with the attribute value student, would score high with the example query. The score aggregator then ranks the attribute values based on the documents' scores s_0, ..., s_L, for instance, yielding a high attribute score for student given our example utterances. The list of attribute values V is known in advance (e.g., taken from Wikipedia lists); however, potentially only a subset of values S ⊂ V have instances seen during training.
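To make the two-stage flow concrete, the following is a minimal sketch of the pipeline (the function names, the toy term scorer and the overlap-count ranker are illustrative stand-ins, not the paper's implementation):

```python
from collections import defaultdict

def charm_predict(utterances, documents, doc_values, term_scorer, ranker, k=5, agg=max):
    """Two-stage sketch: select top-k cue terms, then rank attribute values.

    utterances  -- list of utterance strings for one user
    documents   -- list of document strings from the external collection
    doc_values  -- attribute value associated with each document
    term_scorer -- callable(terms) -> list of relevance scores (stands in for the cue detector)
    ranker      -- callable(query_terms, document) -> score (stands in for BM25/KNRM)
    """
    terms = [t for u in utterances for t in u.lower().split()]
    scores = term_scorer(terms)
    # Stage 1: the k highest-scoring terms form the query.
    query = [t for t, _ in sorted(zip(terms, scores), key=lambda x: -x[1])[:k]]

    # Stage 2: score every document, then aggregate the scores per attribute value.
    per_value = defaultdict(list)
    for doc, value in zip(documents, doc_values):
        per_value[value].append(ranker(query, doc))
    return sorted(((agg(s), v) for v, s in per_value.items()), reverse=True)

# Toy usage: longer words get higher cue scores, the ranker counts term overlap.
ranking = charm_predict(
    ["I stayed late at the library yesterday", "Studied for the exam to get better grades"],
    ["student exam library degree grades", "doctor hospital patient medicine"],
    ["student", "physician"],
    term_scorer=lambda terms: [len(t) for t in terms],
    ranker=lambda q, d: sum(t in d.split() for t in q),
)
print(ranking)  # student should be ranked above physician
```

The actual instantiations of the cue detector and the ranker are described in the following subsections.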

Cue detection
The term scoring model δ evaluates how useful each word in a given user's utterances is for making a prediction, and assigns real-valued scores l_0, ..., l_M to the terms accordingly. That is, l_j = δ(t_j | t_0, ..., t_M; W), where W denotes the parameters of the model. The term scores l_0, ..., l_M are then used to select the words which will form the query for the value ranking component. The term scoring model should produce high scores for terms that are descriptive of the user and of the attribute in general, rather than of a specific attribute value. This means that it should be able to exploit background knowledge and a term's context to judge its relevance to the attribute. For instance, having seen the phrase "stayed late at the hospital" for the value physician at training time, an ideal model would, at prediction time, correctly estimate the importance of the word 'library' in the phrase "stayed late at the library", even if there were no instances of student in the training set.
BERT (Devlin et al., 2018) is well-suited for this requirement, because it is a sequential model that effectively uses word context and incorporates world knowledge.
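A minimal sketch of such a BERT-based term scorer, assuming the HuggingFace transformers library (details of the paper's implementation, e.g., layer truncation and word-piece-to-word mapping, are omitted):

```python
import torch
from transformers import BertModel, BertTokenizerFast

class TermScorer(torch.nn.Module):
    """Assigns an attribute-relevance score to every (word-piece) token in context."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.score = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.score(hidden).squeeze(-1)  # one score per token

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
scorer = TermScorer()
batch = tokenizer(["I stayed late at the library yesterday"], return_tensors="pt")
with torch.no_grad():
    token_scores = scorer(batch["input_ids"], batch["attention_mask"])
print(token_scores.shape)  # (1, number_of_word_piece_tokens)
```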
For the description that follows, let us suppose the cue detector picks the words Q = q_0, ..., q_K as our query terms for CHARM's value ranking stage. A typical query would consist of the terms associated with the correct attribute value (e.g., Q = "library studied exam grades classmates").

Value ranking
The second stage of the model consists of two steps: first, using the selected query terms to rank the documents in the external collection; and second, aggregating document scores to predict values.

Document ranking. The ranking component takes two inputs: query terms Q = q_0, ..., q_K resulting from the cue detector and an (automatically labeled) document collection D = d_0, ..., d_L. The document collection could be a set of Web pages, where each page indicates a specific attribute value, v_0, ..., v_L. For example, by issuing a search-engine query "hobby ⟨value⟩" we can gather web pages related to specific hobbies.
The ranker ρ(Q, d_k) evaluates the query Q, constructed by the cue detector, against each document d_k in the document collection to produce document relevance scores s_0, ..., s_L. For the example query "library studied exam grades classmates", the document Wiki:Dean's List labeled with student will get a higher score than Wiki:Junior doctor (for physician). We consider two particular instantiations of the ranker: BM25 (Robertson et al., 1995) and KNRM (Xiong et al., 2017). BM25 is a strong unsupervised retrieval model, whereas KNRM is an efficient neural retrieval model that can consider semantic similarity via term embeddings in addition to considering exact matches of query terms.
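A minimal sketch of the document-ranking step with BM25, using the third-party rank_bm25 package as a stand-in (the document texts here are toy placeholders):

```python
from rank_bm25 import BM25Okapi  # off-the-shelf BM25, used here only as a stand-in

documents = {
    "Wiki:Dean's List":   "students honors grade point average exam university semester",
    "Wiki:Junior doctor": "hospital physician medical training patients residency",
}
doc_ids = list(documents)
bm25 = BM25Okapi([documents[d].split() for d in doc_ids])

query = "library studied exam grades classmates".split()
for doc_id, score in zip(doc_ids, bm25.get_scores(query)):
    print(doc_id, round(float(score), 3))  # the student-related page should score higher
```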
Document score aggregation. The document scores s_0, ..., s_L obtained from the ranker are then aggregated to produce scores for each known attribute value. Depending on the document collection used, each attribute value may be represented by several documents. For example, the student attribute value may be associated with documents Wiki:Dean's List, Wiki:Master's degree, etc. In this case, the scores per document have to be aggregated to form the final scores a_0, ..., a_T for each attribute value in V. In our experiments, we consider the following aggregation techniques: (i) average (which allows multiple documents to contribute to the final ranking) and (ii) max (which may help when the document collection is noisy and we care only about the top-scoring document for each value). Having obtained the final attribute scores a_0, ..., a_T, we sort them to get the top value as the model prediction.

Training
While predicting attribute values is not inherently a reinforcement learning problem, we utilize the REINFORCE policy gradient method (Sutton et al., 2000) to train the cue detector component, because there are no labels indicating which input terms should be selected. This allows the cue detector to be trained based on the correct attribute values despite the non-differentiable argmax operation needed to identify the K top-scoring terms from the scores it outputs.
When using the policy gradient method, the state in our system is represented by a sequence of input terms t_0, ..., t_M. Each of the input terms also represents an independent action. The term scoring model acts as the policy, which outputs the term selection probabilities based on the current state. Then a term is sampled (at training time) or the term with maximum probability is selected (at prediction time) and added to the query.
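A minimal sketch of one training episode under this formulation; the reward function (the nDCG of the correct value after value ranking, as defined below) is abstracted behind a placeholder:

```python
import torch

def training_episode(term_logits, reward_fn, K=5):
    """One REINFORCE episode: sample K query terms without replacement.

    term_logits -- 1-D tensor of unnormalized term scores from the cue detector
    reward_fn   -- callable(selected_indices) -> float, e.g. the nDCG of the correct
                   attribute value after value ranking (kept abstract here)
    """
    mask = torch.zeros_like(term_logits)                 # will hold -inf for chosen terms
    selected, loss = [], torch.zeros(())
    for _ in range(K):
        probs = torch.softmax(term_logits + mask, dim=-1)
        idx = torch.multinomial(probs, 1).item()         # sample one term
        selected.append(idx)
        reward = reward_fn(selected)                     # intermediate feedback after each term
        loss = loss - reward * torch.log(probs[idx])     # maximizes sum_t r_t * log p_t
        mask = mask.clone()
        mask[idx] = float("-inf")                        # sampling without replacement
    return loss, selected

# Toy usage with a dummy reward; gradients flow back into the term scores.
logits = torch.randn(12, requires_grad=True)
loss, query_indices = training_episode(logits, reward_fn=lambda sel: 1.0 / len(sel))
loss.backward()
```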
During training, we form the query by sampling terms without replacement, one word at a time. After sampling each term, we issue the current query and obtain intermediate feedback. The training episode ends when the query reaches its maximum length K. We define the reward r_τ for an intermediate query to be the normalized discounted cumulative gain (the nDCG ranking metric) of the correct attribute values' scores after aggregation at timestep τ. The objective of REINFORCE is to maximize J = Σ_{τ=1}^{K} r_τ · log p_τ by updating the weights of the policy network (where p_τ is the probability of selecting the term chosen at timestep τ).


Data

The datasets used in our experiments cover two types of input: (i) users' utterances along with their corresponding attribute-value pairs (e.g., hobby:brewing from the example in Figure 2), and (ii) a collection of documents associated with each attribute value (e.g., documents describing brewing as a hobby). We consider two exemplary attributes: profession and hobby. We define lists of their attribute values based on Wikipedia lists.

Users' utterances
We consider publicly-available Reddit submissions and comments from 2006 to 2018 as users' utterances. Given a Reddit user with a set of utterances U = u_0, ..., u_N, we aim to label the user with a set of profession and hobby values, based on explicit personal assertions (e.g., "I work as a doctor") found in the user's posts. To label the candidate users with attribute values we utilized the Snorkel framework (Ratner et al., 2017). We provide details on our data labeling using Snorkel in Appendix A.1.
For our experiments, we removed all posts containing the explicit personal assertions that we used for labeling each user, because we want to test the ability of CHARM to predict attribute values by inference, as opposed to explicit pattern extraction. The final dataset consists of 6000 users per attribute, with a maximum of 500 and an average of 23 users per attribute value. The number of attribute values is 149 for hobby and 71 for profession.
We evaluated the quality of Snorkel labeling on a held-out validation set, which we manually annotated. The validation set contains roughly 100 users per attribute, and was annotated with attribute values agreed by at least two out of three judges. The labeling obtained by Snorkel corresponded to 0.9 precision on the validation set. To demonstrate that Snorkel provides the same level of quality as crowdsourcing, we calculated the precision of human annotators on the same validation set by comparing the labels of each annotator against the agreement labels. The obtained precision scores were 0.91 for profession and 0.88 for hobby, demonstrating that Snorkel is a reasonable alternative.

Document collection
The scope of possible attribute values may be open-ended in nature and thus calls for an automatic method for collecting Web documents. In this work, we consider three different Web document collections; summary statistics on the number of documents per attribute value are provided in Table 1. Each document may be associated with multiple attribute values. To provide more diversity and comprehensiveness, we augmented our pre-defined lists of known attribute values with their synonyms and hyponyms. Note that the approaches used to construct the document collections are straightforward and easily applicable to further attributes, such as favorite travel destination or favorite book genre.
Wikipedia pages (Wiki-page). To create this collection we take the lists of known attribute values and automatically retrieve a Wikipedia page corresponding to each value, which usually coincides with the article title (e.g., Wiki:Barista).

Wikipedia pages-extended (Wiki-category). This collection is an extension of Wiki-page that additionally includes pages found using Wikipedia categories. This allows us to include pages about concepts related to the attribute values, such as tools used for a profession and the profession's specializations. To construct Wiki-category, we identified at least one relevant category for each attribute value and included all leaf pages under the category (i.e., without descending into subcategories).
Web search. To create this collection we queried a Web search engine using attribute-specific patterns: "my profession as ⟨profession value⟩" and "my favorite hobby is ⟨hobby value⟩". The collection consists of the top 100 documents returned for each value. Such patterns can be created with low effort by evaluating a few sample queries. Alternatively, patterns could be mined from a corpus or simplified to the generic form "⟨attribute value⟩".
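A small sketch of how such value-specific search queries could be generated (the value lists and pattern strings below are toy examples; the call to the search engine itself is omitted):

```python
# Hypothetical sketch of generating the value-specific search queries.
patterns = {
    "profession": 'my profession as "{value}"',
    "hobby": 'my favorite hobby is "{value}"',
}
values = {
    "profession": ["barista", "screenwriter"],
    "hobby": ["brewing", "quilting"],
}

for attribute, value_list in values.items():
    for value in value_list:
        query = patterns[attribute].format(value=value)
        print(f"{attribute}/{value}: {query}")  # top ~100 results would form the collection
```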

Experimental Setup
We evaluate the proposed method's performance in two experimental settings. First, we consider a zero-shot setting in which the attribute values in the training and test data are completely disjoint (i.e., the test set only contains unseen labels). This setting evaluates how well CHARM can predict attribute values that were not observed during training. Second, we consider the standard classification scenario in which all attribute values are seen as labels in both training and test sets. This demonstrates that CHARM's performance in a normal classification setting does not substantially degrade because of its architecture. Experimental setup details differ for these two evaluation settings and are discussed in the following subsections. All our models were implemented in PyTorch; technical details are in Appendix B. The code and labeled datasets will be made publicly available upon acceptance.
Training and test data. For the unseen experiments, we perform ten-fold cross-validation with folds constructed such that each attribute value appears in only one test fold. Each of the folds contains roughly the same number of users and approximately 2-4 unique attribute values. We assigned users having multiple attribute values to the fold corresponding to one of their randomly chosen values. For the experiments with seen values, we randomly split the users into training and test sets in a 9:1 proportion, respectively.
Hyperparameters. BERT, the term selection component, generates a contextualized embedding for each input term, which we process with a fully connected layer to produce a term score for each word in its context. Specifically, we use the pre-trained BERT base-uncased model with 12 transformer layers. To reduce BERT's computational requirements, we discard the last 6 transformer layers (i.e., we use embeddings produced by the earliest 6 layers), after observing in pilot experiments that this outperformed a distilled BERT model (Sanh et al., 2019). Following prior work (Hui et al., 2018), KNRM was trained with frozen word2vec embeddings on data from the 2011-2014 TREC Web Track, with the 2009-2010 years for validation. We initialize KNRM with these pre-trained weights.
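A minimal sketch of the layer truncation described above, assuming the HuggingFace transformers implementation of BERT (the paper does not specify the exact mechanism used):

```python
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")  # 12 transformer layers
bert.encoder.layer = bert.encoder.layer[:6]            # keep only the earliest 6 layers
bert.config.num_hidden_layers = 6                      # keep the config consistent

token_ids = torch.tensor([[101, 1045, 2994, 2397, 102]])  # toy ids: [CLS] ... [SEP]
hidden = bert(input_ids=token_ids).last_hidden_state      # embeddings from the truncated model
print(hidden.shape)                                       # torch.Size([1, 5, 768])
```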
During training, we sample 5 negative labels (i.e., incorrect attribute values) to be ranked when calculating the nDCG reward. For each label, we sample a subset of 15 documents to represent the label (i.e., attribute value). If the document collection has fewer than 15 documents for a label (e.g., Wiki-page), we consider all the label's available documents. When making predictions, we consider all documents and all labels (values). In both settings, we truncate documents to 800 terms when using KNRM for efficiency and use the full documents with BM25. We use ten-fold cross-validation on the training data to optimize the following hyperparameters in a grid search: (i) document aggregation strategy (average vs. max); (ii) length of query; and (iii) maximum number of epochs. Further details on the hyperparameter search are in Appendix B.
Baselines. For the unseen experiments, we evaluate CHARM's performance against an end-to-end BERT ranking method and against a BM25 (Robertson and Zaragoza, 2009) ranker combined with two state-of-the-art unsupervised keyword extraction methods: TextRank and RAKE. We additionally include a baseline giving the user's full utterances as input to BM25 (baseline: No-keyword).
Following related work (Nogueira and Cho, 2019; Dai and Callan, 2019), we train the BERT IR baseline using a binary cross-entropy loss to predict the relevance of each document to the user's utterances (acting as queries). We use the same pre-trained BERT model as in CHARM. To fit both utterances and documents into the input size of BERT, we split both into 256-token chunks and run BERT on their Cartesian product. To obtain the final score for each utterances-document pair we average across all chunk pairs. Given N utterances and M documents, this baseline processes N × M inputs with BERT, whereas CHARM processes N inputs with BERT and M inputs with an efficient ranking method. This makes the BERT IR baseline very computationally expensive on the Wiki-category and Web search document collections, which contain 4,000-12,000 documents. In order to run the baseline on these collections, we sample three documents per label; even with this change, BERT IR is 60x slower than CHARM. More details on the models' running time are in Appendix B. We use the full document collection with Wiki-page.

For the seen experimental setup, we compare CHARM with both state-of-the-art supervised approaches for inferring attribute values and a fine-tuned supervised BERT model that performs classification using its [CLS] representation. The Hidden Attribute Model (HAM 2attn) (Tigunova et al., 2019) is an attention-based neural classification model for inferring users' attribute values. N-GrAM (Basile et al., 2017) is an SVM classifier with n-gram features. W2V-C (Preoţiuc-Pietro et al., 2015) is a Gaussian Process (GP) classifier with embedding clusters as features. Finally, we include a neural CNN-based model (Bayot and Gonçalves, 2018). In this setup the baseline models are single-value; therefore, we split every multi-value user into several inputs, one per attribute value.
Evaluation metrics. Given the difficulty of inferring the correct attribute values for an attribute with many possible values, ranking metrics are the most informative and have been used in prior work (Tigunova et al., 2019;Preoţiuc-Pietro et al., 2015). We consider MRR (Mean Reciprocal Rank) and nDCG (normalized Discounted Cumulative Gain). Given that MRR assumes there is only one correct attribute value for each user, we calculate MRR independently for each attribute value before averaging. We average nDCG over users.
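A minimal sketch of this per-value MRR computation on toy rankings (each user is assumed to have a single gold value here):

```python
from statistics import mean

def per_value_mrr(rankings, gold):
    """MRR computed independently per attribute value, then macro-averaged.

    rankings -- one ranked list of attribute values per user
    gold     -- the correct attribute value per user
    """
    by_value = {}
    for ranked, g in zip(rankings, gold):
        by_value.setdefault(g, []).append(1.0 / (ranked.index(g) + 1))
    return mean(mean(rr_list) for rr_list in by_value.values())

rankings = [["student", "physician", "barista"],
            ["barista", "student", "physician"],
            ["physician", "barista", "student"]]
gold = ["student", "student", "physician"]
print(per_value_mrr(rankings, gold))  # mean(mean(1, 1/2), mean(1)) = 0.875
```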
Results and Discussion

Quantitative Results
Unseen values (zero-shot mode). The models' performance, evaluated only on values that were not observed during training, is shown in Table 2. CHARM outperforms the keyword-extraction baselines, which rely on the more general keywords usually given by unsupervised keyword extractors. BERT IR performs similarly to CHARM for the Wiki-page dataset, but performs significantly worse for the remaining datasets while taking approximately 60x longer than CHARM KNRM to perform inference. For both attributes, CHARM KNRM always outperforms the BM25 variant with the Wiki-category and Web search collections. This may be related to the size of these document collections, which allow for more variation in vocabulary that is captured well by KNRM's term embeddings. Another observation is that for CHARM KNRM, while Web search yields the best result for profession, Wiki-category is the best collection for hobby, possibly due to the noisy hobby-related documents from web search. CHARM BM25 on Wiki-page does not require any additional inputs and consistently performs as well as or better than the baselines across both attributes. Wiki-category performs significantly better than all baselines for both attributes, making it a reasonable choice when Wikipedia categories are available.
To demonstrate that the collections are resilient to inaccuracies in their automatic construction, we conducted an experiment where some percentage of the documents' attribute values were randomly changed. We found that randomly changing 20% of the documents' labels resulted in approximately a 15% MRR decrease for CHARM KNRM on Web search and Wiki-category. The performance decrease on these collections was roughly linear. This indicates that noise in the document collection does not severely damage CHARM's performance.
Seen values (supervised mode). In this experiment we evaluate CHARM's performance in the fully supervised setting (i.e., all labels are seen during training). In Table 3 we observe that CHARM's performance is competitive compared to HAM 2attn (i.e., the best-performing attribute value prediction method from prior work) and the state-of-the-art BERT model. The fully supervised BERT model consistently performs the best for both attributes, though these increases are not statistically significant over all CHARM configurations. Furthermore, BERT and HAM 2attn are trained with full supervision in this experimental setting, whereas CHARM still uses a policy gradient. In this experiment, the Web search collection consistently performs best, suggesting that the collection's shortcomings are mitigated when all labels are observed.

Qualitative Analysis
Analysis of selected terms. For each attribute value, we gathered all query terms that were selected for the users predicted as having that attribute value, together with the scores given by the cue detector. We then averaged the scores for each term within an attribute value, and selected the top 10 terms as the representative ones. Terms were extracted using CHARM KNRM with Wiki-category in the unseen experiments. We applied the same procedure to TextRank keywords, because this was the best-performing keyword-based baseline in the unseen experiments. The comparison of the terms selected by CHARM vs. TextRank is reported in Table 4 and Table 5 for selected attribute values of profession and hobby, respectively.
We can observe that, despite the small sample size for some values like airplane pilot, CHARM can still detect meaningful words. For barista, CHARM did not even select the term 'barista', but rather focused on words such as 'coffee' and 'starbucks'. Choosing terms like 'screenplay', 'scripts' and 'screenwriting' helps the model to distinguish screenwriter from other film-related professions like director.
Picking terms like 'cake', 'baking' and 'bread' helps the model to distinguish between the baking and cooking hobbies more effectively. Note that even for rare, unusual hobbies like quilting, CHARM manages to pick indicative terms. This shows that the model can easily be used for large lists of attribute values with a long tail.
Finally, as opposed to CHARM, TextRank keywords rarely make sense. This suggests that unsupervised keyword detectors are not capable of producing useful attribute-value-related keywords from users' utterances.

Misclassification Study
To conduct error analysis, we plotted confusion matrices of CHARM KNRM in the unseen experiments, which are shown in Figures 3a and 3b for profession and hobby, respectively. We observe that medical professions such as dentist, nurse, pharmacist and surgeon are often confused with doctor in general. Professions associated with studying (academic, teacher and student), beauty (hairdresser and tattoo artist) and art (musician and poet) are often confused with each other. Salesman and accountant are confused with broker, because of the common financial terms used.
Hobbies associated with music (dancing, singing and music) and images (painting, graphic design and photography) are often mixed up. Hobbies in which the term 'game' is profusely used, like chess and baseball, are confused with board games; similarly, fishing and fish keeping, as well as skiing and snowboarding, are confused due to the common lexicon used.

Analysis of top-ranked documents. For each attribute value, we collected all documents that were returned for a user with the given value as the ground-truth label. We then averaged the scores for each page and selected the top 5 retrieved pages from Wiki-category, shown in Table 6 for selected profession and hobby attribute values.
It is interesting to observe that, in spite of the common lexicon for some similar values, the model manages to retrieve documents which are relevant to a particular value; e.g., the documents for investor are distinct from those for other finance-related professions, like broker or salesman. It is also worth mentioning that the retrieved pages for investor and ice hockey are pages about related concepts (venture capital, playoff beard), which shows the power of CHARM's cue detection.

Conclusion
We presented the CHARM method for inferring personal traits from conversations. CHARM differs from prior work by its zero-shot ability to predict attribute values that are not present in the training samples at all. We demonstrated the viability of CHARM for inferring users' unseen attribute values by comprehensive experiments with Reddit conversations, leveraging document collections from Wikipedia and web search results for CHARM's retrieval component.

Appendices

A Data
All datasets used in the experiments are available at https://github.com/Anna146/CHARM. We provide IDs and texts of the posts used as training and test data for CHARM. All users are anonymized by replacing usernames with IDs. Additionally, we provide the posts containing explicit personal assertions, which have been used for ground truth labeling with the Snorkel framework.

A.1 Labeling users' utterances with Snorkel
Our data consists of submissions on Reddit which are: (1) authored by users having 10-50 posts, (2) 10-40 words long, and (3) containing a personal pronoun (except for 3rd-person ones). Requirements (1) and (2) were derived from observing the distributions on the full dataset. Requirement (3) comes from the assumption that posts containing personal pronouns are most likely to contain personal assertions. These restrictions allow us to select posts that look more similar to real conversations (i.e., relatively short and containing references to the speakers with personal pronouns). In addition, we did not consider the following subreddit types: (i) dating, which may provide plenty of personal information but no real conversation to infer from, and (ii) fantasy/video games (for the profession attribute), because users may refer to gaming personalities. We took only users whose utterances contain at least one mention of an attribute value, resulting in around 250K and 500K candidate users for profession and hobby, respectively.
We used the Snorkel framework (Ratner et al., 2017), which allows data labeling with weak supervision by combining multiple manually specified, potentially noisy labeling functions. Given a user's utterance set U, an attribute a and a possible attribute value v, Snorkel decides on a positive/negative label (denoting the user as having/not having the personal trait a:v) or an abstain label. We have a separate labeling model for each attribute a, and defined two labeling functions which consider: (LF1) the existence of attribute-specific patterns, and (LF2) the weighted count of words belonging to the value-specific lexicon.

LF1: Attribute-specific patterns. We compiled a list of positive and negative patterns for each attribute (see Table 7), e.g., "my hobby is ⟨hobby value⟩" vs. "I hate ⟨hobby value⟩" as positive vs. negative patterns for hobby. LF1 labels a user with a positive/negative label for each attribute value v if there exists at least one positive/negative pattern in the user's utterances U, and abstains otherwise.

LF2: Value-specific lexicon. For each attribute-value pair, we used Empath (Fast et al., 2016), pre-trained on the Reddit corpus, to build a lexicon of typical words (e.g., 'cider' and 'yeast' for hobby:brewing). Given seed words, Empath builds lexical categories by means of an embedding model. As our value-specific lexicon, we took the union of Empath terms for a specific attribute value and all its synonyms; each typical word is weighted by its embedding similarity to the seed words. Given a user's utterance set U and an attribute value v, LF2 yields a positive label if the weighted count of typical words of v is above an empirically-chosen threshold, and abstains otherwise.
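A minimal sketch of the two labeling functions with toy stand-ins for the pattern list and the Empath lexicon (in the actual pipeline these would be wrapped as Snorkel labeling functions and combined by its label model):

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Toy stand-ins for the real resources (patterns from Table 7, Empath-based lexicon).
POS_PATTERNS = ["my hobby is {v}", "i love {v}"]
NEG_PATTERNS = ["i hate {v}"]
LEXICON = {"brewing": {"cider": 0.9, "yeast": 0.8, "hops": 0.7}}
THRESHOLD = 1.0

def lf1_patterns(utterances, value):
    """LF1: label based on explicit positive/negative assertion patterns."""
    text = " ".join(utterances).lower()
    if any(p.format(v=value) in text for p in NEG_PATTERNS):
        return NEGATIVE
    if any(p.format(v=value) in text for p in POS_PATTERNS):
        return POSITIVE
    return ABSTAIN

def lf2_lexicon(utterances, value):
    """LF2: positive if the weighted count of value-specific lexicon words is high enough."""
    words = " ".join(utterances).lower().split()
    weighted_count = sum(LEXICON.get(value, {}).get(w, 0.0) for w in words)
    return POSITIVE if weighted_count > THRESHOLD else ABSTAIN

user_posts = ["I brewed a batch with fresh hops", "the yeast and cider smell great"]
print(lf1_patterns(user_posts, "brewing"), lf2_lexicon(user_posts, "brewing"))  # -1 1
```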
Given a pair of user's utterance set U and a possible attribute value v, the Snorkel probabilistic labeling model utilizes our labeling functions to predict a confidence score for the positive label, i.e., the user is labeled with attribute value v. As our labeled dataset, we took only the user-value pairs with confidence scores above a specific threshold.
To determine the threshold of confidence scores, we manually annotated a held-out validation set containing 100 users per attribute. Given a post and a set of attribute values mentioned explicitly in the post, the annotators must identify whether the candidate user traits truly hold. For instance, from "My dad bought me a chess board even though I enjoy video games more", hobby:video games is correct while hobby:chess is not applicable. The final annotation for each post consists of attribute values agreed by at least two out of three judges. The selected confidence threshold corresponds to the 0.9 precision of the model on the validation set. After thresholding, we obtained 13.5k users labeled with profession values and 11.7k users with hobby values.
Finally, for practical reasons, for each attribute we sorted the labeled users by confidence scores and cropped the set to a maximum of 500 users per attribute value and 6000 users in total. Note that users might have multiple values for each attribute (e.g., having brewing and swimming as hobbies); there are 605 such users for profession and 245 for hobby.

Table 7: Positive and negative patterns used in the labeling function LF1 of the Snorkel labeling model. Each pattern must be followed by possible attribute values within a context window of 2 terms.

B Training details and hyperparameters
In our experiments we used a server with 32 cores (2x Intel Xeon Gold 6242, 16C/32T, 22MB) and 2 NVIDIA Tesla V100 (GV100) GPUs. On this server our models ran fast compared to the baseline BERT IR architecture, as shown in Table 8. BERT IR inference is slow because, for a single utterance-document pair, it makes several passes through BERT for each chunk combination, which is repeated for every document. CHARM runs BERT once on each utterance only, independent of the number of documents. Using BM25 as a ranker is slower because it requires iterating through the query-document inputs to calculate term frequencies, whereas KNRM uses efficient vectorized representations of the inputs. However, it is possible to speed up BM25 inference by providing a precomputed inverted index.
The numbers of parameters in the CHARM KNRM model are shown in Table 9. We used manual tuning to search for the hyperparameters, running about 280 search trials per attribute and collection combination. Several hyperparameters were fixed across different setups (across attributes, document collections and rankers) and some were tuned for each setup individually. The bounds for each hyperparameter and the best parameters are given in Tables 10 and 11; the best parameters were chosen based on the MRR score. The searched ranges included the document aggregation strategy (average vs. max), the number of training epochs (1-50 in steps of 2), and the query length (10-25 in steps of 5). Additionally, we performed some experiments on changing the policy gradient training setup, adding a discounting factor to the reward after each sampled query term and changing the reward from nDCG to MRR. We found that the results after these modifications did not change significantly.