Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings

We study a symmetric collaborative dialogue setting in which two agents, each with private knowledge, must strategically communicate to achieve a common goal. The open-ended dialogue state in this setting poses new challenges for existing dialogue systems. We collected a dataset of 11K human-human dialogues, which exhibits interesting lexical, semantic, and strategic elements. To model both structured knowledge and unstructured language, we propose a neural model with dynamic knowledge graph embeddings that evolve as the dialogue progresses. Automatic and human evaluations show that our model is both more effective at achieving the goal and more human-like than baseline neural and rule-based models.


Introduction
Current task-oriented dialogue systems (Young et al., 2013; Wen et al., 2017; Dhingra et al., 2017) require a pre-defined dialogue state (e.g., slots such as food type and price range for a restaurant searching task) and a fixed set of dialogue acts (e.g., request, inform). However, human conversation often requires richer dialogue states and more nuanced, pragmatic dialogue acts. Recent open-domain chat systems (Shang et al., 2015; Serban et al., 2015b; Sordoni et al., 2015; Li et al., 2016a; Lowe et al., 2017; Mei et al., 2017) learn a mapping directly from previous utterances to the next utterance. While these models capture open-ended aspects of dialogue, the lack of structured dialogue state prevents them from being directly applied to settings that require interfacing with structured knowledge.
In order to bridge the gap between the two types of systems, we focus on a symmetric collaborative dialogue setting, which is task-oriented but encourages open-ended dialogue acts. In our setting, two agents, each with a private list of items with attributes, must communicate to identify the unique shared item. Consider the dialogue in Figure 1, in which two people are trying to find their mutual friend. By asking "do you have anyone who went to columbia?", B is suggesting that she has some Columbia friends, and that they probably work at Google. Such conversational implicature is lost when interpreting the utterance as simply an information request. In addition, it is hard to define a structured state that captures the diverse semantics in many utterances (e.g., defining "most of", "might be"; see details in Table 1).

Figure 1: An example dialogue from the MutualFriends task in which two agents, A and B, each given a private list of friends, try to identify their mutual friend. Our objective is to build an agent that can perform the task with a human. Cross-talk (Section 2.3) is italicized.
To model both structured and open-ended context, we propose the Dynamic Knowledge Graph Network (DynoNet), in which the dialogue state is modeled as a knowledge graph with an embedding for each node (Section 3). Our model is similar to EntNet (Henaff et al., 2017) in that node/entity embeddings are updated recurrently given new utterances. The difference is that we structure entities as a knowledge graph; as the dialogue proceeds, new nodes are added and new context is propagated on the graph. An attention-based mechanism (Bahdanau et al., 2015) over the node embeddings drives generation of new utterances. Our model's use of knowledge graphs captures the grounding capability of classic task-oriented systems and the graph embedding provides the representational flexibility of neural models.

arXiv:1704.07130v1 [cs.CL] 24 Apr 2017
The naturalness of communication in the symmetric collaborative setting enables large-scale data collection: we were able to crowdsource around 11K human-human dialogues on Amazon Mechanical Turk (AMT) in less than 15 hours. We show that the new dataset calls for more flexible representations beyond fully-structured states (Section 2.2).
In addition to conducting the third-party human evaluation adopted by most work (Liu et al., 2016; Li et al., 2016b,c), we also conduct partner evaluation (Wen et al., 2017) where AMT workers rate their conversational partners (other workers or our models) based on fluency, correctness, cooperation, and human-likeness. We compare DynoNet with baseline neural models and a strong rule-based system. The results show that DynoNet can perform the task with humans efficiently and naturally; it also captures some strategic aspects of human-human dialogues.
The contributions of this work are: (i) a new symmetric collaborative dialogue setting and a large dialogue corpus that pushes the boundaries of existing dialogue systems; (ii) DynoNet, which integrates semantically rich utterances with structured knowledge to represent open-ended dialogue states; (iii) multiple automatic metrics based on bot-bot chat and a comparison of third-party and partner evaluation.

Symmetric Collaborative Dialogue
We begin by introducing a collaborative task between two agents and describe the human-human dialogue collection process. We show that our data exhibits diverse, interesting language phenomena.

Task Definition
In the symmetric collaborative dialogue setting, there are two agents, A and B, each with a private knowledge base, $KB_A$ and $KB_B$, respectively. Each knowledge base includes a list of items, where each item has a value for each attribute. For example, in the MutualFriends setting (Figure 1), items are friends and attributes are name, school, etc. There is a shared item that A and B both have; their goal is to converse with each other to determine the shared item and select it. Formally, an agent is a mapping from its private KB and the dialogue thus far (sequence of utterances) to the next utterance to generate or a selection. A dialogue is considered successful when both agents correctly select the shared item. This setting has parallels in human-computer collaboration where each agent has complementary expertise.
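As a concrete illustration of the task definition, the following minimal sketch represents each private KB as a list of items (attribute-to-value dictionaries) and checks the success condition. The names `Item`, `shared_items`, and `is_successful`, as well as the toy friend data, are illustrative assumptions and are not taken from the released code.

```python
from typing import Dict, List

# Hypothetical item representation: attribute name -> attribute value.
Item = Dict[str, str]


def shared_items(kb_a: List[Item], kb_b: List[Item]) -> List[Item]:
    """Return the items that appear in both private KBs."""
    return [item for item in kb_a if item in kb_b]


def is_successful(kb_a: List[Item], kb_b: List[Item],
                  selection_a: Item, selection_b: Item) -> bool:
    """A dialogue succeeds only when both agents select the unique shared item."""
    shared = shared_items(kb_a, kb_b)
    return (len(shared) == 1
            and selection_a == shared[0]
            and selection_b == shared[0])


# Toy scenario with made-up values, in the spirit of MutualFriends.
kb_a = [
    {"name": "jessica", "school": "columbia", "company": "google"},
    {"name": "josh", "school": "columbia", "company": "apple"},
]
kb_b = [
    {"name": "jessica", "school": "columbia", "company": "google"},
    {"name": "kathy", "school": "stanford", "company": "google"},
]
mutual = shared_items(kb_a, kb_b)[0]
print(is_successful(kb_a, kb_b, mutual, mutual))
```

An agent in this framing would be any function from its own KB and the utterances so far to either a new utterance or a selected item; the sketch only covers the bookkeeping around it.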

Data collection
We created a schema with 7 attributes and approximately 3K entities (attribute values). To elicit linguistic and strategic variants, we generate a random scenario for each task by varying the number of items (5 to 12), the number of attributes (3 or 4), and the distribution of values for each attribute (skewed to uniform). See Appendix A and B for details of schema and scenario generation. We crowdsourced dialogues on AMT by randomly pairing up workers to perform the task within 5 minutes. Our chat interface is shown in Figure 2. To discourage random guessing, we prevent workers from selecting more than once every 10 seconds. Our task was very popular and we col-

Dataset statistics
We show the basic statistics of our dataset in Table 3. An utterance is defined as a message sent by one of the agents. The average utterance length is short due to the informality of the chat; however, an agent usually sends multiple utterances in one turn. Some example dialogues are shown in Table 6 and Appendix I. We categorize utterances into coarse types (inform, ask, answer, greeting, apology) by pattern matching (Appendix E). Of all utterances, 7.4% are multi-type and 30.9% contain more than one entity. In Table 1, we show example utterances with rich semantics that cannot be sufficiently represented by traditional slot-values. Some of the standard ones are also non-trivial due to coreference and logical compositionality.
Our dataset also exhibits some interesting communication phenomena. Coreference occurs frequently when people check multiple attributes of one item. Sometimes mentions are dropped, as an utterance simply continues from the partner's utterance. People occasionally use external knowledge to group items with out-of-schema attributes (e.g., gender based on names, location based on schools). We summarize these phenomena in Table 2. In addition, we find that 30% of utterances involve cross-talk, where the conversation does not progress linearly (e.g., italic utterances in Figure 1), a common characteristic of online chat (Ivanovic, 2005).
One strategic aspect of this task is choosing the order of attributes to mention. We find that people tend to start from attributes with fewer unique values, e.g., "all my friends like morning" given the $KB_B$ in Table 6, as intuitively it would help exclude items quickly given fewer values to check. We provide a more detailed analysis of strategy in Section 4.2 and Appendix F.
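The preference for attributes with fewer unique values can be made concrete with a small sketch that orders the attributes of a KB by how many distinct values they take. The function name `attributes_by_unique_values` and the toy KB are hypothetical; this illustrates the heuristic only and is not the analysis code behind the reported numbers.

```python
from typing import Dict, List


def attributes_by_unique_values(kb: List[Dict[str, str]]) -> List[str]:
    """Order attributes from fewest to most distinct values in the KB."""
    attributes = list(kb[0].keys())
    counts = {a: len({item[a] for item in kb}) for a in attributes}
    # sorted() is stable, so ties keep the original attribute order.
    return sorted(attributes, key=lambda a: counts[a])


# Toy KB (made-up values): every friend prefers morning.
kb = [
    {"name": "ann", "time": "morning", "location": "indoor"},
    {"name": "bob", "time": "morning", "location": "outdoor"},
    {"name": "cat", "time": "morning", "location": "indoor"},
]
print(attributes_by_unique_values(kb))  # ['time', 'location', 'name']
```

Here `time` has a single value, so mentioning it first ("all my friends like morning") lets the partner rule out many items at once.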

Dynamic Knowledge Graph Network
The diverse semantics in our data motivates us to combine unstructured representation of the dialogue history with structured knowledge. Our model consists of three components shown in Figure 3: (i) a dynamic knowledge graph, which represents the agent's private KB and shared dialogue history as a graph (Section 3.1), (ii) a graph embedding over the nodes (Section 3.2), and (iii) an utterance generator (Section 3.3).

Figure 3: Overview of our approach. First, the KB and dialogue history (entities in bold) are mapped to a graph. Here, an item node is labeled by the item ID and an attribute node is labeled by the attribute's first letter. Next, each node is embedded using relevant utterance embeddings through message passing. Finally, an LSTM generates the next utterance based on attention over the node embeddings.
The knowledge graph represents entities and relations in the agent's private KB, e.g., item-1's company is google. As the conversation unfolds, utterances are embedded and incorporated into node embeddings of mentioned entities. For instance, in Figure 3, "anyone went to columbia" updates the embedding of columbia. Next, each node recursively passes its embedding to neighboring nodes so that related entities (e.g., those in the same row or column) also receive information from the most recent utterance. In our example, jessica and josh both receive new context when columbia is mentioned. Finally, the utterance generator, an LSTM, produces the next utterance by attending to the node embeddings.

Knowledge Graph
Given a dialogue of $T$ utterances, we construct graphs $(G_t)_{t=1}^{T}$ over the KB and dialogue history for agent A (footnote 6). There are three types of nodes: item nodes, attribute nodes, and entity nodes. Edges between nodes represent their relations. For example, (item-1, hasSchool, columbia) means that the first item has attribute school whose value is columbia. An example graph is shown in Figure 3. The graph $G_t$ is updated based on utterance $t$ by taking $G_{t-1}$ and adding a new node for any entity mentioned in utterance $t$ but not in $KB_A$.

Footnote 6: It is important to differentiate perspectives of the two agents as they have different KBs. Thereafter we assume the perspective of agent A, i.e., accessing $KB_A$ for A only, and refer to B as the partner.
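The graph construction and its per-utterance update can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `build_initial_graph` and `update_graph`, the `has<Attribute>` edge-label convention, and the choice to record attribute nodes without wiring them to other nodes are not specified in this section and are made up for the example.

```python
from typing import Dict, List, Set, Tuple

# (item node, relation label, entity node), e.g. ("item-1", "hasSchool", "columbia")
Triple = Tuple[str, str, str]


def build_initial_graph(kb: List[Dict[str, str]]) -> Tuple[Set[str], Set[Triple]]:
    """Create item, attribute, and entity nodes plus labeled edges from a KB."""
    nodes: Set[str] = set()
    edges: Set[Triple] = set()
    for index, item in enumerate(kb, start=1):
        item_node = f"item-{index}"
        nodes.add(item_node)
        for attribute, value in item.items():
            # Attribute nodes are only recorded here (an assumption of the sketch).
            nodes.update({attribute, value})
            edges.add((item_node, "has" + attribute.capitalize(), value))
    return nodes, edges


def update_graph(nodes: Set[str], mentioned_entities: List[str]) -> Set[str]:
    """Return the nodes of G_t: those of G_{t-1} plus newly mentioned entities."""
    return nodes | set(mentioned_entities)


nodes, edges = build_initial_graph([{"name": "jessica", "school": "columbia"}])
# The partner mentions "stanford", which is not in the agent's own KB.
nodes_next = update_graph(nodes, ["columbia", "stanford"])
print(sorted(nodes_next - nodes))
```

Only entities that are absent from the current graph create new nodes, so repeated mentions of `columbia` leave the node set unchanged.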

Graph Embedding
Given a knowledge graph, we are interested in computing a vector representation for each node $v$ that captures both its unstructured context from the dialogue history and its structured context in the KB. A node embedding $V_t(v)$ for each node $v \in G_t$ is built from three parts: structural properties of an entity defined by the KB, embeddings of utterances in the dialogue history, and message passing between neighboring nodes.
Node Features. Simple structural properties of the KB often govern what is talked about; e.g., a high-frequency entity is usually interesting to mention (consider "All my friends like dancing.").
We represent this type of information as a feature vector $F_t(v)$, which includes the degree and type (item, attribute, or entity type) of node $v$, and whether it has been mentioned in the current turn. Each feature is encoded as a one-hot vector and they are concatenated to form $F_t(v)$.
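A minimal sketch of such a feature vector is shown below: three one-hot blocks for degree, node type, and the mentioned flag are concatenated. The degree cap `MAX_DEGREE` and the particular list of `NODE_TYPES` are assumptions chosen for the example, not values from the paper.

```python
from typing import List

# Assumed inventory of node types and an assumed cap on the degree bucket.
NODE_TYPES = ["item", "attribute", "name", "school", "company"]
MAX_DEGREE = 12


def one_hot(index: int, size: int) -> List[int]:
    """Return a vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def node_features(degree: int, node_type: str, mentioned: bool) -> List[int]:
    """Concatenate one-hot encodings of degree, type, and the mentioned flag."""
    degree_vec = one_hot(min(degree, MAX_DEGREE), MAX_DEGREE + 1)
    type_vec = one_hot(NODE_TYPES.index(node_type), len(NODE_TYPES))
    mention_vec = one_hot(int(mentioned), 2)
    return degree_vec + type_vec + mention_vec


features = node_features(degree=3, node_type="school", mentioned=True)
print(len(features), sum(features))
```

Because every block is one-hot, the vector always contains exactly three ones regardless of the node.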
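Mention Vectors. A mention vector $M_t(v)$ contains unstructured context from utterances relevant to node $v$ up to turn $t$. To compute it, we first define the utterance representation $\tilde{u}_t$ and the set of relevant entities $E_t$. Let $u_t$ be the embedding of utterance $t$ (Section 3.3). To differentiate between the agent's and the partner's utterances, we represent it as

$$\tilde{u}_t = \left[ u_t \cdot \mathbb{1}\{u_t \in U_{\text{self}}\},\; u_t \cdot \mathbb{1}\{u_t \in U_{\text{partner}}\} \right],$$

where $U_{\text{self}}$ and $U_{\text{partner}}$ denote sets of utterances generated by the agent and the partner, and $[\cdot, \cdot]$ denotes concatenation. Let $E_t$ be the set of entity nodes mentioned in utterance $t$ if utterance $t$ mentions some entities, or utterance $t-1$ otherwise (footnote 8). The mention vector $M_t(v)$ of node $v$ incorporates the current utterance if the node is mentioned and inherits $M_{t-1}(v)$ otherwise:

$$M_t(v) = \lambda_t M_{t-1}(v) + (1 - \lambda_t)\,\tilde{u}_t, \qquad \lambda_t = \begin{cases} \sigma\left(W^{\text{inc}} \left[M_{t-1}(v), \tilde{u}_t\right]\right) & \text{if } v \in E_t, \\ 1 & \text{otherwise.} \end{cases} \tag{1}$$

Here, $\sigma$ is the sigmoid function and $W^{\text{inc}}$ is a parameter matrix.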
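The gated update of a mention vector can be illustrated with a short pure-Python sketch. It assumes a scalar gate computed from a single weight vector `w_inc` (the shape of the incorporation parameters is not stated in the text), and it tags the speaker by placing the utterance embedding in one half of a concatenated vector with zeros in the other half. The function names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tag_speaker(u: List[float], from_self: bool) -> List[float]:
    """Place u in the 'self' half or the 'partner' half; the other half is zero."""
    zeros = [0.0] * len(u)
    return u + zeros if from_self else zeros + u


def update_mention_vector(m_prev: List[float], u_tilde: List[float],
                          w_inc: List[float], is_mentioned: bool) -> List[float]:
    """Blend the old mention vector with the new utterance if the node is mentioned."""
    if not is_mentioned:
        return list(m_prev)  # gate fixed at 1: inherit the previous vector
    gate_input = m_prev + u_tilde  # concatenation [M_{t-1}(v), u~_t]
    # Assumption: a scalar gate from a weight vector rather than a full matrix.
    lam = sigmoid(sum(w * x for w, x in zip(w_inc, gate_input)))
    return [lam * m + (1.0 - lam) * u for m, u in zip(m_prev, u_tilde)]


m_prev = [0.0, 0.0, 0.0, 0.0]
u_tilde = tag_speaker([1.0, 2.0], from_self=False)
# Zero weights give a gate of exactly 0.5, i.e. an equal blend of old and new.
m_new = update_mention_vector(m_prev, u_tilde, [0.0] * 8, is_mentioned=True)
print(m_new)  # [0.0, 0.0, 0.5, 1.0]
```

Nodes that are not in the set of relevant entities for the turn keep their previous mention vector unchanged.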
Recursive Node Embeddings. We propagate information between nodes according to the structure of the knowledge graph. In Figure 3, given "anyone went to columbia?", the agent should focus on her friends who went to Columbia University. Therefore, we want this utterance to be sent to item nodes connected to columbia, and one step further to other attributes of these items because they might be mentioned next as relevant information, e.g., jessica and josh. We compute the node embeddings recursively, analogous to belief propagation:

$$V_t^k(v) = \max_{v' \in N_t(v)} \tanh\left( W^{\text{mp}} \left[ V_t^{k-1}(v'), R(e_{v \to v'}) \right] \right),$$

where $V_t^k(v)$ is the depth-$k$ node embedding at turn $t$ and $N_t(v)$ denotes the set of nodes adjacent to $v$. The message from a neighboring node $v'$ depends on its embedding at depth $k-1$, the edge label $e_{v \to v'}$ (embedded by a relation embedding function $R$), and a parameter matrix $W^{\text{mp}}$. Messages from all neighbors are aggregated by max, the element-wise max operation (footnote 9). Example message passing paths are shown in Figure 3.
The final node embedding is the concatenation of embeddings at each depth:

$$V_t(v) = \left[ V_t^0(v), \ldots, V_t^K(v) \right],$$

where $K$ is a hyperparameter (we experiment with

Footnote 8: Relying on utterance $t-1$ is useful when utterance $t$ answers a question, e.g., "do you have any google friends?" "No."

Footnote 9: Using sum or mean slightly hurts performance.
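The recursive computation and the final concatenation across depths can be sketched in pure Python as follows. Several details are assumptions made for the example: a tanh nonlinearity is applied to each message, the message dimension equals the base embedding dimension, isolated nodes receive a zero vector, and the tiny matrices are toy values. The function name `message_passing` is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def matvec(matrix: List[Vector], x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def message_passing(base: Dict[str, Vector],
                    neighbors: Dict[str, List[Tuple[str, str]]],
                    relation_emb: Dict[str, Vector],
                    w_mp: List[Vector],
                    depth: int) -> Dict[str, Vector]:
    """Return, for every node, the concatenation of its depth-0..depth embeddings."""
    layers = [base]
    for _ in range(depth):
        prev = layers[-1]
        current: Dict[str, Vector] = {}
        for v in base:
            # One message per neighbor: tanh(W_mp [prev embedding, relation embedding]).
            messages = [
                [math.tanh(z) for z in matvec(w_mp, prev[n] + relation_emb[label])]
                for n, label in neighbors.get(v, [])
            ]
            if messages:
                # Element-wise max over all incoming messages.
                current[v] = [max(values) for values in zip(*messages)]
            else:
                current[v] = [0.0] * len(w_mp)  # assumption for isolated nodes
        layers.append(current)
    return {v: [x for layer in layers for x in layer[v]] for v in base}


base = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
neighbors = {"a": [("b", "r")], "b": [("a", "r")]}
relation_emb = {"r": [0.0]}
w_mp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # copies the node part, ignores the relation
out = message_passing(base, neighbors, relation_emb, w_mp, depth=2)
print(len(out["a"]))  # 6 = 3 depths x 2 dimensions
```

With two rounds of propagation, information placed on a node reaches nodes two edges away, which matches the intuition of sending an utterance to connected items and then to their other attributes.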

Utterance Embedding and Generation
We embed and generate utterances using Long Short-Term Memory (LSTM) networks that take the graph embeddings into account.
Embedding. On turn $t$, upon receiving an utterance consisting of $n_t$ tokens, $x_t = (x_{t,1}, \ldots, x_{t,n_t})$, the LSTM maps it to a vector as follows:

$$h_{t,j} = \text{LSTM}_{\text{enc}}\left(h_{t,j-1}, A_t(x_{t,j})\right),$$

where $h_{t,0} = h_{t-1,n_{t-1}}$, and $A_t$ is an entity abstraction function, explained below. The final hidden state $h_{t,n_t}$ is used as the utterance embedding $u_t$, which updates the mention vectors as described in Section 3.2.

In our dialogue task, the identity of an entity is unimportant. For example, replacing google with alphabet in Figure 1 should make little difference to the conversation. The role of an entity is determined instead by its relation to other entities and relevant utterances. Therefore, we define the abstraction $A_t(y)$ for a word $y$ as follows: if $y$ is linked to an entity $v$, then we represent the entity by its type (school, company, etc.) embedding concatenated with its current node embedding:

$$A_t(y) = \left[E_{\text{type}(y)}, V_t(v)\right],$$

where $E_{\text{type}(y)}$ denotes the type embedding. The representation of an entity is thus determined only by its structural features and its context. If $y$ is a non-entity, then $A_t(y)$ is the word embedding of $y$ concatenated with a zero vector of the same dimensionality as $V_t(v)$. This way, the representation of an entity only depends on its structural properties given by the KB and the dialogue context, which enables the model to generalize to unseen entities at test time.
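The entity abstraction described above can be sketched as a small lookup function. The sketch assumes that type embeddings and word embeddings share the same dimensionality so that every token yields an input of equal length; the function name `abstract_token` and the toy embedding tables are hypothetical.

```python
from typing import Dict, List, Optional

Vector = List[float]


def abstract_token(token: str,
                   entity_type: Optional[str],
                   type_emb: Dict[str, Vector],
                   word_emb: Dict[str, Vector],
                   node_emb: Dict[str, Vector]) -> Vector:
    """Entity tokens: [type embedding, node embedding]; other tokens: [word embedding, zeros]."""
    if entity_type is not None:
        return type_emb[entity_type] + node_emb[token]
    node_dim = len(next(iter(node_emb.values())))
    return word_emb[token] + [0.0] * node_dim


type_emb = {"company": [1.0, 0.0]}
word_emb = {"at": [0.5, 0.5]}
# Two different companies that happen to have the same structural and dialogue context.
node_emb = {"google": [0.1, 0.2, 0.3], "alphabet": [0.1, 0.2, 0.3]}

google_input = abstract_token("google", "company", type_emb, word_emb, node_emb)
alphabet_input = abstract_token("alphabet", "company", type_emb, word_emb, node_emb)
word_input = abstract_token("at", None, type_emb, word_emb, node_emb)
print(google_input == alphabet_input)  # True
```

Since no entity-specific lookup vector is involved, two entities with the same type and the same node embedding are indistinguishable to the LSTM, which is what allows unseen entities to be handled at test time.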
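Generation. Now, assuming we have embedded utterance $x_{t-1}$ into $h_{t-1,n_{t-1}}$ as described above, we use another LSTM to generate utterance $x_t$. Formally, we carry over the last utterance embedding $h_{t,0} = h_{t-1,n_{t-1}}$ and define:

$$h_{t,j} = \text{LSTM}_{\text{dec}}\left(h_{t,j-1}, \left[A_t(x_{t,j}), c_{t,j}\right]\right),$$

where $c_{t,j}$ is a weighted sum of node embeddings in the current turn:

$$c_{t,j} = \sum_{v \in G_t} \alpha_{t,j,v} V_t(v),$$

where $\alpha_{t,j,v}$ are the attention weights over the nodes. Intuitively, high weight should be given to relevant entity nodes as shown in Figure 3. We compute the weights through a standard attention mechanism (Bahdanau et al., 2015):

$$\alpha_{t,j} = \text{softmax}(s_{t,j}), \qquad s_{t,j,v} = w^{\text{attn}} \cdot \tanh\left(W^{\text{attn}} \left[h_{t,j-1}, V_t(v)\right]\right),$$

where vector $w^{\text{attn}}$ and $W^{\text{attn}}$ are parameters.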
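Finally, we define a distribution over both words in the vocabulary and nodes in $G_t$ using the copying mechanism of Jia and Liang (2016):

$$p(x_{t,j+1} = y \mid G_t, x_{t,\le j}) \propto \exp\left(W^{\text{vocab}} h_{t,j} + b\right),$$

$$p(x_{t,j+1} = r(v) \mid G_t, x_{t,\le j}) \propto \exp\left(s_{t,j,v}\right),$$

where $y$ is a word in the vocabulary, $W^{\text{vocab}}$ and $b$ are parameters, $s_{t,j,v}$ is the attention score of node $v$, and $r(v)$ is the realization of the entity represented by node $v$, e.g., google is realized to "Google" during copying.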
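The attention-weighted context and the joint word/copy distribution can be sketched as below. The sketch takes the node scores and vocabulary logits as given and assumes that both are normalized together in a single softmax, which is one natural reading of a distribution defined over both words and nodes. All function names and numbers are illustrative.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(node_scores: List[float],
                      node_embeddings: List[List[float]]) -> List[float]:
    """Weighted sum of node embeddings, with weights from a softmax over the scores."""
    weights = softmax(node_scores)
    dim = len(node_embeddings[0])
    return [sum(w * emb[i] for w, emb in zip(weights, node_embeddings))
            for i in range(dim)]


def output_distribution(vocab_logits: List[float],
                        node_scores: List[float]) -> Tuple[List[float], List[float]]:
    """Jointly normalize vocabulary logits and node (copy) scores."""
    probs = softmax(vocab_logits + node_scores)
    split = len(vocab_logits)
    return probs[:split], probs[split:]


context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
word_probs, copy_probs = output_distribution([0.0, 0.0], [0.0, 0.0])
print(context, word_probs, copy_probs)
```

Sharing one normalizer means that a node with a high attention score competes directly with vocabulary words, so a strongly attended entity is likely to be copied into the utterance.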

Experiments
We compare our model with a rule-based system and a baseline neural model. Both automatic and human evaluations are conducted to test the models in terms of fluency, correctness, cooperation, and human-likeness. The results show that DynoNet is able to converse with humans in a coherent and strategic way.

Setup
We randomly split the data into train, dev, and test sets (8:1:1). We use a one-layer LSTM with 100 hidden units and 100-dimensional word vectors for both the encoder and the decoder (Section 3.3). Each successful dialogue is turned into two examples, each from the perspective of one of the two agents. We maximize the log-likelihood of all utterances in the dialogues. The parameters are optimized by AdaGrad (Duchi et al., 2010) with an initial learning rate of 0.5. We trained for at least 10 epochs; after that, training stops if there is no improvement on the dev set for 5 epochs. By default, we perform $K = 2$ iterations of message passing to compute node embeddings (Section 3.2). For decoding, we sequentially sample from the output distribution with a softmax temperature of 0.5. Hyperparameters are tuned on the dev set.
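Sampling with a softmax temperature can be sketched as follows. The sketch uses the common convention of dividing the logits by the temperature before normalizing, so a temperature of 0.5 sharpens the distribution; the function name and the toy logits are assumptions for illustration.

```python
import math
import random
from typing import List, Optional


def sample_with_temperature(logits: List[float],
                            temperature: float = 0.5,
                            rng: Optional[random.Random] = None) -> int:
    """Sample an index from softmax(logits / temperature)."""
    rng = rng or random.Random()
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Unnormalized weights are enough: random.choices normalizes internally.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_with_temperature([0.0, 100.0], 0.5, rng) for _ in range(20)]
print(set(samples))
```

Lower temperatures concentrate probability mass on the highest-scoring tokens, trading some diversity for more reliable output.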
We compare DynoNet with its static cousin (StanoNet) and a rule-based system (Rule). StanoNet uses $G_0$ throughout the dialogue, thus the dialogue history is completely contained in the LSTM states instead of being injected into the knowledge graph. Rule maintains weights for each entity and each item in the KB to decide what to talk about and which item to select. It has a pattern-matching semantic parser, a rule-based policy, and a templated generator. See Appendix G for details.

Evaluation
We test our systems in two interactive settings: bot-bot chat and bot-human chat. We perform both automatic evaluation and human evaluation.
Automatic Evaluation. First, we compute the cross-entropy of a model on test data. As shown in Table 4, DynoNet has the lowest test loss. Next, we have a model chat with itself on the scenarios from the test set. We evaluate the chats with respect to language variation, effectiveness, and strategy.
For language variation, we report the average utterance length $L_u$ and the unigram entropy $H$ in Table 4. Compared to Rule, the neural models tend to generate shorter utterances (Li et al., 2016b; Serban et al., 2017b). However, they are more diverse; for example, questions are asked in multiple ways such as "Do you have ...", "Any friends like ...", "What about ...".
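The two language-variation statistics can be computed with a few lines of Python. The sketch assumes whitespace tokenization and a base-2 logarithm for the entropy; neither choice is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def average_length(utterances: List[str]) -> float:
    """Mean number of whitespace-separated tokens per utterance."""
    return sum(len(u.split()) for u in utterances) / len(utterances)


def unigram_entropy(utterances: List[str]) -> float:
    """Entropy (in bits) of the empirical unigram distribution."""
    counts = Counter(token for u in utterances for token in u.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


utterances = ["a b", "a b"]
print(average_length(utterances), unigram_entropy(utterances))  # 2.0 1.0
```

A higher entropy indicates that the probability mass is spread over more distinct words, which is the sense in which the neural models are described as more diverse.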
At the discourse level, we expect the distribution of a bot's utterance types to match that of humans. We show percentages of each utterance type in Table 4. For Rule, the decision about which action to take is written in the rules, while StanoNet and DynoNet learned to behave in a more human-like way, frequently informing and asking questions.
To measure effectiveness, we compute the overall success rate ($C$) and the success rate per turn ($C_T$) and per selection ($C_S$). As shown in Table 4, humans are the best at this game, followed by Rule, which is comparable to DynoNet.
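One plausible reading of the three effectiveness measures is sketched below: the overall rate divides successes by dialogues, while the per-turn and per-selection rates divide successes by the total number of turns and of selection attempts. These exact definitions are an assumption, and `DialogueLog` and `success_rates` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DialogueLog:
    success: bool
    num_turns: int
    num_selections: int


def success_rates(logs: List[DialogueLog]) -> Tuple[float, float, float]:
    """Return (overall, per-turn, per-selection) success rates."""
    successes = sum(1 for log in logs if log.success)
    overall = successes / len(logs)
    per_turn = successes / sum(log.num_turns for log in logs)
    per_selection = successes / sum(log.num_selections for log in logs)
    return overall, per_turn, per_selection


logs = [DialogueLog(True, 10, 2), DialogueLog(False, 10, 3)]
c, c_t, c_s = success_rates(logs)
print(c, c_t, c_s)  # 0.5 0.05 0.2
```

Normalizing by turns and selections rewards agents that reach the shared item with less talk and fewer wrong guesses.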
Next, we investigate the strategies leading to these results. An agent needs to decide which entity/attribute to check first to quickly reduce the search space. We hypothesize that humans tend to first focus on a majority entity and an attribute with fewer unique values (Section 2.3). For example, in the scenario in Table 6, time and location are likely to be mentioned first. We show the average frequency of first-mentioned entities (#Ent$_1$) and the average number of unique values for first-mentioned attributes (|Attr$_1$|) in Table 4. Both DynoNet and StanoNet successfully match humans' starting strategy by favoring entities of higher frequency and attributes of smaller domain size.

Table 4: Automatic evaluation on human-human and bot-bot chats on test scenarios. We use ↑ / ↓ to indicate that higher / lower values are better; otherwise the objective is to match humans' statistics. Best results (except Human) are in bold. Neural models generate shorter (lower $L_u$) but more diverse (higher $H$) utterances. Overall, their distributions of utterance types match those of the humans. (We only show the most frequent speech acts, therefore the numbers do not sum to 1.) Rule is effective in completing the task (higher $C_S$), but it is not information-efficient given the large number of attributes (#Attr) and entities (#Ent) mentioned.
To examine the overall strategy, we show the average number of attributes (#Attr) and entities (#Ent) mentioned during the conversation in Table 4. Humans and DynoNet strategically focus on a few attributes and entities, whereas Rule needs almost twice as many entities to achieve similar success rates. This suggests that the effectiveness of Rule mainly comes from large amounts of unselective information, which is consistent with comments from their human partners.
Partner Evaluation. We generated 200 new scenarios and put up the bots on AMT using the same chat interface that was used for data collection. The bots follow simple turn-taking rules explained in Appendix H. Each AMT worker is randomly paired with Rule, StanoNet, DynoNet, or another human (but the worker doesn't know which), and we make sure that all four types of agents are tested in each scenario at least once. At the end of each dialogue, humans are asked to rate their partner in terms of fluency, correctness, cooperation, and human-likeness from 1 (very bad) to 5 (very good), along with optional comments.
We show the average ratings (with significance tests) in Table 5 and the histograms in Appendix J. In terms of fluency, the models have similar performance since the utterances are usually short. Judgment on correctness is a mere guess since the evaluator cannot see the partner's KB; we will analyze correctness more meaningfully in the third-party evaluation below.
Noticeably, DynoNet is more cooperative than the other models. As shown in the example dialogues in Table 6, DynoNet cooperates smoothly with the human partner, e.g., replying with relevant information about morning/indoor friends when the partner mentioned that all her friends prefer morning and most like indoor. StanoNet starts well but doesn't follow up on the morning friend, presumably because the morning node is not updated dynamically when mentioned by the partner. Rule follows the partner poorly. In the comments, the biggest complaint about Rule was that it was not 'listening' or 'understanding'. Overall, DynoNet achieves better partner satisfaction, especially in cooperation.
Third-party Evaluation. We also created a third-party evaluation task, where an independent AMT worker is shown a conversation and the KB of one of the agents; she is asked to rate the same aspects of the agent as in the partner evaluation and provide justifications. Each agent in a dialogue is rated by at least 5 people.
The average ratings and histograms are shown in Table 5 and Appendix J. For correctness, we see that Rule has the best performance since it always tells the truth, whereas humans can make mistakes due to carelessness and the neural models can generate false information. For example, in Table 6, DynoNet 'lied' when saying that it has a morning friend who likes outdoor.
Surprisingly, there is a discrepancy between the two evaluation modes in terms of cooperation and human-likeness. Manual analysis of the comments indicates that third-party evaluators focus less on the dialogue strategy and more on linguistic features, probably because they were not fully engaged in the dialogue. For example, justification for cooperation often mentions frequent questions and timely answers, while less attention is paid to what is asked about. For human-likeness, partner evaluation is largely correlated with coherence (e.g., not repeating or ignoring past information) and task success, whereas third-party evaluators often rely on informality (e.g., usage of colloquialisms like "hiya", capitalization, and abbreviation) or intuition. Interestingly, third-party evaluators noted most phenomena listed in Table 2 as indicators of human beings, e.g., correcting oneself and making chit-chat rather than simply finishing the task. See example comments in Appendix K.

Table 5: We report the average ratings of each system. For third-party evaluation, we first take the mean of each question and then average the ratings. DynoNet has the best partner satisfaction in terms of fluency (Flnt), correctness (Crct), cooperation (Coop), and human-likeness (Human). The superscript of a result indicates that its advantage over other systems (r: Rule, s: StanoNet, d: DynoNet) is statistically significant with p < 0.05 given by paired t-tests.

Ablation Studies
Our model has two novel designs: entity abstraction and message passing for node embeddings. Table 7 shows what happens if we ablate these. When the number of message passing iterations, $K$, is reduced from 2 to 0, the loss consistently increases. Removing entity abstraction, which means adding entity embeddings to node embeddings and the LSTM input embeddings, also degrades performance. This shows that DynoNet benefits from contextually-defined, structural node embeddings rather than ones based on a classic lookup table.

Discussion and Related Work
There has been a recent surge of interest in end-to-end task-oriented dialogue systems, though progress has been limited by the size of available datasets (Serban et al., 2015a). Most work focuses on information-querying tasks, using Wizard-of-Oz data collection (Williams et al., 2016; Asri et al., 2016) or simulators (Li et al., 2016d). In contrast, collaborative dialogues are easy to collect as natural human conversations, and are also challenging enough given the large number of scenarios and diverse conversation phenomena. There are some interesting strategic dialogue datasets, such as Settlers of Catan (Afantenos et al., 2012) (2K turns) and the Cards corpus (Potts, 2012) (1.3K dialogues), as well as work on dialogue strategies (Keizer et al., 2017; Vogel et al., 2013), though no full dialogue system has been built for these datasets.
Most task-oriented dialogue systems follow the POMDP-based approach (Williams and Young, 2007; Young et al., 2013). Despite their success (Wen et al., 2017; Dhingra et al., 2017; Su et al., 2016), the requirement for handcrafted slots limits their scalability to new domains and burdens data collection with extra state labeling. To go past this limit, a Memory-Networks-based approach without domain-specific features has been proposed. However, the memory is unstructured and interfacing with KBs relies on API calls, whereas our model embeds both the dialogue history and the KB structurally. Williams et al. (2017) use an LSTM to automatically infer the dialogue state, but as they focus on dialogue control rather than the full problem, the response is modeled as a templated action, which restricts the generation of richer utterances. Our network architecture is most similar to EntNet (Henaff et al., 2017), where memories are also updated by input sentences recurrently. The main difference is that our model allows information to be propagated between structured entities, which is shown to be crucial in our setting (Section 4.3).
Our work is also related to language generation conditioned on knowledge bases (Mei et al., 2016; Kiddon et al., 2016). One challenge here is to avoid generating false or contradicting statements, which is currently a weakness of neural models. Our model is mostly accurate when generating facts and answering existence questions about a single entity, but will need a more advanced attention mechanism for generating utterances involving multiple entities, e.g., attending to items or attributes first, then selecting entities; generating high-level concepts before composing them to natural tokens (Serban et al., 2017a).

Table 7: Ablations of our model on the dev set show the importance of entity abstraction and message passing ($K = 2$).
In conclusion, we believe the symmetric collaborative dialogue setting and our dataset provide unique opportunities at the interface of traditional task-oriented dialogue and open-domain chat. We also offered DynoNet as a promising means for open-ended dialogue state representation. Our dataset facilitates the study of pragmatics and human strategies in dialogue, a good stepping stone towards learning more complex dialogues such as negotiation.

DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462. Mike Kayser worked on an early version of the project while he was at Stanford. We also thank members of the Stanford NLP group for insightful discussions.
Reproducibility. All code, data, and experiments for this paper are available on the CodaLab platform: https://worksheets.codalab.org/worksheets/0xc757f29f5c794e5eb7bfa8ca9c945573.

each dialogue participant. We instruct people to play intelligently, to refrain from brute-force tactics (e.g., mentioning every attribute value), and to use grammatical sentences. To discourage random guessing, we prevent users from selecting a friend (item) more than once every 10 seconds. Each worker was paid $0.35 for a successful dialogue within a 5-minute time limit. We log each utterance in the dialogue along with timing information.

D Entity Linking and Realization
We use a rule-based lexicon to link text spans to entities. For every entity in the schema, we compute different variations of its canonical name, including acronyms, strings with a certain edit distance, prefixes, and morphological variants. Given a text span, a set of candidate entities is returned by string matching. A heuristic ranker then scores each candidate (e.g., considering whether the span is a substring of a candidate, the edit distance between the span and a candidate, etc.). The highest-scoring candidate is returned.
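A much-simplified version of this linking procedure is sketched below. It is not the lexicon used for the dataset: the sketch generates only two kinds of variants (an acronym and the first word), and it substitutes the similarity ratio from Python's `difflib.SequenceMatcher` for an edit-distance-based ranker. The threshold of 0.8 and the function names are assumptions.

```python
import difflib
from typing import List, Optional, Set


def variants(name: str) -> Set[str]:
    """Canonical name plus an acronym and a first-word variant."""
    words = name.split()
    forms = {name, words[0]}
    if len(words) > 1:
        forms.add("".join(word[0] for word in words))
    return forms


def link(span: str, entities: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the best-matching entity for a text span, or None if nothing is close."""
    span = span.lower()
    best, best_score = None, 0.0
    for entity in entities:
        for form in variants(entity):
            # Similarity ratio in [0, 1]; a stand-in for an edit-distance score.
            score = difflib.SequenceMatcher(None, span, form).ratio()
            if score > best_score:
                best, best_score = entity, score
    return best if best_score >= threshold else None


entities = ["columbia university", "google"]
print(link("columbia", entities), link("gogle", entities), link("xyz", entities))
```

The threshold keeps unrelated spans from being linked, while near-misses such as the misspelling `gogle` still resolve to the intended entity.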
A linked entity is considered as a single token and its surface form is ignored in all models. At generation time, we realize an entity by sampling from the empirical distribution of its surface forms in the training set.
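Realization by sampling from the empirical distribution of surface forms can be sketched as follows. The counts in `surface_counts` are made up for illustration, and `realize` is a hypothetical helper name.

```python
import random
from collections import Counter
from typing import Dict, Optional


def realize(entity: str,
            surface_counts: Dict[str, Counter],
            rng: Optional[random.Random] = None) -> str:
    """Sample a surface form in proportion to its training-set frequency."""
    rng = rng or random.Random()
    forms = surface_counts[entity]
    return rng.choices(list(forms.keys()), weights=list(forms.values()), k=1)[0]


# Made-up counts of how often each surface form was observed for an entity.
surface_counts = {
    "google": Counter({"Google": 3, "google": 1}),
    "columbia": Counter({"columbia": 5}),
}
print(realize("google", surface_counts, random.Random(0)))
```

Sampling rather than always choosing the most frequent form preserves the casing and spelling variation that human writers show in the training data.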

E Utterance Categorization
We categorize utterances into inform, ask, answer, greeting, apology heuristically by pattern matching.
• An ask utterance asks for information regarding the partner's KB. We detect these utterances by checking for the presence of a '?' and/or a question word like "do", "does", "what", etc.
• An inform utterance provides information about the agent's KB. We define it as an utterance that mentions entities in the KB and is not an ask utterance.
• An answer utterance simply provides a positive/negative response to a question, containing words like "yes", "no", "nope", etc.
• A greeting utterance contains words like "hi" or "hello"; it often occurs at the beginning of a dialogue.
• An apology utterance contains the word "sorry", which is typically associated with corrections and wrong selections.
See Table 2 and Table 1 for examples of these utterance types.
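The heuristics listed in this appendix can be sketched as a small pattern-matching function that returns a set of types, so that multi-type utterances are possible. The word lists contain only the examples named above and are therefore incomplete; the function name `categorize` and the substring test for KB entities are assumptions of the sketch.

```python
import re
from typing import Set

# Only the example words given in the text; the full lists are not specified.
QUESTION_WORDS = {"do", "does", "what"}
ANSWER_WORDS = {"yes", "no", "nope"}
GREETING_WORDS = {"hi", "hello"}


def categorize(utterance: str, kb_entities: Set[str]) -> Set[str]:
    """Assign coarse types to an utterance by simple pattern matching."""
    text = utterance.lower()
    tokens = set(re.findall(r"[a-z']+", text))
    types: Set[str] = set()
    if "?" in text or tokens & QUESTION_WORDS:
        types.add("ask")
    elif any(entity in text for entity in kb_entities):
        # Inform: mentions a KB entity and is not an ask utterance.
        types.add("inform")
    if tokens & ANSWER_WORDS:
        types.add("answer")
    if tokens & GREETING_WORDS:
        types.add("greeting")
    if "sorry" in tokens:
        types.add("apology")
    return types


print(categorize("do you have anyone who went to columbia?", {"columbia"}))
print(categorize("hi, i have two google friends", {"google"}))
```

Such keyword rules are coarse: a statement like "i do not know anyone at google" would be tagged as ask because it contains "do", which is a limitation of this style of heuristic.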

F Strategy
During scenario generation, we varied the number of attributes, the number of items in each KB, and the distribution of values for each attribute. We find that as the number of items and/or attributes grows, the dialogue length and the completion time also increase, indicating that the task becomes harder. We also anticipated that varying the value of α would impact the overall strategy (for example, the order in which attributes are mentioned) since α controls the skewness of the distribution of values for an attribute.
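To relate α to strategy, the attributes of a scenario can be grouped by the rank of their α values. The sketch below splits the ranking into three groups, assuming that a higher α corresponds to a more uniform value distribution and that attributes are divided as evenly as possible; the group boundaries for four attributes, the function name, and the example α values are assumptions.

```python
from typing import Dict

GROUPS = ["least uniform", "medium", "most uniform"]


def distribution_groups(alphas: Dict[str, float]) -> Dict[str, str]:
    """Map each attribute to a group according to the rank of its alpha value."""
    ranked = sorted(alphas, key=alphas.get)  # lowest alpha (most skewed) first
    n = len(ranked)
    return {attr: GROUPS[min(2, rank * 3 // n)] for rank, attr in enumerate(ranked)}


groups = distribution_groups({"name": 1.0, "school": 0.3, "time": 2.0})
print(groups)
```

Counting how often the first-mentioned attribute of a dialogue falls into each group then gives a histogram over the three groups.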
On examining the data, we find that humans tend to first mention attributes with a more skewed (i.e., less uniform) distribution of values. Specifically, we rank the α values of all attributes in a scenario (see step 3 in Section B), and bin them into 3 distribution groups (least uniform, medium, and most uniform) according to the ranking, where higher α values correspond to more uniform distributions. In Figure 4, we plot the histogram of the distribution group of the first-mentioned attribute in a dialogue, which shows that skewed attributes are mentioned much more frequently.

Table 9: Comparison of ratings and comments on human-likeness from partners and third-party evaluators. Each row contains results for the same dialogue. For the partner evaluation, we ask the human partner to provide a single, optional comment at the end of the conversation. For the third-party evaluation, we ask five Turkers to rate each dialogue and report the mean score; they must provide justification for ratings in each aspect. From the comments, we see that dialogue partners focus more on cooperation and effectiveness, whereas third-party evaluators focus more on linguistic features such as verbosity and informality.

Human-likeness
-you have any friends who went to monmouth?
-The flow was nice and they were able to discern the correct answers.
-human like because of interaction talking
-Answers are human like, not robotic. Uses "hiya" to begin conversation, more of a warm tone.
-more human than computer
Agent 2: hiya
Agent 1: Hey

Rule
Partner rating: 2
Partner comment: Didn't listen to me
Third-party rating: 4
Third-party justifications:
-agent 2 looked human to me
-definitely human
-A2 could be replaced with a robot without noticeable difference.
-They spoke and behaved as I or any human would in this situation.
-The agent just seems to be going through the motions, which gives me the idea that the agent doesn't exbit humanlike characteristics.

StanoNet
Partner rating: 5
Partner comment: Took forever and didn't really respond correctly to questions.
Third-party rating: 3.5
Third-party justifications:
-No djarum
-This doesn't make sense in this context, so doesn't seem to be written by a human.
-human like because of slight mispellingss
-Can tell they are likely human but just not very verbose
-Their terse conversion leans to thinking they were either not paying attention or not human.
-The short vague sentences are very human like mistakes.

DynoNet
Partner rating: 4
Partner comment: I replied twice that I only had indoor friends and was ignored.
Third-party rating: 3.8
Third-party justifications:
-Agent 1 is very human like based on the way they typed and the fact that they were being deceiving.
-Pretty responsive and logical progression, but it's very stilted sounding
-i donot have a jose
-Agent gives normal human responses, "no angela i don't"
-agent 1 was looking like a humanlike