Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations requiring humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce \emph{Spot The Bot}, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are human participants in these conversations). These annotations then allow us to rank chatbots regarding their ability to mimic the conversational behavior of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chatbot can uphold human-like behavior the longest, i.e., \emph{Survival Analysis}. This metric can correlate a bot's performance with certain of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their development cycle. We empirically validate our claims by applying \emph{Spot The Bot} to three domains, evaluating several state-of-the-art chatbots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.


Introduction
Evaluation is a long-standing issue in developing conversational dialogue systems (i.e., chatbots). The underlying difficulty lies in the open-ended nature of the problem, as chatbots do not solve a clearly defined task whose success can be measured in relation to an a priori defined ground truth. Automatic metrics have so far failed to show high correlation with human evaluations (Liu et al., 2016; Lowe et al., 2017; Mehri and Eskenazi, 2020).

Human evaluation approaches are mainly classified along two axes: single-turn vs. multi-turn evaluation, and direct user evaluation vs. expert judge evaluation. Single-turn analysis is usually performed by a human judge who rates a single response of the bot to a given context, whereas multi-turn analysis is often performed by a user who interacts with the bot and rates the interaction. Single-turn ratings disregard the multi-turn nature of a dialogue (See et al., 2019). Although more and more multi-turn evaluations are performed, most of them are based on human-bot conversations, which are costly to obtain and tend to suffer from low quality (Dinan et al., 2020a). The instructions to be followed by annotators are often chosen ad hoc, and there are no unified definitions. Compounded with the use of often-criticized Likert scales (Amidei et al., 2019a), these evaluations often yield low agreement. The required cost and time also inhibit the widespread use of such evaluations, which raises questions about the replicability, robustness, and thus significance of the results.

In this work, we present the Spot The Bot framework, a cost-efficient evaluation methodology that can be used to rank several bots with regard to their ability to disguise themselves as humans. It works as a multi-turn evaluation with human judges. Spot The Bot is based on two observations: First, chatbots are trained on conversations between humans, and thus, they should be evaluated regarding their ability to mimic human behavior.
Second, the longer a conversation is, the more likely it is that a bot exhibits non-human-like behavior. Spot The Bot works by generating conversations between bots, then mixing these bot-bot conversations with human-human conversations and letting human judges decide for each entity in the conversations whether it is a human or a bot. The conversations are rated at different points in time, which introduces a time-dependent component. This setting allows for two different analyses: a ranking based on pairwise comparisons of bots, and the application of Survival Analysis, which computes the survival rate of each bot at different conversation lengths. Furthermore, the human judges annotate the entities with respect to more fine-grained features, which can be chosen based on characteristics that the bots are expected to exhibit (e.g., fluency or informativeness). The Survival Analysis further allows us to pin down the features that contribute to a dialogue system's survival, enabling interpretable results.

We show that our framework produces reliable, repeatable results while being quicker and more cost-effective to run than related approaches, as it does not rely on human-bot conversations and generally requires fewer annotations. Furthermore, we show that disagreement between human annotators can be interpreted as a feature of a system's performance, rather than as a weakness of the evaluation approach. We apply the framework to three well-known domains, comparing common baselines and state-of-the-art systems, and produce a stable ranking among them. We release the framework as a ready-to-use tool for evaluating dialogue systems, into which different systems can be plugged and compared.1

Related Work
There exist various methods to evaluate dialogue systems, both automatic and human-based, but no single evaluation metric is widely agreed upon in the scientific community (Deriu et al., 2020). Automatic evaluation metrics for chatbots are known to correlate poorly with human ratings (Liu et al., 2016; Lowe et al., 2017; Mehri and Eskenazi, 2020), so we focus on human-based approaches, which can be classified along two dimensions: 1) single-turn vs. multi-turn approaches, and 2) approaches where the dialogue systems are judged by the user directly (interactive) or where judgments are made by objective experts who do not participate in the dialogue (static).

Single-turn Static Evaluations. Evaluations based on a static context and a single response from the dialogue system are widely adopted. Usually, the rating is performed by expert raters who read the response of one or more dialogue systems to a static context and rate the responses (Galley et al., 2018). Alternatively, the responses of two bots can be compared directly to choose a preferred answer (Li et al., 2016). While being relatively time- and cost-efficient, single-turn evaluation fails to capture the quality of a conversation as a whole. A system that tends to produce repeated answers can obtain a high single-turn score, albeit a low multi-turn one (See et al., 2019). Some authors also report poor inter-annotator agreement (Ghandeharioun et al., 2019).
Human-Bot Conversations. In order to perform interactive multi-turn evaluations, the standard method is to let humans converse with a chatbot and rate it afterward (Ghandeharioun et al., 2019), typically using Likert scales (van der Lee et al., 2019). The ConvAI2 challenge (Dinan et al., 2020b) and the Alexa Prize (Venkatesh et al., 2018) applied this procedure. Apart from the high cost of collecting human-bot conversations, this approach puts a high cognitive strain on humans, as they have to perform several tasks at once (Schmitt and Ultes, 2015). Besides, it is not always possible to elicit sensible responses from bots, making it hard to collect high-quality conversations. In fact, in the ConvAI2 challenge, half of the collected human-bot conversations were discarded due to their low quality (Dinan et al., 2020b). Finally, Likert scales are known to suffer from high annotation variance (Ghandeharioun et al., 2019), require a posteriori normalization, are prone to order effects, and are less reliable than ranking-based ratings (Amidei et al., 2019b).
Self-talk. Recently, using self-talk dialogues, i.e., dialogues where a bot talks to itself, has gained traction as a cost-effective basis for evaluation. This idea is closely related to user simulations used to evaluate task-oriented systems (Schatzmann et al., 2006). Ghandeharioun et al. (2019) and Deriu and Cieliebak (2019) use self-talk to produce automatic evaluations. In ACUTE-EVAL, the authors propose to let humans evaluate self-talk dialogues. Since self-talk does not allow direct comparisons between bots, the authors let humans read two self-talk conversations side-by-side and rate them with respect to various features. This increases the cognitive complexity of the annotation task. Furthermore, the resulting ranking of the bots is per criterion, whereas our method produces a single ranking and can optionally incorporate annotations of features that make the results interpretable.

Figure 1: Overview of the Spot The Bot process for one conversation. 1: A bot-bot conversation is segmented into different lengths (e.g., 2, 3, and 5 exchanges). 2: These segments are shown to distinct sets of annotators, who judge whether each entity is a bot. 3: The winner is determined for each annotated segment and the survival statistics are updated. This process is repeated for all conversations between the competing bots.
Turing Test. Spot The Bot is reminiscent of the Turing Test (Turing, 1950), as the dialogue systems are evaluated based on their ability to mimic human behavior. The Turing Test served as a useful mental model for understanding what machine intelligence might mean. However, it has also been criticized as a way to identify intelligence in NLP systems. Bender and Koller (2020) argue that a system may fool a user into believing it is human, yet this does not prove that the system understands the meaning of the conversation it is having. In our approach, we claim that failing the test is a valid indicator for discriminating among bots. In fact, we presume that eventually all bots will fail the test, and we collect a time component to record how long it takes for a bot to be detected.

Spot The Bot
In this section, we first provide an overview of the Spot The Bot framework and then describe the individual steps of the evaluation process.

Overview
Spot The Bot employs a tournament among chatbots to determine which performs best at mimicking the conversational behavior of humans. To measure the success of each bot, human crowdworkers are shown conversations between two competing bots at a time, mixed with conversations between two humans. The crowdworkers' task is to determine for each entity in a conversation whether it is a human or a bot (or to state that they are unsure). The bot that is most frequently annotated as being human wins the tournament. Figure 1 provides an overview of the process for one conversation. There are different use cases for Spot The Bot, e.g., when a novel dialogue strategy is to be compared against existing ones or when a set of chatbots is to be ranked in the context of a shared task. On top of returning a ranking, Spot The Bot employs Survival Analysis, which introduces a time aspect into the evaluation and provides insights into how different features correlate with the bots' ability to pass as human.

Formally, assume a pool of b bots {B_1, ..., B_b}, which is to be ranked. For each pair of bots, a set of conversations is sampled by letting the bots talk to each other, where S_ij denotes the set of conversations between bots B_i and B_j. Each conversation is defined as a sequence of exchanges e_0, ..., e_N, where each exchange consists of two turns, one for each entity: e_i = {t^{e_i}_0, t^{e_i}_1}.
Segmentation. The more exchanges there are in a conversation, the more likely it is that a bot gets recognized as such. Thus, we show different segments of the conversation to the crowdworkers. A segment is defined as the first k exchanges of the dialogue: S^k_ij = e_0, ..., e_k. Thus, an annotator only sees the first k exchanges of the conversation.2 Each segment of the same conversation is rated by a different annotator, so that no annotator sees parts of the same conversation multiple times, which would bias the rating. We choose different segment lengths since we cannot know a priori which length is sufficient for the different bots to be recognized as such.
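The segmentation step can be sketched as follows. This is a minimal illustration, not the framework's actual API: the list-of-pairs conversation representation and the function name are assumptions.

```python
def segments(conversation, lengths):
    """Return the initial segments of a conversation for the given lengths.

    `conversation` is a list of exchanges, where each exchange is a pair of
    turns (one per entity); the segment for length k contains the first k
    exchanges of the dialogue.
    """
    return {k: conversation[:k] for k in lengths if k <= len(conversation)}

# Example: a 5-exchange conversation segmented at lengths 2, 3, and 5,
# the lengths used for the PersonaChat and Dailydialog domains.
conv = [("hi", "hello"), ("how are you?", "fine, you?"),
        ("any hobbies?", "chess"), ("since when?", "childhood"),
        ("impressive", "thanks")]
segs = segments(conv, [2, 3, 5])
```

Each of the three segments is then routed to a different annotator, in line with the constraint above.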
Human Conversations. We add conversations among humans to the pool of conversations that are to be rated. The human conversations are sampled from the training set used to train the dialogue systems in the respective domain. The results of the annotations of the human dialogues establish an upper bound for the evaluation. Also, they are meant to prevent annotators from concluding that all entities are bots.3

Annotation. The annotation procedure works in two steps: First, the annotators have to decide for each entity in a conversation segment whether it is a bot or a human. Second, to correlate the outcome with various characteristics of a bot, the framework allows rating specific features (e.g., fluency or appropriateness). The framework then measures the influence of these features on the survival time of the bots, which serves as an explainability component (cf. Sections 3.3 and 4.2).

Features. We choose three features: sensibleness, specificity (Adiwardana et al., 2020), and fluency. The first two have been shown to capture the core conversational behavior of answering sensibly, without illogical statements, while being specific to the given context of the conversation. The third feature states whether the utterances are grammatically correct and fluent. The features are rated by preference ranking, that is, the annotator states which of the two entities performed better with respect to each feature.

Ranking
We define a win function for the annotations of the pairwise, direct conversations between two bots. The outputs of the win function are aggregated to determine the overall winner of the tournament.
Win Function. Each annotation at each segment length S^k_ij = e_0, ..., e_k of a conversation constitutes the result of one annotation applied by one crowdworker, individually labeling each of the two entities as either bot, human, or unsure. The winner of segment S^k_ij under a crowdworker's annotation is determined by the following ordering of the labels: human > unsure > bot. That is, if bot B_i is assigned the label human and bot B_j has the label bot or unsure, B_i has won the segment.4 Similar to Bojar et al. (2013), we define the win rate of B_i against B_j, aggregating the wins from all segments of all annotations stemming from conversations between bots B_i and B_j, as:

wr(B_i, B_j) = wins(B_i, B_j) / (wins(B_i, B_j) + wins(B_j, B_i))    (1)

where wins(B_i, B_j) denotes the number of annotated segments that B_i wins against B_j.

Ranking. To create the ranking, we follow the approach of Dušek et al. (2018), where the ranking is generated by the TrueSkill (Herbrich et al., 2006) algorithm based on the win rate, and significant differences in performance are determined by bootstrap sampling. The result is a ranked set of clusters, where each cluster is composed of entities that do not show a significant difference in performance.

2 A gamified alternative to fixed-length segments led to unwanted annotator behavior, cf. Appendix B.

3 We investigated whether annotators realize that conversations are either between two bots or between two humans by looking at the ratio of conversations where both entities are labeled identically, but found no evidence that this happens more often than by chance.
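The win function and a Bojar-style win rate can be sketched as follows. This is an interpretation under stated assumptions: the label names and data layout are illustrative, and ties are excluded from the rate, following the win ratio of Bojar et al. (2013).

```python
# Ordering of labels: human > unsure > bot.
RANK = {"human": 2, "unsure": 1, "bot": 0}

def segment_winner(label_i, label_j):
    """Return 'i', 'j', or 'tie' for one annotated segment."""
    if RANK[label_i] > RANK[label_j]:
        return "i"
    if RANK[label_i] < RANK[label_j]:
        return "j"
    return "tie"

def win_rate(annotations):
    """Win rate of bot i against bot j over all annotated segments.

    `annotations` is a list of (label_i, label_j) pairs; tied segments
    count towards neither side.
    """
    outcomes = [segment_winner(a, b) for a, b in annotations]
    wins, losses = outcomes.count("i"), outcomes.count("j")
    return wins / (wins + losses) if wins + losses else 0.5

# 2 wins for i, 1 for j, 1 tie -> win rate 2/3 for bot i.
ann = [("human", "bot"), ("unsure", "bot"), ("bot", "human"), ("human", "human")]
```

The resulting pairwise win rates are what the TrueSkill-based ranking consumes.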

Survival Analysis
While pairwise win rates are well-suited to provide a relative ranking among a pool of bots, they do not serve as an absolute evaluation of a single bot's ability to disguise itself as a human. Also, the segmentation of the conversations introduces a time component, which we leverage to investigate our intuition that bots are more likely to reveal themselves in longer conversations. In our evaluation, a bot that is able to maintain its disguise in long conversations can be said to be most successful. Thus, we complement our evaluation with Survival Analysis. Survival Analysis estimates probabilities for the occurrence of an event at different points in time.
It has a long history in the medical domain, where it is used to estimate the effectiveness of different treatments (Li and Ma, 2013). In engineering disciplines, it is applied to estimate the time to failure of machine components (Eyal et al., 2014). In our case, we are interested in the time, measured in the number of exchanges, until a dialogue system is spotted as such. In addition, Survival Analysis allows us to correlate finer-grained characteristics with the survival probability, letting us inspect which of the annotated features impact a bot's survival.
We interpret the annotation data as follows: the spotting event occurred if the system was annotated as "bot", and the system survived if it was annotated as "unsure" or "human". Let k be the number of exchanges in the annotated conversation segment, meaning that each dialogue system produced k outputs. If the dialogue system was not spotted, we know it survived for at least k exchanges. This is a so-called right-censored data point. If the dialogue system was spotted, we cannot tell the exact number of exchanges it took for the annotator to spot it, meaning it could have taken fewer than k exchanges. We thus record that the spotting event happened in the interval (0, k], a so-called interval-censored event. From this data, we obtain non-parametric estimates of the survival function of the different systems per domain (Turnbull, 1974). To check whether differences between systems are significant, we apply a generalized log-rank test (Zhao and Sun, 2004). We use the Cox Proportional Hazards Model (Cox, 1972) to study the influence of the features outlined in Section 3.1 on the time before the systems are spotted.5
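The censoring scheme above can be encoded as (lower, upper) bounds on the spotting time. The following sketch assumes each annotation is a (segment_length, label) pair; the crude plug-in survival estimate at the end is for illustration only and is not the Turnbull estimator used in the paper.

```python
import math

def to_observation(k, label):
    """Encode one annotated segment of length k as a censored observation.

    Spotted ("bot"): the event happened somewhere in (0, k], i.e., an
    interval-censored observation. "human" or "unsure": the system
    survived at least k exchanges, i.e., right-censored, encoded (k, inf).
    """
    return (0, k) if label == "bot" else (k, math.inf)

def naive_survival(annotations, k):
    """Crude plug-in survival estimate at segment length k: the fraction
    of annotations at exactly that length where the system was not
    spotted."""
    at_k = [label for length, label in annotations if length == k]
    return sum(label != "bot" for label in at_k) / len(at_k)

ann = [(2, "human"), (2, "bot"), (3, "unsure"), (3, "bot"), (5, "bot"), (5, "bot")]
obs = [to_observation(k, label) for k, label in ann]
```

In practice, the (lower, upper) observations would be fed to an interval-censoring-aware estimator rather than the naive per-length fraction.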

Experiments
Domains. We apply Spot The Bot to three widely used domains for conversational dialogue systems: Dailydialog (Li et al., 2017), Empathetic Dialogues (Rashkin et al., 2019), and PersonaChat (Zhang et al., 2018). For each domain,6 we prepared a pool of bots to be ranked and analyzed. For each pair of bots, we sampled |S_ij| = 45 conversations. For this, we seed each conversation with the first exchange of a randomly sampled conversation from the test set. Although the bots could in principle reproduce parts of a training conversation, we did not find evidence of this happening: only 2% of all sampled conversations contain an exchange that can be found in the training material.

For the annotation task, we recruited paid crowdworkers from Amazon Mechanical Turk (AMT). To avoid the results being biased towards the performance of a few crowdworkers, we designed a Human Intelligence Task as a batch of 20 conversations, and each worker was only allowed to work on three batches. We designed the batches so that two segments of the same conversation never appear in the same batch, and each batch contains different segments of different conversations.
Segmentation. The segment lengths are based on the lengths of the dialogues in a domain: since we add human conversations from the training set to be rated, the sampled dialogues should adhere to those lengths. PersonaChat and Dailydialog have longer conversations; thus, we used segments of 2, 3, and 5 exchanges. The Empathetic Dialogues domain has shorter dialogues; thus, we used segment lengths of 1, 2, and 3 exchanges.
Dialogue Systems. For each domain, we prepared a pool of dialogue systems to be ranked, reusing existing systems where applicable. In order to assess the performance of Spot The Bot on weak models, we trained a small sequence-to-sequence model (DR) for only 3 epochs, which returns mostly generic answers. For the Dailydialog domain, we trained all bots in the pool using ParlAI, as there were no pre-trained models available.
To leverage recently developed language models, we fine-tune a GPT-2 (GPT) model (Radford et al., 2018) and a BERT-Rank (BR) model. Additionally, we train a sequence-to-sequence model with attention (S2) to compare the language models to previous state-of-the-art approaches. Together with the DR model, the pool consists of b = 4 systems. For the Empathetic Dialogues domain, we prepared the same pool of models as for Dailydialog.
Since the recently developed Blender model (Roller et al., 2020) is also trained on the Empathetic Dialogues dataset, we add the pre-trained version to the pool (BL). For the PersonaChat domain, we mostly reuse the openly available systems of the ConvAI2 challenge (Dinan et al., 2020a), namely Lost in Conversation7 (LC) and Huggingface8 (HF), which were the top-rated dialogue systems in the challenge, as well as KVMemNN (KV), which served as the baseline. We also add the Blender model, which is trained in this domain as well. In order to have more retrieval-based systems, we train a BertRank (BR) model. Together with the DR model, the pool consists of b = 6 different dialogue systems.

Results. The DR model is ranked last in all domains, which is due to its repetitive nature, which is exposed over the course of a dialogue. In the Dailydialog and Empathetic Dialogues domains, the GPT and BR models perform equally, i.e., they end up in the same cluster. In both domains, systems using pre-trained language models outperform the S2 model, which is learned from scratch; this aligns with related findings. The BL model outperforms all other models in both the PersonaChat and Empathetic Dialogues domains, which is in line with the results presented by the authors of the Blender model (Roller et al., 2020). Furthermore, the LC model is ranked very highly, which corresponds to the findings of the ConvAI2 challenge (Dinan et al., 2020a). However, in Spot The Bot, the KV model is ranked much higher than the HF model, which is not in line with the ConvAI2 evaluation.

Figure 2 shows the survival functions for the three domains. The survival rates produce the same rankings as the pairwise win rates reported in Table 1, except for the Empathetic Dialogues domain, where GPT and BR switch places. Importantly, the distinction between these two is not significant in any of the rankings.
Further non-significant differences within the Survival Analysis are those between S2 and DR in the Empathetic Dialogues domain, between BR and S2 in the Dailydialog domain, and between LC and KV in the PersonaChat domain. All other pairwise comparisons of survival curves are significant at p < 0.05 after correction for multiple comparisons.

Survival Analysis
Feature Influence. For each of the three features (fluency, specificity, and sensibleness), annotators have to specify whether one entity performed better than, the same as, or worse than the other. We encode this information as 1, 0, and -1, respectively, and fit a Cox proportional hazards model (Cox, 1972) for every system independently, with the features as covariates.
The numerical entries in Table 2 refer to the per-feature win rate of each bot, which is computed analogously to Equation 1 using the feature annotations directly. Bold entries in Table 2 show which features have a significant influence on the system being spotted. All significant effects go in the intuitive direction, meaning that a higher feature value leads to longer survival. For example, for the DR model, the fluency feature is significant across all three domains, and together with its low fluency win rate, we can deduce that it is often spotted due to its low fluency. Sensibleness seems to be an important feature across the board, meaning that, in general, bots can be spotted due to inappropriate, nonsensical answers, or hide if they respond in a suitable manner. Interestingly, specificity seems to be mostly unimportant, which could be because the bots are not noticeably unspecific, or because it is an irrelevant feature for the chosen domains.
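The feature encoding and the per-feature win rate can be sketched as follows; the label strings and dictionary layout are illustrative assumptions, and the covariates would in practice be passed to a Cox model fitter.

```python
# Map preference labels to the +1/0/-1 covariate encoding described above.
ENCODING = {"better": 1, "same": 0, "worse": -1}

def covariates(preferences):
    """Encode one annotation's per-feature preferences as Cox covariates."""
    return {feature: ENCODING[label] for feature, label in preferences.items()}

def feature_win_rate(labels):
    """Per-feature win rate, analogous to the bot win rate: wins over all
    decided (non-tie) comparisons."""
    wins, losses = labels.count("better"), labels.count("worse")
    return wins / (wins + losses) if wins + losses else 0.5

cov = covariates({"fluency": "better", "specificity": "same",
                  "sensibleness": "worse"})
```

A positive covariate then corresponds to the entity being preferred on that feature, so a significant positive effect means the preferred entity survives longer.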

On Inter-Annotator Agreement
The robustness of chatbot evaluations is often hampered by low inter-annotator agreement (IAA) (Gandhe and Traum, 2016). Measuring and reporting IAA is not yet a standard practice in evaluating chatbots (Amidei et al., 2019a), and producing annotations with high IAA on open-domain conversations is prone to be impeded by subjective interpretation of feature definitions and idiosyncratic annotator behavior (Bishop and Herron, 2015).

In our setting, annotator disagreement on a bot's human-like behavior can be interpreted as a feature of the bot's performance: a bot that manages to fool one of two annotators into believing it is human can be said to have performed better than a bot that does not manage to fool any annotator. To analyze the annotator agreement in this light, we calculate, per bot and label, the percentage of cases in which both annotators assign the label given that one of them does. With three labels (human, bot, unsure), the chance for random agreement is 0.33. The results, averaged over all investigated domains and segment lengths per bot, are shown in Table 3.9 They confirm that the bots that rank high based on win rates and in the Survival Analysis (BL, GPT, LC) obtain the highest agreement on the human label and the lowest agreement on the bot label. Conversely, the DR system obtains the highest agreement when being identified as a bot, and the lowest when it is perceived as a human. This analysis suggests that the results of our experiments do not stem from random agreement between the annotators, i.e., the annotations of the best- and worst-performing systems show agreement distinctly higher than chance on the respective labels.
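The per-label agreement measure can be sketched as follows; the data layout (one pair of annotator labels per item) is an assumption for illustration.

```python
def conditional_agreement(pairs, label):
    """Per-label agreement: among annotator pairs where at least one
    annotator assigns `label`, the fraction where both do."""
    relevant = [(a, b) for a, b in pairs if label in (a, b)]
    if not relevant:
        return 0.0
    return sum(a == b == label for a, b in relevant) / len(relevant)

# Toy annotations for one bot: each pair holds the two annotators' labels.
pairs = [("human", "human"), ("human", "bot"),
         ("bot", "bot"), ("unsure", "human")]
```

Computing this per bot and per label yields the entries averaged in Table 3.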

On Reliability
One key requirement for an evaluation procedure is that repeated executions result in the same outcome. We measure how many pairwise conversations between two bots are needed to guarantee a stable ranking, i.e., the lower bound on |S_ij| for which the ranking is stable. For each |S_ij| ∈ {3, ..., 45}, we randomly sample |S_ij| conversations for each pair and compute the ranking. We repeat this subsampling procedure 1000 times and measure the minimum |S_ij| that guarantees the same ranking in at least 95% of cases.

Figure 3: Ranking stability experiments. (a) Stability experiment. (b) Leave-one-out experiment. The x-axis denotes the number of pairwise conversations between two bots; the y-axis denotes the rate at which the same ranking is achieved across 1000 repetitions. The horizontal line denotes the 95% mark. The lower figure shows the experiments for the PersonaChat domain when leaving one system out.

Figure 3a shows for each |S_ij| ∈ {3, ..., 45} the proportion of times in which the most frequent ranking occurred. For the Dailydialog domain, |S_ij| = 33 pairwise conversations are enough to guarantee a stable ranking. In the other two domains, this value is only reached with over 40 pairwise dialogues. A more in-depth analysis reveals that ranking stability depends on the significance of the pairwise comparisons. For instance, in the PersonaChat domain, the KV and LC systems are not significantly different, which leads to two different rankings depending on the subsample: in the first, KV and LC are in the same cluster; in the second, LC and KV are in separate clusters, with LC on top. Thus, removing either of them from the pool would yield a more stable ranking. To investigate this further, we applied a leave-one-out stability analysis, i.e., we applied the analysis to B \ {sys_i} for each sys_i ∈ B. Figure 3b shows the result. When leaving either LC or KV out, stability is achieved with 25 pairwise dialogues; when removing one of the other systems, stability is only reached with at least 40 dialogues. Thus, the number of pairwise bot-bot conversations needed for a Spot The Bot evaluation depends on the pool of bots to be evaluated and should be determined empirically.
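The subsampling procedure can be sketched as follows. The `rank` argument stands in for the TrueSkill-based ranking function, which is not reimplemented here; it only needs to return a hashable ranking (e.g., a tuple of bot names).

```python
import random
from collections import Counter

def stability_rates(convs_per_pair, rank, sizes, repeats=1000, seed=0):
    """For each subsample size n, draw n conversations per bot pair,
    recompute the ranking, and report how often the most frequent
    ranking occurs across the repetitions.

    `convs_per_pair` maps a bot pair to its list of annotated
    conversations; `rank` maps such a sample to a hashable ranking.
    """
    rng = random.Random(seed)
    rates = {}
    for n in sizes:
        counts = Counter()
        for _ in range(repeats):
            sample = {pair: rng.sample(convs, min(n, len(convs)))
                      for pair, convs in convs_per_pair.items()}
            counts[rank(sample)] += 1
        rates[n] = counts.most_common(1)[0][1] / repeats
    return rates
```

The smallest n whose rate reaches 0.95 is then the empirical lower bound on |S_ij| for the given pool.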

On Time Efficiency
Evaluation methods that are costly and time-consuming slow down the development cycle of dialogue systems. In our experiments, the average annotation time per conversation segment lies at around 24 seconds for the Dailydialog and PersonaChat domains; for Empathetic Dialogues, it lies at 18 seconds, due to the shorter dialogues. We compare this to the time needed to create conversations between humans and bots. We recruited three dialogue system experts from our lab to interact with the systems; each expert created 5 conversations with each system. The reported average times do not include the time needed to instruct the experts. For the Dailydialog and Empathetic Dialogues domains, it takes over 2 minutes per conversation; for PersonaChat, the time increases to almost 4 minutes. Similarly, the average time for a human-bot conversation in the wild evaluation of the ConvAI2 challenge10 also lies at 4 minutes.11 Considering the 100 dialogues per system used in ConvAI2, the evaluation time would be 400 minutes per system. In Spot The Bot, 40 annotations at 24 seconds each amount to 16 minutes per pair of systems. Assuming a comparison between 5 systems, an approach based on human-bot annotations such as ConvAI2 would require 2 thousand minutes, while Spot The Bot would need only 0.16 thousand minutes.12 Concerning other methods based on self-talk, ACUTE-EVAL did not report the time per annotation, but the authors reported the time required to achieve significant results in PersonaChat, which is close to 30 minutes; our method requires only 16 minutes (with 40 annotations). Thus, Spot The Bot increases the annotation speed while reducing the mental strain on the human raters.
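The back-of-the-envelope comparison can be reproduced as follows. The defaults encode the figures quoted above (100 dialogues per system at 4 minutes each; 40 annotations per pair at 24 seconds each); function names are illustrative, and the crossover near 51 systems matches the footnote on scaling.

```python
def human_bot_minutes(n_systems, dialogues_per_system=100,
                      minutes_per_dialogue=4):
    """Human-bot evaluation (ConvAI2-style): cost grows linearly with
    the number of systems."""
    return n_systems * dialogues_per_system * minutes_per_dialogue

def spot_the_bot_minutes(n_systems, annotations_per_pair=40,
                         seconds_per_annotation=24):
    """Spot The Bot: cost grows with the number of unordered pairs."""
    pairs = n_systems * (n_systems - 1) // 2
    return pairs * annotations_per_pair * seconds_per_annotation / 60

# For 5 systems: 2000 minutes of human-bot evaluation vs. 160 minutes
# of Spot The Bot annotation.
```

Because the pairwise cost is quadratic, the human-bot approach becomes cheaper again only for pools of more than about 51 systems.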

Conclusion
In this work, we introduced Spot The Bot, a robust and time-efficient approach for evaluating conversational dialogue systems. It is based on conversations between bots, which are rated by humans with respect to the bots' ability to mimic human behavior. We show that Spot The Bot yields robust and significant results while reducing the evaluation time compared to other evaluation frameworks. A team of researchers who would like to benchmark their system against four competing chatbots could do so for the cost of fewer than 3 hours of crowdsourced annotations. Spot The Bot helps developers make real progress based on frequent manual evaluations, avoiding the use of noisy automatic metrics or costly once-a-year manual evaluations. We make the framework as well as the data publicly available.

12 The amount of time needed by ConvAI grows linearly with the number of systems, while Spot The Bot (and ACUTE-EVAL) grows quadratically. A pool of five systems seems reasonable for a research team, but even for larger pools (up to 51 systems) Spot The Bot is still more efficient.

A Annotation Tool
Figure 4 shows the annotation tool. The annotator is presented with a segment of the conversation, containing the first i exchanges. In the first step, the annotator needs to decide for both entities separately whether they are human or not. If it is not yet possible to decide, the annotator can choose to state that they are undecided. In the second step, the annotators are asked to state which of the two entities performs better with respect to three different features: fluency, sensibleness, and specificity, with the following definitions:

• Fluency: Which entity's language is more fluent and grammatically correct?
• Sensibleness: Which entity's responses are more sensible? If an answer seems confusing, illogical, contradictory, or factually wrong, then it is NOT sensible.
• Specificity: Which entity's responses are more specific and explicit in the given context? An answer is specific if it can be given only in the current context.

B Gamification
As an alternative to the segmentation approach, we experimented with a gamified version of the annotation tool (see Figure 5). In this version, the annotators were presented with the first turn of the conversation. At each point in time, they could choose either to open the next turn or to make a decision for an entity. Once both decisions were made, the annotators had to decide, for the three aforementioned features, which entity performed better. The task was framed as a game, and the annotators received feedback in the form of a leaderboard. The score was a combination of correctness (whether the entities were classified correctly) and a turn penalty: the more turns they opened, the lower the score.
As an additional incentive, the winner was awarded a bonus payment. However, this approach resulted in unwanted annotator behavior. Some annotators always decided after just one exchange, which led to random annotations; others opened the whole conversation first and then decided. Counteracting these behaviors would have required a lot of fine-tuning of the tool, making the approach unreliable for practical use.

C Experimental Setup
All systems were trained using the ParlAI framework. We used the publicly available models for the Lost in Conversation system, Blender, the Huggingface system, and the KVMemNN. The remaining systems were trained with ParlAI's training functionality using the following hyperparameters. All models were trained for 30 epochs. For all Bert-Rank experiments, we used the Bi-Encoder and optimized only the last four layers due to GPU restrictions. The GPT2 models were trained with the standard settings; due to GPU restrictions, we used the small version of GPT2. The sequence-to-sequence model was trained with two layers of GRUs (Cho et al., 2014), each with 512 hidden units, using the general attention mechanism (Luong et al., 2015) and FastText word embeddings (Bojanowski et al., 2017).

Table 5 shows the win rates and rankings for the fluency feature. For the PersonaChat domain, the ranking differs significantly from the bot-detection ranking, as KV, LC, BR, and HF all fall into the same cluster. Table 6 shows the win rates for the Sensibleness and Specificity Average (SSA).

We apply Spot The Bot to three different domains, all based on conversations between two humans; thus, the dialogue systems learn to imitate human conversational behavior.

PersonaChat. PersonaChat (Zhang et al., 2018) contains dialogues between two humans, each of whom is given a predefined persona. A persona is a set of characteristics of a person (name, occupation, hobbies, etc.), and the goal of the conversation is to mimic the process of getting to know each other.

Dailydialog. Dailydialog (Li et al., 2017) contains dialogues that occur in daily-life situations. The data is crawled from English-learning websites; thus, the dialogues are well curated and rather formal. Furthermore, the data is annotated with features that represent the emotion in the dialogue. For our experiments, we did not make use of these features.

Empathetic Dialogues. Empathetic Dialogues (Rashkin et al., 2019) focuses on empathetic response generation. The dialogues occur between two persons discussing a situation that happened to one of them. There are two roles: the speaker, who describes the situation and their feelings about it, and the listener, who responds empathetically.

For each segment (after 2, 3, and 5 exchanges), the win rate (WR) and the percentage of classifications as human (HP) are shown; the last row shows the percentage of ties.
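The general attention mechanism (Luong et al., 2015) used in the sequence-to-sequence setup above scores each encoder state h_s against the decoder state h_t as score(h_t, h_s) = h_t^T W h_s. A minimal NumPy sketch of this scoring step (shapes and variable names are illustrative, not taken from our ParlAI implementation):

```python
import numpy as np


def luong_general_attention(decoder_state, encoder_states, W):
    """Luong 'general' attention: score(h_t, h_s) = h_t^T W h_s.

    decoder_state:  shape (d,)    current decoder hidden state h_t
    encoder_states: shape (T, d)  encoder hidden states h_1..h_T
    W:              shape (d, d)  learned bilinear weight matrix
    """
    scores = encoder_states @ W @ decoder_state        # (T,) raw scores
    weights = np.exp(scores - scores.max())            # stable softmax
    weights = weights / weights.sum()                  # attention distribution
    context = weights @ encoder_states                 # (d,) weighted sum
    return context, weights
```

In our setup d = 512 (the GRU hidden size); the context vector is then combined with the decoder state to predict the next token.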

F Segment Length Analysis
The intuition behind the segment length is that if the dialogue is too long, most conversational dialogue systems will eventually be exposed as such. Conversely, if the dialogues are too short, there is too little information to discriminate between dialogue systems. Using different conversation lengths ensures that neither extreme dominates. The effect is shown in Table 8, which depicts, for each dialogue system, the rate at which it is classified as human at the three segment lengths. This rate decreases for every system, in line with our intuition. Similarly, the rate of undecided classifications is lower at later segments. In later segments, two phenomena occur. First, the number of ties increases as most dialogue systems get exposed; in the Dailydialog domain, the number of ties rises from 72% to 81%. Second, the differences between the win rates grow: better bots achieve higher win rates, and lower-ranked bots lower ones. However, the win rates are less significant due to the high number of ties. For instance, the GPT model increases its win rate to 0.81, whereas the win rate for S2 decreases from 0.46 to 0.34.
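The interaction between ties and win rates can be made concrete with a tie-aware win rate; a minimal sketch (counting ties as half a win is one common convention, and the exact tie handling in our evaluation may differ):

```python
def win_rate(wins, losses, ties):
    """Pairwise win rate with ties counted as half a win.

    With many ties, the rate is pulled toward 0.5, which is why win-rate
    differences become less significant when most comparisons tie.
    """
    total = wins + losses + ties
    return (wins + 0.5 * ties) / total if total else 0.0
```

Under this convention, a bot that ties 80% of its comparisons can move its win rate only within a narrow band around 0.5, illustrating why late segments yield larger but less significant gaps.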

G On Stability against weak Annotators
One drawback of Likert-scale-based evaluation methods is that many annotations need to be removed due to unreliable annotators (Lowe et al., 2017). Spot The Bot is stable with respect to weak annotators. Since we can measure how often an annotator correctly classifies an entity, we can rate the quality of each annotator; a random annotator would achieve a correctness rate of 50%. Table 9 gives an overview of annotator performance for each domain: the number of annotations (#Ann), the average correctness score (Avg. Corr.), the average correctness score on the human-human conversations (Avg. Hum. Corr.), and the percentage of annotators with a correctness score below 50% (<50%).
The average correctness score is significantly higher than random. For the Dailydialog and Empathetic Dialogues domains, fewer than 10% of annotators achieved a rate below 50%. For the PersonaChat domain, the rate is higher, because stronger dialogue systems were in the pool of bots. The average correctness score for identifying humans is high in all domains. Hence, Spot The Bot proves stable against low-scoring annotators.
When removing all annotators with scores below 75%, the rankings remain stable; only the significance scores decrease, as a large number of dialogues is removed. This stands in contrast to the gathering of conversations between humans and bots, which must be strictly supervised. For instance, the dialogues gathered in the wild evaluation of the ConvAI2 challenge were not usable: when we applied Spot The Bot to these conversations, the humans were rated as bots in 45% of the cases.
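The correctness-based filtering can be sketched as follows (the record format is illustrative; in practice the "true" label is known because every conversation is between two bots, or between two humans in the control dialogues):

```python
def annotator_correctness(records):
    """records: list of (annotator, predicted_label, true_label) tuples,
    with labels "human" or "bot". Returns the classification accuracy per
    annotator; random guessing scores about 0.5.
    """
    stats = {}
    for annotator, predicted, true in records:
        hits, total = stats.get(annotator, (0, 0))
        stats[annotator] = (hits + (predicted == true), total + 1)
    return {a: hits / total for a, (hits, total) in stats.items()}


def filter_annotators(records, threshold=0.75):
    """Drop all records from annotators below the accuracy threshold."""
    scores = annotator_correctness(records)
    keep = {a for a, score in scores.items() if score >= threshold}
    return [record for record in records if record[0] in keep]
```

Because the ranking is computed from the surviving records, this filter trades sample size (and hence significance) for annotator quality, matching the behavior reported above.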