A framework for the automatic inference of stochastic turn-taking styles

Conversant-independent stochastic turn-taking (STT) models generally benefit from additional training data. However, conversants are patently not identical in turn-taking style: recent research has shown that conversant-specific models can be used to refractively detect some conversants in unseen conversations. The current work explores an unsupervised framework for studying turn-taking style variability. First, within a verification framework using an information-theoretic model distance, sides cluster by conversant more often than not. Second, multi-dimensional scaling onto low-dimensional subspaces appears capable of preserving distance. These observations suggest that, for many speakers, turn-taking style as characterized by time-independent STT models is a stable attribute, which may be correlated with other stable speaker attributes such as personality. The exploratory techniques presented stand to benefit speaker diarization technology, dialogue agent design, and automated psychological diagnosis.


Introduction
Turn-taking is an inherent characteristic of spoken conversation. Among models of turn-taking (Jaffe et al., 1967; Brady, 1969; Wilson et al., 1984; J. Dabbs and Ruback, 1987; Laskowski, 2010; Laskowski et al., 2011b), those labeled "stochastic turn-taking models" (Wilson et al., 1984) offer a particular advantage: they do not depend on any definition of what a "turn" might be. This is felicitous, since researchers are in disagreement over the definition. Instead, stochastic turn-taking (STT) models provide a probability that a specific participant speaks at instant t, conditioned on what that participant and her interlocutors were doing at specific prior instants. Whether her speaking constitutes something that might be called a "turn" is not germane to the applicability of STT models.
In their most commonly studied form (Jaffe et al., 1967; Brady, 1969; Laskowski, 2010), STT models condition their estimates on a history that consists exclusively of binary speech/nonspeech variables; extensions to more complex characterizations of the past have been studied (Laskowski, 2012) but comprise the minority. In this binary-feature mode of operation, STT models ablate from conversations the overwhelming majority of the overt information contained in them, including topic, choice of words, language spoken, intonation, stress, voice quality, and voice itself, leaving only speaker-attributed chronograms (Chapple, 1949) of binary-valued behavior. This is a strength particular to STT models: they are language-, topic-, and text-agnostic, and therefore stand to form a universal framework for comparison of conversational behavior, where other methods would need to be extended to cross language, topic, and speech usage boundaries.
Given the paucity of information contained in chronograms, however, it is surprising that they have been efficiently exploited in the supervised tasks of conversation-type inference, participant-role inference, social status inference, and even identity inference. The current article aims to extend understanding of STT models in an unsupervised way, by starting from a theoretically sound distance metric between models of individual, interlocutor-contextualized conversation sides. In the space induced by these distances, experiments and analyses are performed which aim to answer a fundamental question: Do people behave self-consistently, across disparate longitudinal observations, in terms of their turn-taking preferences? (Self-consistency within conversations was studied indirectly in (Laskowski et al., 2011b).) To provide an answer, between-person scatter is compared to within-person scatter, and accounts are sought for both types of variability. The findings reveal that models of persons are in fact self-consistent on average, and that, therefore, both (1) the persons they model are self-consistent, and (2) the modeling framework presented here is capable of capturing that self-consistency, while simultaneously differentiating among persons. The work has important implications for social psychology, diarization technology, and dialogue system design.

Data
The data used in this work was drawn from the ICSI Meeting Corpus (Janin et al., 2003), which consists of 75 multi-party meetings involving naturally occurring, spontaneous speech. It has been claimed that the meetings would have taken place even if they were not being recorded.
DATASET as defined here is limited to all 29 of the Bmr meetings, i.e. those held by the group of 15 researchers working on the Meeting Recorder project at ICSI. Not all 15 persons participated in every meeting; each of the 29 meetings was attended by an average of 6.8 persons. The total number of conversation sides in DATASET is 197. The distribution of sides per participant is shown in Figure 1.
[Figure 1: distribution of conversation sides per participant. Participant labels on the axis: me011, fe008, me013, me018, mn014, fe016, me001, mn017, mn005, me051, me022, me025, me026, me028.]
Each meeting in the ICSI Meeting Corpus contains an interval of time (at the beginning or end of the meeting) marked as Digits, used for microphone calibration. This interval was excluded for the current purposes, as it does not involve conversation. Each recording was left with between 22.8 and 74.5 minutes of data, with an average of 48.4 minutes.

Chronograms
From each meeting C in DATASET, a speech/nonspeech chronogram (Chapple, 1949) was constructed, designated by Q. Q is a matrix with binary entries in {0, 1}, designating non-speech and speech respectively. Rows represent the K persons participating in the meeting, while columns represent 100-ms time frames covering its temporal support. The average Q in DATASET thus contained approximately K = 7 rows and T = 29,000 columns.
The cell in row k and column t of every Q was populated with a value of 0 or 1 by inspecting the forced alignments to the manually transcribed speech attributed to the kth speaker of the corresponding meeting. The transcriptions, attributions, and alignments had been made available by ICSI in (Shriberg et al., 2004). A frame increment of 100 ms was chosen as in (Laskowski et al., 2011b) and (Laskowski et al., 2011a); this is shorter than the average syllable duration, ensuring that no speech is missed, but longer than the frame step of the recognizer used by ICSI for the forced alignment. This makes the models developed in the current work robust to imprecision in word start and end times.
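The construction of a chronogram can be illustrated with a minimal Python sketch. The function names, the interval format (integer start and end times in milliseconds), and the rule that a frame counts as speech whenever any speech interval overlaps it are assumptions made for illustration; they are not taken from the ICSI alignments or tools.

```python
FRAME_MS = 100  # 100-ms frame increment, as in the text


def chronogram_row(intervals_ms, duration_ms, frame_ms=FRAME_MS):
    """Return one chronogram row (a list of 0/1 values) for one participant.

    intervals_ms: iterable of (start_ms, end_ms) speech intervals (assumed format).
    A frame is marked 1 if any speech interval overlaps it (assumed rule).
    """
    n_frames = -(-duration_ms // frame_ms)  # ceiling division
    row = [0] * n_frames
    for start, end in intervals_ms:
        if end <= start:
            continue  # ignore empty or malformed intervals
        first = max(start // frame_ms, 0)
        last = min(-(-end // frame_ms), n_frames)  # exclusive upper frame index
        for t in range(first, last):
            row[t] = 1
    return row


def chronogram(sides_ms, duration_ms):
    """Stack one row per participant into a K x T matrix (list of lists)."""
    return [chronogram_row(intervals, duration_ms) for intervals in sides_ms]


# Hypothetical one-second, two-participant example.
Q = chronogram([[(250, 450)], [(0, 100), (900, 1000)]], duration_ms=1000)
```

Working in integer milliseconds avoids floating-point rounding at frame boundaries.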

Stochastic Turn-Taking Models
The models used in the current work are probabilistic generative models that, given a chronogram Q ∈ {0, 1}^{K×T}, provide the probability that its kth participant will speak during its tth frame. Participants are most commonly (Jaffe et al., 1967; Brady, 1969; Laskowski et al., 2011b) treated as conditionally independent (or "single-source" in the terminology of (Jaffe et al., 1967)); namely, the probability of speaking at frame t for participant k is independent of what the other K − 1 participants do at frame t, but it is conditioned on the joint K-participant history. The history duration, in number of most-recent contiguous frames, is denoted henceforth by τ.
In multi-party conversation, the number K of participants varies from conversation to conversation, leading to a context of variable size. To eliminate this complication, when constructing or accessing the model describing the kth row of chronogram Q, the remaining K − 1 rows (representing the kth participant's interlocutors) are collapsed via an inclusive-OR operation, to provide a single "all interlocutors" row. This results in a conditioning history of τ frames of the kth participant, and τ frames of context describing whether any of the kth participant's interlocutors were speaking at instant t − τ (Laskowski et al., 2011b).
The above method yields a history duration which is independent of K, and lends itself easily to N-gram modeling. The elements of the conditioning history are marshalled into a one-dimensional order, and counts are accumulated as elsewhere for N-grams. This results in a maximum-likelihood (ML) model p_A(q|h) for a sequence denoted A, with q ∈ {0, 1} and h the conditioning history. In (Laskowski et al., 2011b), such models were interpolated with lower-order (smaller-τ) models (Jelinek and Mercer, 1980), yielding smoothed versions of p_A(q|h). In the absence of smoothing, as in the current work, the order of the elements of the (2 × τ)-length history is unimportant, provided it is fixed.
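The collapsing of interlocutors and the accumulation of N-gram counts can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, and the particular ordering of history bits (the modeled participant's τ frames first, then the τ "all interlocutors" frames) is an arbitrary but fixed choice.

```python
from collections import Counter, defaultdict


def collapse_interlocutors(Q, k):
    """Inclusive-OR of all rows of Q except row k, frame by frame."""
    n_frames = len(Q[k])
    return [int(any(Q[j][t] for j in range(len(Q)) if j != k))
            for t in range(n_frames)]


def count_side(Q, k, tau):
    """Count (history, q) events for participant k with tau frames of history.

    The history is a tuple of 2 * tau bits: the participant's own tau most
    recent frames followed by the tau collapsed-interlocutor frames
    (assumed ordering).
    """
    own = Q[k]
    others = collapse_interlocutors(Q, k)
    counts = Counter()
    for t in range(tau, len(own)):
        h = tuple(own[t - tau:t]) + tuple(others[t - tau:t])
        counts[(h, own[t])] += 1
    return counts


def ml_model(counts):
    """Unsmoothed maximum-likelihood estimate p(q|h), as {h: {q: prob}}."""
    totals = Counter()
    for (h, _q), c in counts.items():
        totals[h] += c
    model = defaultdict(dict)
    for (h, q), c in counts.items():
        model[h][q] = c / totals[h]
    return dict(model)


# Hypothetical two-participant chronogram with six frames.
Q = [[0, 1, 1, 0, 1, 1],
     [1, 0, 0, 1, 0, 0]]
model = ml_model(count_side(Q, k=0, tau=1))
```

Because no smoothing is applied, any history that never occurs in a side is simply absent from that side's model.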

Supervised Modeling
In supervised modeling, a model A is constructed from one or more conversation sides attributed to the same speaker, and that model is then applied to a conversation side B whose speaker is unknown. In this case, a commonly used score between generative model A and sequence B is the average negative log-likelihood of the sequence given the model, which is also known as the conditional cross entropy:
\[ H(B \,\|\, A) = -\sum_{h} \sum_{q} p_B(h, q) \, \log p_A(q|h) \tag{1} \]
where p_B(h, q) are the ML joint probabilities observed in sequence B. Equation 1 is often normalized by subtracting the conditional entropy (Cover and Thomas, 1991), yielding the conditional relative entropy or conditional Kullback-Leibler divergence (Cover and Thomas, 1991):
\[ D_{KL}(B \,\|\, A) = H(B \,\|\, A) - H(B) = \sum_{h} \sum_{q} p_B(h, q) \, \log \frac{p_B(q|h)}{p_A(q|h)} \tag{3} \]
For example, in the context of stochastic turn-taking models, Equation 1 was successfully used with zero-normalization of scores (Laskowski, 2014).
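The conditional cross entropy and the conditional Kullback-Leibler divergence described in this subsection can be computed directly from event counts. The following Python sketch assumes that each side is summarized as a mapping from (history, q) pairs to counts, and that logarithms are taken to base 2; both the data layout and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def joint_and_conditional(counts):
    """ML estimates of p(h, q) and p(q|h) from {(h, q): count}."""
    n = sum(counts.values())
    context = Counter()
    for (h, _q), c in counts.items():
        context[h] += c
    joint = {hq: c / n for hq, c in counts.items()}
    cond = {(h, q): c / context[h] for (h, q), c in counts.items()}
    return joint, cond


def cross_entropy(counts_b, counts_a):
    """-sum_{h,q} p_B(h, q) log2 p_A(q|h), in bits.

    Returns infinity when sequence B contains an event to which the
    unsmoothed model A assigns zero probability.
    """
    joint_b, _ = joint_and_conditional(counts_b)
    _, cond_a = joint_and_conditional(counts_a)
    total = 0.0
    for hq, p in joint_b.items():
        p_a = cond_a.get(hq, 0.0)
        if p_a == 0.0:
            return math.inf
        total -= p * math.log2(p_a)
    return total


def conditional_entropy(counts_b):
    """Conditional entropy of B, i.e. the cross entropy of B with itself."""
    return cross_entropy(counts_b, counts_b)


def conditional_kl(counts_b, counts_a):
    """Cross entropy minus conditional entropy; asymmetric and unbounded."""
    return cross_entropy(counts_b, counts_a) - conditional_entropy(counts_b)


# Hypothetical counts over a single one-bit history.
A = {((0,), 1): 1, ((0,), 0): 1}
B = {((0,), 1): 3, ((0,), 0): 1}
```

With unsmoothed models, a single unseen event makes the score infinite, which is one practical reason for interpolating with lower-order models in the supervised setting.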

Unsupervised Modeling
In the unsupervised case, a score does not normally compare a sequence B to a model A, but rather a sequence A to a sequence B (or, alternately, a model trained on sequence A to a model trained on sequence B). Because of this symmetry, it is desirable for the score itself to be symmetric; the conditional Kullback-Leibler divergence in Equation 3 does not exhibit this quality and, additionally, is unbounded. It is therefore customary to compute the conditional Jensen-Shannon divergence (Lin, 1991), which for two equal-weight conditional probability models p_A and p_B is given by
\[ D_{JS}(A, B) = \frac{1}{2} \sum_{h} \sum_{q} p_A(h, q) \, \log \frac{p_A(q|h)}{\bar{p}(q|h)} + \frac{1}{2} \sum_{h} \sum_{q} p_B(h, q) \, \log \frac{p_B(q|h)}{\bar{p}(q|h)} \tag{4} \]
Here, \bar{p}(q|h) is the "joint-source" (i.e. A and B) model; (El-Yaniv et al., 1997) showed that for models of conditional probability, its form is
\[ \bar{p}(q|h) = \lambda_A(h) \, p_A(q|h) + \lambda_B(h) \, p_B(q|h) \]
namely that it is the linear interpolation of the two single-source models, with weights given by their relative probabilities of the occurrence of the context h:
\[ \lambda_A(h) = \frac{p_A(h)}{p_A(h) + p_B(h)}, \qquad \lambda_B(h) = \frac{p_B(h)}{p_A(h) + p_B(h)} \]
The Jensen-Shannon distance, a score which is both bounded and symmetric, is given by
\[ d_{JS}(A, B) = \sqrt{D_{JS}(A, B)} \tag{8} \]
Table 1: Leave-one-out (LOO) modified-KNN classification accuracies, using Jensen-Shannon distances between STT models of individual conversation sides in DATASET. K specifies the maximal number of neighbors; τ is the number of 100-ms frames of conditioning history. Each frame contains 2 bits of information: whether the modeled-side participant was speaking, and whether any of that participant's interlocutors were speaking.
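The conditional Jensen-Shannon divergence and distance described in this subsection can be sketched in Python as follows. The sketch assumes that each side is summarized as a mapping from (history, q) pairs to counts, that the two sides are weighted equally, that logarithms are taken to base 2, and that the distance is the square root of the divergence; the function names are hypothetical.

```python
import math
from collections import Counter


def _joint(counts):
    """ML joint probabilities p(h, q) from {(h, q): count}."""
    n = sum(counts.values())
    return {hq: c / n for hq, c in counts.items()}


def js_divergence(counts_a, counts_b):
    """Equal-weight conditional Jensen-Shannon divergence, in bits."""
    p_a, p_b = _joint(counts_a), _joint(counts_b)
    ctx_a, ctx_b = Counter(), Counter()
    for (h, _q), p in p_a.items():
        ctx_a[h] += p
    for (h, _q), p in p_b.items():
        ctx_b[h] += p
    div = 0.0
    for joint, ctx in ((p_a, ctx_a), (p_b, ctx_b)):
        for (h, q), p in joint.items():
            # Joint-source conditional: the two single-source conditionals
            # interpolated with weights proportional to p_A(h) and p_B(h).
            mix = (p_a.get((h, q), 0.0) + p_b.get((h, q), 0.0)) / (ctx_a[h] + ctx_b[h])
            div += 0.5 * p * math.log2((p / ctx[h]) / mix)
    return div


def js_distance(counts_a, counts_b):
    """Square root of the divergence (assumed definition of the distance)."""
    return math.sqrt(max(js_divergence(counts_a, counts_b), 0.0))


# Hypothetical counts over a single one-bit history.
A = {((0,), 1): 1, ((0,), 0): 1}
B = {((0,), 1): 3, ((0,), 0): 1}
```

Unlike the Kullback-Leibler divergence, this score stays finite when one side contains events unseen in the other, because the joint-source model assigns nonzero probability to every event observed in either side.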

Modified Nearest-Neighbor Classification
A central goal of the current work is the determination of whether two sequences, produced by the same person in different conversations, are more proximate than are two sequences produced by two different persons. One answer to this question can be provided by classifying sequences based on their proximity, of which the formalization is known as K-nearest neighbor classification (Fix and Hodges, 1951). The input to the algorithm is a symmetric, zero-diagonal distance matrix D, whose entries are pair-wise distances.
Here, a modified version of the algorithm is employed. If the speaker g of the side being classified is known to have produced only N_g − 1 other sides in the collection of sides under study, then K is limited to N_g − 1 for that classification trial. The use of such side information may be perceived as unfair; however, the aim is diagnostic, and no effort has been made in the current work to normalize the distances in D for local density differences. In addition, it makes little sense to penalize an analysis for those trials whose speakers produced no other sides in DATASET (cf. Section 2). The results of such a diagnostic test can be usefully compared to the outcome of random guessing under the same circumstances.
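A minimal Python sketch of this diagnostic, under stated assumptions, is given next. Two details are not specified in the text and are assumed here: trials whose speaker has no other side are skipped, and voting ties are broken in favor of the label of the closest tied neighbor. The function name is hypothetical, and K here denotes the number of neighbors, not the number of participants.

```python
from collections import Counter


def modified_knn_loo_accuracy(D, labels, k_max):
    """Leave-one-out accuracy of the modified K-nearest-neighbor diagnostic.

    D: symmetric, zero-diagonal distance matrix (list of lists).
    labels: speaker label of each side.
    k_max: maximal number of neighbors; for each trial it is capped at the
           number of other sides produced by the true speaker.
    """
    n = len(labels)
    sides_per_speaker = Counter(labels)
    correct = trials = 0
    for i in range(n):
        k = min(k_max, sides_per_speaker[labels[i]] - 1)
        if k < 1:
            continue  # assumption: skip speakers with no other sides
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: D[i][j])[:k]
        votes = Counter(labels[j] for j in neighbors)
        top = max(votes.values())
        tied = {label for label, v in votes.items() if v == top}
        # assumption: break ties in favor of the closest tied neighbor
        predicted = next(labels[j] for j in neighbors if labels[j] in tied)
        trials += 1
        correct += predicted == labels[i]
    return correct / trials if trials else float("nan")


# Hypothetical one-dimensional example: three speakers, one with a single side.
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
speakers = ["a", "a", "a", "b", "b", "c"]
D = [[abs(x - y) for y in positions] for x in positions]
accuracy = modified_knn_loo_accuracy(D, speakers, k_max=3)
```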
An alternative approach, consisting of applying clustering to the distance matrix, was also tried; it yielded similar (albeit more difficult to disentangle) results, which are not presented due to space constraints.

Multidimensional Scaling
Finally, multidimensional scaling (MDS; cf. (Borg and Groenen, 2005) for example) was applied in an attempt to embed models in a low-dimensional space and to facilitate visual analysis. The experiments used the smacofSym() function of the smacof package (de Leeuw and Mair, 2009) in R.
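For readers who want to see the underlying iteration, the following pure-Python sketch implements the basic SMACOF update (the Guttman transform with unit weights) on a small distance matrix. It is a stand-in written for illustration and does not reproduce the options or defaults of smacofSym() in R; the random initialization, the fixed iteration count, and the function names are assumptions.

```python
import math
import random


def pairwise(X):
    """Euclidean distances between all rows of the configuration X."""
    return [[math.dist(p, q) for q in X] for p in X]


def raw_stress(D, X):
    """Sum of squared differences between target and embedded distances."""
    E = pairwise(X)
    n = len(D)
    return sum((D[i][j] - E[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof(D, dim, n_iter=200, seed=0):
    """Embed the distance matrix D in `dim` dimensions.

    Returns the final configuration and the raw stress after every iteration.
    """
    rng = random.Random(seed)
    n = len(D)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    history = [raw_stress(D, X)]
    for _ in range(n_iter):
        E = pairwise(X)
        new_X = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j or E[i][j] == 0.0:
                    continue
                ratio = D[i][j] / E[i][j]
                for a in range(dim):
                    new_X[i][a] += ratio * (X[i][a] - X[j][a])
        X = [[v / n for v in row] for row in new_X]  # Guttman transform
        history.append(raw_stress(D, X))
    return X, history


# Hypothetical example: four corners of a 3-by-4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
D = [[math.dist(p, q) for q in points] for p in points]
X, history = smacof(D, dim=2)
```

Each update is guaranteed not to increase the raw stress, so the recorded stress values form a non-increasing sequence, although the iteration may settle in a local minimum.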

Results
For a given τ ∈ {1, 2, 3, ..., 8}, each conversation side q_n of the N = 197 sides in DATASET was used to train a side-specific maximum likelihood (ML) model θ_n. The distance between every pair of models was then computed using Equation 8, leading to a symmetric, zero-diagonal distance matrix D ∈ R_+^{197×197}.
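Assembling the distance matrix is a simple loop over unordered pairs, sketched below in Python. The helper names are hypothetical, and the distance function is passed in as a parameter; in the setting of this paper it would be the Jensen-Shannon distance between two side-specific models.

```python
from itertools import combinations


def distance_matrix(items, dist):
    """Symmetric, zero-diagonal matrix of pair-wise distances."""
    n = len(items)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = dist(items[i], items[j])
    return D


def n_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


# Hypothetical stand-in: scalar "models" compared by absolute difference.
D = distance_matrix([0.0, 1.5, 4.0], lambda a, b: abs(a - b))
```

For N = 197 sides, there are 197 × 196 / 2 = 19306 unordered pairs.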

Diagnostic Classification
D was then used within the modified K-nearest neighbor participant-identity classification framework described in Section 3.5. The achieved accuracies are shown in Table 1.
As can be seen, the highest accuracies are obtained for τ ∈ {2, 3, 4, 5} with K > 7, with an absolute maximum from among those explored of 60%, at τ = 3 and K = 15. This is considerably in excess of 11%, the accuracy of random guessing. The result agrees with the finding of (Laskowski, 2014) that participant identities can frequently be inferred from STT models; the difference with (Laskowski, 2014) is that in the latter work, models were trained on same-person sets of sides in a training portion of the data, rather than on individual sides, and that the asymmetric conditional cross entropy (Equation 2, with zero-normalization) was used rather than Jensen-Shannon divergence (Equation 4).

Diagnostic Classification after Scaling
The computed pair-wise Jensen-Shannon distances lie in a space of unknown effective dimensionality; the determination of that effective dimensionality is one of the implicit aims of the current work. To this end, the distances were embedded in a fixed-dimensionality subspace, using multidimensional scaling (MDS) as described in Section 3.6. All 19306 pair-wise distances comprising D were then re-computed from the MDS-derived positions, and the diagnostic experiment of Section 4.1 was repeated. The results for a 5-dimensional subspace are shown in Table 2.
As can be seen, relative to Table 1, MDS to 5 dimensions actually increases the attainable classification accuracy, to 70% at τ = 5 and K = 17. This suggests that there is considerable noise in the distance estimates, and that scaling effectively collapses some of that variability. The accuracy-maximizing number of dimensions, whose identification is beyond the scope of the current work, is expected to be specific to any particular data set. However, it is notable that for DATASET this "elimination of unwanted variance" occurs for the higher-complexity (τ > 2) models; distances computed using these are more likely to be noisy than those computed using simpler models, for fixed conversation-side durations. Since the τ = 8 context contains the τ = 5 context, this suggests that the duration of the conversations studied here, between 22.8 and 74.5 minutes, may be insufficient to infer robust long-conditioning-history models.
Similar experiments were performed after MDS scaling to each of {4, 3, 2, 1} dimensions. The results are not shown due to space constraints. A summary of the maximum achieved accuracy in each case is depicted in Figure 2.
The figure shows that with each reduction of dimensionality of the embedding subspace, by one additional dimension, the maximum achievable accuracy falls by an increasing amount. Although for a one-dimensional subspace the accuracy of 35% is still considerably above chance (11%), it is already (just) less than halfway to the accuracy achieved without scaling (60%).
At 3 dimensions, the accuracy of 58% is almost the same as that achieved without scaling; it occurs at τ = 6 and K = 17 (not shown). This suggests that the relative magnitudes of the distances are preserved in a continuous small-dimensional space, and may have implications for understanding what STT models actually learn. For example, each of the dimensions may be strongly correlated with an independently measurable human trait or role trait. In that case, such traits could be used to index STT models, for both generation and recognition purposes in multi-party conversational settings.
Figure 2: Maximum accuracy achieved after MDS to each of 1 to 5 dimensions. The accuracies are compared to the maximum accuracy achieved using unscaled distances ("orig") and random guessing with actual LOO priors ("rand").

Model Subspace Visualization
It is serendipitous that, for the data set under investigation, three dimensions suffice to yield a good approximation of the accuracy achievable without scaling. A three-dimensional space is considerably easier to inspect visually, and to understand, than are higher-dimensional spaces. Figure 3 shows the MDS-derived locations, two dimensions at a time. The 197 datapoints, representing models of individual conversation sides, are seen to comprise a cloud with heterogeneous, locally clumpy density. The determinant of the total scatter matrix, given these inferred positions, is 2.74 × 10^3. The determinants of the between-class scatter matrix and the within-class scatter matrix, given the model positions shown in Figure 3, are 3.29 × 10^3 and 2.86 × 10^3, respectively. It appears from these numbers that the variability between different-person sides is on average larger than the variability between same-person sides, which in turn suggests that people exhibit low variability, even across longitudinal spans of many months, relative to what differentiates them from others.
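The scatter matrices referred to above can be computed from embedded positions and speaker labels as in the following Python sketch. The text does not state how the matrices were normalized, so the unnormalized textbook definitions are assumed here (sums of outer products, with the between-class term weighted by class size); the function names are hypothetical.

```python
from collections import defaultdict


def _mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def _add_outer(S, v, w=1.0):
    """Add w * v v^T to the square matrix S in place."""
    for a in range(len(v)):
        for b in range(len(v)):
            S[a][b] += w * v[a] * v[b]


def scatter_matrices(X, labels):
    """Return (total, within-class, between-class) scatter matrices."""
    d = len(X[0])
    S_t, S_w, S_b = ([[0.0] * d for _ in range(d)] for _ in range(3))
    m = _mean(X)
    groups = defaultdict(list)
    for x, label in zip(X, labels):
        groups[label].append(x)
        _add_outer(S_t, [xa - ma for xa, ma in zip(x, m)])
    for pts in groups.values():
        mc = _mean(pts)
        for x in pts:
            _add_outer(S_w, [xa - ca for xa, ca in zip(x, mc)])
        _add_outer(S_b, [ca - ma for ca, ma in zip(mc, m)], w=len(pts))
    return S_t, S_w, S_b


def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            result = -result  # a row swap flips the sign
        result *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return result


# Hypothetical two-speaker example in two dimensions.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
labels = ["a", "a", "a", "b", "b", "b"]
S_t, S_w, S_b = scatter_matrices(X, labels)
```

Under these definitions the total scatter matrix equals the sum of the within-class and between-class scatter matrices, which provides a convenient sanity check.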

Intra-Person Variability
It is relevant to try to determine whether the variability observed among models of the same person is due to actual variability of behavior or to measurement error. One source of measurement error could be the relative duration of conversations, leading to unequally (under)trained models. Figure 4 depicts the five most frequent participants in DATASET, at the same positions as in Figure 3(a), with marker size indicative of the duration of observation.
It can be seen that, broadly, shorter-duration conversations yield models which lie at the periphery of the error ellipses. This indicates that, were conversations longer or models more parsimonious, the resulting error ellipses (shown unchanged from Figure 3(a) in Figure 4) might be tighter, and thereby even more discriminative.
A second potential source of intra-person variability may be not just the duration of observation (i.e. the duration of conversation), but how talkative a person is during a specific conversation. Although the models employed here make no mathematical distinction between speaking and not speaking, in multi-party turn-taking the average participant speaks for only a minority of time, making speaking (versus not speaking) a distinctively marked behavior. Figure 5 is like Figure 4, but marker size is indicative of the amount of speech observed for each side. Figure 5 shows that points lying in the bottom right of the figure represent low quantities of speech per side, globally. This appears to be true for individual speakers separately, particularly for the top three most frequent participants (and me013 most markedly). Since the ellipses appear cigar-shaped, fanning out from the bottom right, these observations suggest that when given the opportunity to speak a lot, participant models "move" to the upper left where they may be even further apart. They also suggest that a quantity encoded in the plane of the first and second MDS dimensions ("DIM1" and "DIM2" in the figure) is the proportion of speech produced by each person, or their "talkativity".

Inter-Person Variability
A source of established (Laskowski et al., 2008) variability in turn-taking models trained using the ICSI Meeting Corpus is the relative seniority of participants within a group. (Laskowski et al., 2008) used the self-reported Education level. Figure 6 retains the topology shown in Figure 3(a), but markers represent the educational level of individual participants in DATASET. It can be seen that students (Undergrad and Grad) occupy exclusively the lower half in the diagram, while Postdoc and Professor are found predominantly in the upper half, but in separate clusters. Persons of type PhD exhibit no such leanings. Figure 6 suggests that education level is indeed discriminated by the STT-model topology inferred via MDS. (Laskowski et al., 2008) observed that despite the fact that persons of type Professor spoke a lot, they appeared to avoid overlap with persons of type Undergrad. Such tendencies are most likely the result of social roles within the organization, and not of educational level per se, but role and education level are probably highly correlated in an academic setting. It may be tentatively concluded that the ("DIM 1","DIM 2") plane also encodes, in addition to each person's "talkativity" (cf. Subsection 5.1), their tendency to initiate and terminate talk in overlap.
It should be noted that, unlike the measurement of intra-person variability, the measurement of inter-person variability is likely a function of the size of the group of people studied. As described in Section 2, the group considered here consists of 15 individuals, some of whom participated in only a handful of conversations. For larger groups, it can be expected that, if models represent interaction styles, inter-person variability under a fixed model order and a fixed observation duration will decrease, since nothing a priori prevents multiple individuals from interacting using the same or a similar-enough style.
Since intra-person variability is independent of the number of other persons considered, it is expected to remain constant under group resizing. The ratio of the inter-person variability to the intra-person variability is therefore likely to decrease with increasingly larger group sizes, when the model complexity and observational duration remain constant.

Training Speaker-Independent Models
That within-person STT-model variability can be smaller than between-person variability, as discovered in the dataset used in the current study, has important consequences for training broad STT models, intended to be applicable to a wide variety of domains and conversational interaction styles. The results presented indicate that including more training data, without careful consideration of its interaction-style content, may bias the model towards the styles present in the training data and therefore away from the styles in test data, since they can be so different. In this sense, the results corroborate earlier, similar findings for domain and topic variability in language modeling within automatic speech recognition.

Potential Impact and Applications
Over and above the immediate recommendations for the training of STT models, the results obtained in the current study may have several consequences for at least three research areas.
First, an understanding of the contexts in which participants to conversation choose to vocalize can usefully inform the construction of speaker diarization systems. Current state-of-the-art diarization technology, as used in the transcription of far-field recordings of multi-party meetings, oversegments the temporal support of the recorded track and then performs agglomerative hierarchical clustering using spectral or voice-print similarity. The prior knowledge used in these systems consists of minimal duration constraints on intervals of single-party talk, as well as the assumption that each instant is associated with exactly one participant speaking. The detection of overlap (or of simultaneous vocalization by more than one speaker), where performed, is generally treated as a post-processing step. Information regarding consistent, participant-specific tendencies in the temporal deployment of talk, the subject of the current study, does not currently feature in any way in the assumptions or priors of today's diarization systems.
Second, dialogue system design can benefit from the results presented, particularly those systems which are conversational and whose behavior is intended to be more natural than that of simple human-query-driven information portals. The confirmation that humans exhibit self-consistency in their temporal deployment of speech, which also makes them different from other people, means that the detection of their style and an orientation to it will result in better predictions, requiring fewer resolutions. If that orientation is perceivable to the human user, the system may appear to the user as more human itself. An additional dimension of human-likeness may be inadvertently communicated by the system if it has its own, self-consistent and differentiable style, which is syntonic with its designed conversational role.
Finally, the results in this study have bearing on the design of diagnostic tools for social psychology, the domain for which STT models were originally invented (Chapple, 1949; Jaffe et al., 1967). (Chapple, 1949) was concerned with the measurement of conversational traits correlated with work performance, whereas (Jaffe et al., 1967) treated clinical settings. A considerable amount of research in this area was conducted in the 1970s and 1980s, primarily in the detection of traits or conditions. However, the models were first-order Markovian (corresponding to τ = 1 in the current work) and often relied on analysis frames as small as 20 ms. The findings presented here indicate that useful speaker-discriminating information is contained as far back as 500 ms (with frames of 100 ms and τ = 5, cf. Subsection 4.2), even when models are trained on single conversations which are as short as 22 minutes long. The obtained results may warrant a re-opening of earlier investigations into diagnostic tools for the health industry.

Conclusions
That people exhibit a degree of consistency in their conversational behavior agrees with common sense, and should not be particularly surprising. A number of earlier works have successfully correlated identity with turn-taking preferences (Jurafsky et al., 2009; Grothendieck et al., 2011). What the analyses in the current work show, and what is surprising, is that this consistency is present even in the very shallow representation implicit in the so-called stochastic turn-taking models. In this representation, words, boundaries, durations, and prosody are markedly absent; only the frame-level occurrence of party-attributed speech activity is captured, and a definition of "turn" is neither needed nor used. Specifically, results indicate that, for conversations whose duration is roughly 48 minutes on average, longitudinally speaker-discriminative models can be learned for a conditioning history which is only 10 bits long: whether the modeled speaker, and any of their interlocutors, were speaking in each of the 5 most recent 100-ms frames. The current study has shown that under these conditions, for groups of 15 people like the ICSI Bmr group, the inferred models exhibit greater between-person variability than within-person variability. The conversants under study appear to have behaved self-consistently, across disparate longitudinal observations, in terms of their turn-taking preferences.
The current experiments also demonstrated that a conversation-side embedding in three dimensions approximately recovers the Jensen-Shannon distances between 10-bit-context STT models. In this embedding, within-person variability was shown to be smaller for longer conversations, implying that over time people can be observed to converge on interaction styles which are even more self-consistent. Although it is premature to unambiguously ascribe meaning to each of the three dimensions obtained using the ICSI Bmr data, jointly they appear to encode: (1) the proportion of conversation-time spent talking; (2) the inclination to initiate and terminate overlap with others; and (3) role-specific behaviors exhibited by members of a hierarchy (with, in the current work, positions within that hierarchy closely correlated with self-reported education level).
The presented work suggests the possibility of inference of speaker-characterizing conversational interaction styles, as well as the indexing of such interaction styles by points in an embedding space consisting of only a few continuous dimensions. It has immediate bearing on the training of intentionally broad, speaker-independent STT models. Finally, the work has the potential to usefully impact the design of speaker diarization algorithms for multi-human conversation settings, of human-like conversational dialogue systems, and of diagnostic software for the health industry.