Weighting Model Based on Group Dynamics to Measure Convergence in Multi-party Dialogue

This paper proposes a new weighting method for extending a dyad-level measure of convergence to multi-party dialogues by considering group dynamics instead of simply averaging. Experiments indicate the usefulness of the proposed weighted measure and also show that in general a proper weighting of the dyad-level measures performs better than non-weighted averaging in multiple tasks.


Introduction
Entrainment is the tendency of speakers to begin behaving like one another in conversation. The development of methods for automatically quantifying entrainment in text and speech data is an active research area, as entrainment has been shown to correlate with outcomes such as success measures and social variables for a variety of phenomena, e.g., acoustic-prosodic, lexical, and syntactic (Nenkova et al., 2008; Reitter and Moore, 2007; Mitchell et al., 2012; Levitan et al., 2012; Lee et al., 2011; Stoyanchev and Stent, 2009; Lopes et al., 2013; Lubold and Pon-Barry, 2014; Moon et al., 2014; Sinha and Cassell, 2015; Lubold et al., 2015). One of the main measures of entrainment is convergence, which is the focus of this paper. Within a conversation, convergence measures how much the similarity of speakers increases over time in terms of linguistic features (Levitan and Hirschberg, 2011).
However, because multi-party interactions are more complicated than dyad-level interactions, it is not clear that the contribution of all group members should be weighted equally. For example, to account for participation differences, Friedberg et al. (2012) proposed a weighting method based on the number of words uttered by each dyad, although this did not yield performance improvements compared to simple averaging. Rahimi et al. (2017b) provided examples of group-specific behaviors that were not properly quantified using simple averaging. While this case study identified potential problems with prior measures, its observations were based on only a few example dialogues and no solutions were proposed.
In this paper, we propose a new weighting method to normalize the contribution of speakers based on group dynamics. We compare our method with participation weighting and simple averaging when calculating group convergence from dyads. We conclude that our proposed weighted convergence measure performs significantly better on multiple benchmark prediction and regression tasks that have been used to evaluate convergence in prior studies (De Looze et al., 2014; Lee et al., 2011; Jain et al., 2012; Rahimi et al., 2017a).

Convergence for Multi-Party Dialogue
The convergence measure that we extend in this paper is adopted from prior work. Originally, convergence between dyads (Levitan and Hirschberg, 2011) was measured by calculating the difference between the dissimilarity of speakers in two nonoverlapping time intervals. If the dissimilarity in the second interval was less than in the first, the pair was said to be converging.
Extending this work, multi-party convergence (Litman et al., 2016) was measured using Non-Weighted (NW) averaging of each pair's convergence, as shown in Equations 1 and 2. GroupDiff_{f,t} is the average group difference for linguistic feature f in time interval t, computed over all pairs (i, j). Convergence is the difference between the GroupDiff values of the two intervals.
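Equations 1 and 2 are not reproduced in this text. A plausible reconstruction from the verbal description is given below. It assumes that x_{i,f,t} (a symbol introduced here for illustration) denotes speaker i's value of feature f in interval t, that |N| is the number of speakers, and that the average runs over ordered pairs. The sign is chosen so that a positive value means the group became more similar.

```latex
\begin{align}
\mathit{GroupDiff}_{f,t} &= \frac{\sum_{i \neq j} \lvert x_{i,f,t} - x_{j,f,t} \rvert}{|N|\,(|N|-1)} \tag{1} \\
\mathit{Conv}_{f} &= \mathit{GroupDiff}_{f,t_1} - \mathit{GroupDiff}_{f,t_2} \tag{2}
\end{align}
```

Averaging over ordered or unordered pairs gives the same value here, since each unordered pair contributes the same absolute difference twice.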
In the next subsections, we introduce two weighted variations of these formulas: a baseline based on participation ratios (Friedberg et al., 2012), and a method based on group dynamics.

Weighting Based on Participation
The idea behind this approach is that the weights for speakers that may have talked very little should be reduced. In prior work on multi-party lexical entrainment (Friedberg et al., 2012), speaker participation was measured by number of uttered words; the participation ratios of speaker pairs were then used as the weights.
Since our work focuses on acoustic-prosodic entrainment, we measure speaker participation by amount of speaking time. The Participation Ratio (PR) of each speaker in a given temporal interval is their total speech time divided by the duration of the interval, including silences. Speech and silence periods are automatically annotated using Praat (Boersma and Heuven, 2002). The Participation-based Weighted (PW) average of convergence for all pairs p in a group is then computed as in Equation 3, where Num_p is the number of pairs and the Participation Ratio of a pair, PR_p, is the sum of the PRs of both speakers in both intervals. Finally, convergence for pair p = (i, j) and two disjoint intervals t_1 and t_2 is calculated as in Equation 4.
Figure 1: A group in which all speakers except Speaker2 are converging to each other.
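Equations 3 and 4 are not reproduced in this text. The Python sketch below shows one reading of the participation-weighted average described above. The function names and data layout are hypothetical, and dividing the weighted sum by the number of pairs (rather than by the sum of the weights) is an assumption suggested by the mention of Num_p.

```python
from itertools import combinations


def participation_ratio(speech_seconds, interval_seconds):
    """Share of an interval (silences included) during which a speaker talks."""
    return speech_seconds / interval_seconds


def pair_convergence(first, second, i, j):
    """Dissimilarity in the first interval minus dissimilarity in the second.

    Positive values mean the pair became more similar (converged).
    """
    return abs(first[i] - first[j]) - abs(second[i] - second[j])


def pw_convergence(first, second, pr_first, pr_second):
    """Participation-weighted mean of pair convergences (assumed form).

    first/second map speaker -> feature value in the first/last interval;
    pr_first/pr_second map speaker -> participation ratio in each interval.
    """
    pairs = list(combinations(sorted(first), 2))
    total = 0.0
    for i, j in pairs:
        # Pair weight: sum of both speakers' ratios over both intervals.
        pr_pair = pr_first[i] + pr_first[j] + pr_second[i] + pr_second[j]
        total += pr_pair * pair_convergence(first, second, i, j)
    # Assumption: normalize by the number of pairs, not by the sum of weights.
    return total / len(pairs)
```

Under this reading, a pair in which both speakers talk little in both intervals contributes proportionally less to the group value.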

Weighting Based on Group Dynamics
Although participation-based weighting decreases the contribution of less active speakers when calculating group convergence, it does not take group convergence dynamics into account. Rahimi et al. (2017b) argue that it might instead be better to decrease the contribution of speakers whose convergence behaviors differ from the rest of the group (e.g., Speaker2 in Figure 1). To tackle this issue, we use weighting to decrease the contribution of outlier speakers. In particular, we propose that the weight for a speaker should be the percentage of individuals who have the same convergence behavior as the speaker. Equation 5 defines our proposed Group Dynamic-Based Weighted (GDW) convergence measure, where G is the set of three categories, G = {Converging, Diverging, MixedBehavior}; g is the set of all individuals who belong to a given category in G; |N| is the number of speakers in the group; and |Num_pair| is the number of pairs. Consider the example in Figure 1. There are 12 pairs (6 unique pairs, since convergence is a symmetric measure). Each speaker is in three unique pairs with the other three members of the group.
If all conversational pairs that a speaker is involved in have positive convergence values, the speaker is converging to the group and is assigned the Converging category. If all of the speaker's pairs have negative values, the speaker is diverging from the group. Otherwise, the speaker has mixed behavior.
The weight for each category is the number of speakers who have the corresponding behavior, normalized by the group size. For example, in a group where all members diverge from each other, the weights will be: converging = 0, diverging = 1, and mixedBehavior = 0. For the group in Figure 1, the weights are: converging = 0, diverging = 1/4, and mixedBehavior = 3/4. So, the group convergence for this example is as follows, where C(i) is shorthand for the sum of the pair convergences of speaker i:
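The worked expression for Figure 1 is missing from this text. Read literally, the definitions above suggest (1/4 · C(2) + 3/4 · (C(1) + C(3) + C(4))) / 12: each speaker's summed pair convergence is weighted by the share of the group with the same behavior, and the total is divided by the number of ordered pairs. The Python sketch below implements that reading. The function names, the pair values, and the exact normalization are assumptions made for illustration, not the authors' code.

```python
from itertools import combinations


def categorize(speaker, pair_conv):
    """Label a speaker from the signs of all pair convergences involving them."""
    values = [c for pair, c in pair_conv.items() if speaker in pair]
    if all(v > 0 for v in values):
        return "Converging"
    if all(v < 0 for v in values):
        return "Diverging"
    return "MixedBehavior"


def gdw_convergence(speakers, pair_conv):
    """Group-dynamics-weighted convergence (reconstructed, assumed form).

    pair_conv maps frozenset({i, j}) -> convergence of that unique pair.
    """
    category = {s: categorize(s, pair_conv) for s in speakers}
    n = len(speakers)
    total = 0.0
    for s in speakers:
        # Weight: share of the group showing the same behavior as s.
        weight = sum(1 for t in speakers if category[t] == category[s]) / n
        # C(s): sum of the convergences of every pair that includes s.
        c_s = sum(c for pair, c in pair_conv.items() if s in pair)
        total += weight * c_s
    # Each unique pair is counted once per member, so divide by ordered pairs.
    return total / (n * (n - 1))


# Hypothetical values in the spirit of Figure 1: Speaker2 diverges from everyone.
speakers = [1, 2, 3, 4]
pair_conv = {
    frozenset(p): (-1.0 if 2 in p else 1.0) for p in combinations(speakers, 2)
}
print(gdw_convergence(speakers, pair_conv))  # 0.125
```

With these made-up values the non-weighted average of the six unique pairs is 0, whereas the weighted value is positive because the pairs involving the single diverging speaker are down-weighted.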

Data
To evaluate the utility of weighting based on group dynamics, we measure acoustic-prosodic convergence in the Teams Corpus (Litman et al., 2016). The corpus includes audio files for 62 teams of 3 or 4 individuals playing a cooperative board game in two sessions. First games (Game1) take significantly longer than second games (Game2) (27.1 vs. 18.4 minutes, p < .001); the two games are in chronological order. The teams are disjoint in participants. We break each game into four equal intervals [1] (including silences) and choose the first and last intervals to compute convergence for eight acoustic-prosodic features: maximum (max), mean, and standard deviation (SD) of pitch; max, mean, and SD of intensity; local jitter [2]; and local shimmer [3]. The features are extracted from each of the first and last intervals for each speaker in each team.
Self-reported pre- and post-game surveys, completed individually, are available for both sessions, including: (1) favorable social outcome measures (perceptions of cohesion, satisfaction, potency/efficacy, and shared cognition), and (2) conflict measures (task, process, and relationship conflicts). Since the favorable measures are highly correlated, we z-scored each outcome separately and averaged the z-scores into a single omnibus favorable group perception scale, then averaged that scale within each team to create a team-level Favorable measure. Since process conflict was the only conflict measure that could be split at the median without making arbitrary choices [4], we z-scored process conflict and averaged it within groups to construct a team-level Process Conflict measure. Favorable and Process Conflict will be used to evaluate the quality of the different convergence measures from Section 2.
[1] Any method of breaking the games to compare two disjoint intervals can be used.
[2] The average absolute difference between consecutive periods, divided by the average period.
[3] The average absolute difference between the amplitudes of consecutive periods, divided by the average amplitude.
[4] The median split is required for our classification tasks.
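The construction of the team-level Favorable measure can be sketched as follows. This is a minimal illustration using only the Python standard library. The function names, the data layout, and the use of the population standard deviation for z-scoring are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of scores to mean 0 and unit (population) SD."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def team_favorable(outcomes, team_of):
    """Average z-scored outcomes per person, then average people per team.

    outcomes: dict outcome_name -> list of scores, one per participant.
    team_of:  list giving each participant's team id (same order).
    """
    standardized = [zscores(scores) for scores in outcomes.values()]
    # Omnibus score per participant: mean of their z-scores across outcomes.
    person = [mean(col) for col in zip(*standardized)]
    teams = {}
    for team, score in zip(team_of, person):
        teams.setdefault(team, []).append(score)
    return {team: mean(scores) for team, scores in teams.items()}
```

The same two steps (standardize across participants, then average within a team) would apply to the single Process Conflict scale.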

Experiments and Discussion
Our experimental evaluations use two tasks that have been used for convergence measure evaluations in previous studies (De Looze et al., 2014; Lee et al., 2011; Jain et al., 2012; Rahimi et al., 2017a).
Predicting Social Outcomes: Our first task examines how the NW, PW, and GDW measures of acoustic-prosodic convergence (independent variables) relate to the social outcome measures (dependent variables) from Section 3. This is similar to prior studies which have evaluated convergence in terms of predicting outcomes (Lee et al., 2011; Rahimi et al., 2017a). We hypothesize that the group-dynamic weighted convergence measure will outperform the non-weighted and participation-based measures.
First, we fit hierarchical multiple regressions in which each of the three groups of convergence measures is entered once at the first level and once at the second, to test whether the second-level predictors significantly improve the explained variance. We only keep predictors with significant coefficients when presenting the models. For Process Conflict, the results show that the NW, PW, and GDW predictor groups perform equally well; no matter which group is entered at the first level, the predictors at the second level do not significantly improve model fit.
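The comparison between levels rests on the standard F test for the change in R² between nested models. A minimal sketch of that statistic is given below; the function and parameter names are illustrative, and the p-value lookup (which needs an F distribution) is left out.

```python
def f_change(r2_reduced, r2_full, added_predictors, residual_df):
    """F statistic for the R-squared gain of a nested (hierarchical) model.

    r2_reduced:       R-squared of the first-level model.
    r2_full:          R-squared after adding the second-level predictors.
    added_predictors: number of predictors added at the second level.
    residual_df:      residual degrees of freedom of the full model.
    """
    delta_r2 = r2_full - r2_reduced
    return (delta_r2 / added_predictors) / ((1.0 - r2_full) / residual_df)
```

The statistic is referred to an F distribution with (added_predictors, residual_df) degrees of freedom, which is how a value reported as ∆F(2, 119) should be read.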
For Favorable, neither PW nor NW in the second level significantly improves performance. However, Table 1 shows that adding the GDW measures at the second level significantly improves a model with only NW features at the first level. The variance explained by Model 2 is significantly above and beyond that of Model 1, ∆R² = 0.048, ∆F(2, 119) = 3.179, p = 0.045. In the reverse order, with GDW at the first level and NW at the second, the improvement at the second level is not significant, ∆R² = 0.031, ∆F(2, 119) = 2.068, p = 0.131. These results indicate that the proposed weighted (GDW) convergence measures (for intensity max and SD) are the best predictors of the favorable social outcome among the three measures of convergence.
Next, we reduce the task from regression to binary classification by splitting the two social outcome variables at the median. We perform Leave-One-Out Cross-Validation (LOOCV) using L2-regularized logistic regression and all eight acoustic-prosodic features to predict the binary outcomes. The results in Table 2 show that the GDW model significantly [6] outperforms both the PW and NW models in predicting the favorable social outcome. In the prediction of process conflict, the PW model outperforms both the NW and GDW models, and its improvement over GDW shows a trend toward significance.
In sum, the results in both tables support our hypothesis for the favorable social outcome, where the proposed GDW convergence measure is a better predictor of the outcome. For process conflict, we do not see any significant difference.
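The classification protocol (median split followed by leave-one-out cross-validation) can be sketched independently of the classifier. In the sketch below, the function names are hypothetical, ties at the median are assigned to the low class by assumption, and the one-feature threshold rule is only a stand-in for the logistic regression used in the experiments.

```python
from statistics import median


def median_split(values):
    """Binary labels: 1 if a value is above the median, else 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def loocv_accuracy(features, labels, fit, predict):
    """Leave-one-out accuracy for any fit/predict pair of callables."""
    correct = 0
    for held_out in range(len(labels)):
        train_x = features[:held_out] + features[held_out + 1:]
        train_y = labels[:held_out] + labels[held_out + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, features[held_out]) == labels[held_out]
    return correct / len(labels)


# Stand-in classifier (hypothetical): threshold the first feature at the
# midpoint between the two class means seen in training.
def fit_threshold(xs, ys):
    pos = [x[0] for x, y in zip(xs, ys) if y == 1]
    neg = [x[0] for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0
```

Because each team contributes two games, the folds are not independent, which is why a corrected test is needed when comparing the resulting accuracies.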
Predicting Real Dialogues: The existence of entrainment should not be incidental. To evaluate this criterion, we use permuted versus real conversations as in (De Looze et al., 2014; Lee et al., 2011; Jain et al., 2012). We hypothesize that GDW will be the best convergence measure for distinguishing real from permuted conversations. For each of the 124 game sessions, we construct artificially permuted versions of the real dialogues as follows. For each speaker, we randomly permute the silence and speech intervals extracted by Praat. Next, we measure convergence for all the groups with permuted audio. We perform a leave-one-out cross-validation experiment to predict real conversations using the convergence measures. We examined several classification algorithms including logistic regression; linear SVM was the only one that showed significant results.
[6] Corrected paired t-test was performed to address instance dependency from both games (Nadeau and Bengio, 2000).
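The permutation step can be illustrated at the level of interval annotations. The sketch below shuffles one speaker's speech and silence segments and rebuilds a contiguous timeline. The tuple layout and function name are assumptions, and the text does not say how the permuted audio itself is reassembled.

```python
import random


def permute_intervals(intervals, seed=None):
    """Shuffle a speaker's speech/silence segments and rebuild the timeline.

    intervals: list of (label, start, end) tuples in seconds, contiguous.
    Returns a new list with the same segment durations in random order.
    """
    rng = random.Random(seed)
    segments = [(label, end - start) for label, start, end in intervals]
    rng.shuffle(segments)
    # Start the rebuilt timeline where the original one started.
    clock = intervals[0][1] if intervals else 0.0
    permuted = []
    for label, duration in segments:
        permuted.append((label, clock, clock + duration))
        clock += duration
    return permuted
```

Shuffling each speaker independently keeps every speaker's total speech and silence time while breaking the temporal alignment between speakers, which is what a non-incidental convergence measure should be sensitive to.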
The "All" results in Table 3 show that none of the models significantly outperforms the majority baseline. To diagnose the issue, we perform the prediction on each game separately. The proposed GDW model significantly outperforms the other models for Game 1. However, for Game 2, none of the results are significantly different. One reason might be that convergence occurs quickly during Game 1, and little further convergence occurs in Game 2. Thus, there is no significant difference between permuted and real convergence for any of the features during Game 2.

Conclusion
In this paper, we introduced a new weighted convergence measure for multi-party entrainment which utilizes group convergence dynamics to weight pair convergences. Experimental results show that the proposed weighted measure is more predictive for two evaluation tasks used in prior entrainment studies: predicting favorable social outcomes and predicting real versus permuted conversations. In future work we plan to apply the proposed weighted convergence measure to features other than acoustic-prosodic, e.g., lexical.