No Gestures Left Behind: Learning Relationships between Spoken Language and Freeform Gestures

We study relationships between spoken language and co-speech gestures in the context of two key challenges. First, distributions of text and gestures are inherently skewed, making it important to model the long tail. Second, gesture predictions are made at the subword level, making it important to learn relationships between language and acoustic cues. We introduce AISLe, which combines adversarial learning with importance sampling to strike a balance between precision and coverage. We propose the use of a multimodal multiscale attention block to perform subword alignment without the need for explicit alignment between language and acoustic cues. Finally, to empirically study the importance of language in this task, we extend the dataset proposed in Ahuja et al. (2020) with automatically extracted transcripts for audio signals. We substantiate the effectiveness of our approach through large-scale quantitative and user studies, which show that our proposed methodology significantly outperforms previous state-of-the-art approaches for gesture generation. Link to code, data and videos: https://github.com/chahuja/aisle


Introduction
Spoken language has gained more traction in the past decade due to improvements in natural language understanding and speech recognition. With an eye on the future, technologies such as intelligent personal assistants (e.g. Alexa, Siri, Cortana) are likely to also include embodiment to take advantage of the non-verbal communication that people naturally use in face-to-face interactions. As a stepping stone in this direction, it is important to study the relationship between spoken language (which also includes acoustic information) and free form gestures (which go beyond just a pre-defined dictionary of gesture animations). In other words, how can we automatically generate human body pose (gestures) from language and acoustic inputs?
An important technical challenge in such a natural language processing task is modeling the long tail of the language-gesture distribution (see Figure 1). If not addressed directly, computational models will likely focus on the common gestures (e.g. beat gestures) as a way to improve precision, at the cost of reduced coverage for less frequent words and gestures (Ginosar et al., 2019). Hence, when learning these models, we need to not only be accurate for gesture generation, but also handle coverage of both linguistic and visual distributions (Pelachaud, 2009; Kucherenko et al., 2019). In other words, we need models that can balance precision and coverage. Another technical challenge comes from the differences in granularity between language and gestures. Gestures can be triggered at the sub-word level; for example, by a change of intonation in acoustics. Thus, it is important to have sub-word level alignment between language and acoustics to generate freeform gestures.
In this paper, we study the link between spoken language and free form gestures. As a first contribution, we propose Adversarial Importance Sampled Learning (or AISLe), an approach whose main novelty is to bring adversarial learning and importance sampling together to improve coverage of the generated distribution without compromising on precision, at no extra computational cost. As a second contribution, we introduce the use of a neural cross-attention architecture (Vaswani et al., 2017; Tsai et al., 2019) for gesture generation conditioned on spoken language. This idea allows transformer blocks to help with subword alignment between language and acoustic signals. A third contribution is the extension of the dataset proposed in Ahuja et al. (2020) with automatically extracted transcripts for audio signals corresponding to 250+ hours of freeform gesture information and 25 speakers. Our experiments study the effectiveness of our proposed method with a focus on the precision-coverage trade-off. These quantitative experiments are complemented with human perceptual studies as the final judges of generation quality.

Figure 1: A toy representation of the data distribution p_data as a histogram. Colours represent bins from the mode, heavy tail and long tail of p_data respectively. The color-coded envelope covering p_data is the distribution of weights across bins (δy, δx) for the following resampling techniques: (a) No Resampling, (b) Static Resampling, and (c) AISLe. While p_data is a multivariate distribution, we use a 1-dimensional histogram for the sake of demonstration.

Related Work
Language and Speech for Gesture Generation: An early study proposed the Behavior Expression Animation Toolkit (BEAT), which can select and schedule behaviors such as hand gestures, head nods and gaze; it was later extended by applying behavior decision rules to linguistic information obtained from input text (Lee and Marsella, 2006; Marsella et al., 2013; Lhommet et al., 2015; Lhommet and Marsella, 2016). Rule-based approaches were succeeded by deep conditional neural fields (Chiu et al., 2015; Chiu and Marsella, 2014) and Hidden Markov Models for prosody-driven head motion generation (Sargin et al., 2008) and body motion generation (Levine et al., 2009, 2010). These use a dictionary of predefined animations, limiting the diversity of generated gestures. Moving forward, neural networks were employed to predict a sequence of frames for gestures (Hasegawa et al., 2018), head motions (Sadoughi and Busso, 2018) and body motions (Shlizerman et al., 2018; Ahuja et al., 2019; Ginosar et al., 2019; Ferstl et al., 2019) conditioned on a speech input, while Yoon et al. (2019) use only a text input. Unlike these approaches, Kucherenko et al. (2020) rely on both speech and language for gesture generation, but their choice of early fusion to combine the modalities ignores multi-scale correlations (Tsai et al., 2019) between speech and language.
While public datasets of co-speech gestures are available, they are either small (Sadoughi et al., 2015; Tolins et al., 2016; Yoon et al., 2019) or do not contain language information (Ginosar et al., 2019; Joo et al., 2015), which motivates a dataset that resolves these shortcomings.
To tackle the precision-coverage trade-off, methods have been introduced for out-of-distribution detection, but they do not work for implicit models like GANs (Nalisnick et al., 2019). These approaches have similarities to importance weighting (Byrd and Lipton, 2018; Katharopoulos and Fleuret, 2018), which is often used for post-hoc debiasing of the learnt model (Domke and Sheldon, 2018; Grover et al., 2019; Turner et al., 2018), correcting covariate shift (Shimodaira, 2000) and label shift (Garg et al., 2020), imitation learning (Murali et al., 2016; Kostrikov et al., 2018) and curriculum learning (Jiang et al., 2015; Bengio et al., 2009; Matiisen et al., 2019). Byrd and Lipton (2018) observe that sub-sampling from unbalanced categorical classes has a significant effect on a network's predictions. Importance sampling in GANs (Diesendruck et al., 2019; Yi et al., 2019), which re-weights the maximum mean discrepancy between source and target distributions, has been shown to improve coverage on unbalanced datasets, but does not provide insights on precision and coverage in the presence of conditional inputs.

Figure 2: Overview of the key components of our model. Starting at the dataset and going clockwise, audio and transcripts go through sub-word alignment in the generator $G_\theta$ and are decoded to generate a freeform gesture animation. Next, AISLe updates the weighted sampler of the dataset based on the output of the discriminator $D_\eta$ to complete the loop.

Problem Statement
The goal of this cross-modal translation task is to generate a series of freeform gestures that are aligned with the spoken sentence (see Figure 2). By free form gestures, we refer to a sequence of joint positions (a.k.a. poses) of the upper human body, including neck, torso, arms, hands and fingers. On our way to achieving this goal, we work towards solving two challenges: (1) generating gestures from the long tail of the language-gesture distribution while maintaining high precision of these generated gestures and, (2) sub-word level alignment of language, acoustic cues and gestures to account for the differences in frame rates among these modalities.
Formally, we are given a sentence of $K$ language tokens $X^w = \{x^w_0, x^w_1, \ldots, x^w_{K-1}\}$, which has a dynamic frame rate (i.e. each token has a variable time duration dependent on its context) as compared to the fixed frame rate of a sequence of speech features $X^a = \{x^a_0, x^a_1, \ldots, x^a_{T-1}\}$. We want to predict a sequence of $T$ gesture poses $Y^p = \{y^p_0, y^p_1, \ldots, y^p_{T-1}\}$ that co-occur with $X^a$ and $X^w$. Here $y^p_t \in \mathbb{R}^{J \times 2}$ are the xy-coordinates of the $t$-th frame for the $J$ joints of the body skeleton.
This problem can be formalized as learning the true conditional probability distribution $p_{data}(y|x)$ of output $y = Y^p$, given input $x = \{X^a, X^w\}$ consisting of text and speech. We write this in the form of a generator function $G_\theta$ with trainable parameters $\theta$ as:

$$\hat{Y}^p = G_\theta(x) = G_{dec}\big(G_{attn}\big(G^a_{enc}(X^a),\, G^w_{enc}(X^w)\big)\big)$$

where $\hat{Y}^p$ are generated poses from the learnt conditional distribution $p_\theta(y|x)$, which is an approximation of $p_{data}$. $G^a_{enc}$ and $G^w_{enc}$ are the acoustic and language encoders, $G_{attn}$ is the multimodal attention block and $G_{dec}$ is the pose decoder.
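The composition of encoders, attention block and decoder can be sketched at the level of tensor shapes. The sizes, random linear maps and helper names below are toy stand-ins for illustration, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 64, 12               # audio frames, language tokens (toy sizes)
H = 32                      # shared embedding width (an assumption for brevity)
J = 52                      # number of body keypoints, as in PATS

def enc_audio(x_a):         # stand-in for the acoustic encoder G^a_enc
    return x_a @ rng.standard_normal((x_a.shape[1], H))

def enc_lang(x_w):          # stand-in for the language encoder G^w_enc
    return x_w @ rng.standard_normal((x_w.shape[1], H))

def attn(z_a, z_w):         # stand-in for G_attn: map K tokens onto T frames
    scores = z_a @ z_w.T / np.sqrt(H)                   # (T, K)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return np.concatenate([w @ z_w, z_a], axis=1)       # (T, 2H)

def dec_pose(z):            # stand-in for the pose decoder G_dec
    return (z @ rng.standard_normal((z.shape[1], J * 2))).reshape(-1, J, 2)

x_a = rng.standard_normal((T, 128))   # speech features at a fixed frame rate
x_w = rng.standard_normal((K, 768))   # token embeddings (e.g. from BERT)

y_hat = dec_pose(attn(enc_audio(x_a), enc_lang(x_w)))
print(y_hat.shape)          # one xy-coordinate pair per joint per frame
```

The key point is the shape contract: whatever the internals, $G_\theta$ maps a variable-length token sequence and a fixed-rate audio sequence to $T$ pose frames of $J \times 2$ coordinates.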
All our experiments are in an adversarial set-up to alleviate the challenge of overly smooth generation (Ginosar et al., 2019) caused by the reconstruction loss. The generated pose sequence $\hat{Y}^p$ is fed as a signal to the adversarial discriminator $D_\eta$, which tries to distinguish the true pose $Y^p$ from the generated pose $\hat{Y}^p$. This is jointly trained with the generator, which learns to fool the discriminator by generating realistic poses. This adversarial loss (Goodfellow et al., 2014) is written as:

$$L_{adv} = \mathbb{E}_{y \sim p_{data}}\big[\log D_\eta(y)\big] + \mathbb{E}_{\hat{y} \sim p_\theta}\big[\log\big(1 - D_\eta(\hat{y})\big)\big] \tag{3}$$

The model is jointly trained to optimize the overall loss function $L(y, x)$, where $L_{mix}$ is a loss for training the mixture of generators, defined in Section 4.3.
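The two sides of this objective can be sketched as follows. This is the standard non-saturating GAN formulation; the scalar discriminator outputs are placeholders rather than values from any trained model:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D_eta maximizes log D(Y^p) + log(1 - D(Y_hat^p)); we minimize the negation
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def generator_loss(d_fake):
    # non-saturating form: G_theta minimizes -log D(Y_hat^p) to fool D_eta
    return -np.log(d_fake).mean()

d_real = np.array([0.9, 0.8])   # D_eta outputs on true pose sequences
d_fake = np.array([0.2, 0.3])   # D_eta outputs on generated pose sequences
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

As the generator improves, `d_fake` rises toward 0.5 and the generator loss shrinks, while the discriminator loss grows.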

Model
In this section, we present our Adversarial Importance Sampled Learning (or AISLe) paradigm, which is designed to improve coverage while learning accurate relationships between spoken language and gestures. This contribution is described in Section 4.1. Our second contribution is the application of a transformer architecture to the problem of sub-word alignment between language and acoustic features. This model, the Multimodal Multiscale Transformer (MMS-Transformer), is presented in Section 4.2. The remaining components of our full model (pose decoder $G_{dec}$, language encoder $G^w_{enc}$ and acoustic encoder $G^a_{enc}$) are presented in Section 4.3. The key contributions are illustrated in Figure 2 and are summarized by the optimization of the overall loss function $L(y, x)$ with AISLe in Algorithm 1.

Adversarial Importance Sampled Learning (or AISLe)

To improve coverage, we want to be sure that the learnt distribution $p_\theta(y|x)$ is a good approximation of the underlying distribution $p_{data}(y|x)$, including the long tail. Our intuition is to have the model give adaptive importance to the long tail of the gesture distribution while still allowing access to the more likely regions (i.e. modes) of the distribution (see Figure 1). This can be achieved by introducing a multiplicative weight factor $w_\eta(x) = \frac{p_\theta(\tilde{y}|x)}{p_{data}(\tilde{y}|x)}$ in the expected loss function, where $L(y, x)$ is the overall loss function and $p(x)$ is the marginal distribution of the input (i.e. language and acoustics). At a high level, as training progresses, if a generated sample is more likely under the learnt distribution than under the true data distribution, it is given more importance. As this process reaches the desired equilibrium, where $p_\theta \xrightarrow{p} p_{data}$, $w_\eta(x)$ approaches 1 and reverts back to the unweighted loss function.
We first derive this weighted loss function, then show how $w_\eta$ can be estimated practically, in tandem with the adversarial setup of our problem, without any additional computational cost. Finally, we tie it all together with an algorithm for AISLe.

Deriving the Weighted Loss Function: Unlike prior work (Katharopoulos and Fleuret, 2018; Diesendruck et al., 2019), we derive the weighted cost function in Equation 5 from first principles. As illustrated in Figure 1, we divide the support of $p_{data}$ into a grid of multi-dimensional bins of size $(\delta y, \delta x) \in \mathbb{R}^{dim(y)+dim(x)}$, where $dim(.)$ gives the dimensions of a variable. If $(\delta y, \delta x)$ is sufficiently small, it is reasonable to assume that all samples (i.e. pairs of poses and spoken words) in a bin are close to each other. Hence, if the model were to see some, but not all, of the samples in a bin, it would still be able to learn the dynamics between poses and spoken words. As bins in the mode of the distribution have more samples than bins in the tail, the model would learn from samples in the tail less often if we optimized the unweighted loss function $\mathbb{E}_{x \sim p(.)}\, \mathbb{E}_{y \sim p_{data}(.|x)}\, L(y, x)$. This is visually illustrated by the weights proportional to bin frequency in Figure 1(a).
To counteract this imbalance, we first perform a static rebalance of the expected cost by assigning the same weight to each bin, as shown in Figure 1(b). This encourages an equal number of samples to be drawn from each bin during training. Second, the importance of each bin is made proportional to the likelihood of a generated sample belonging to the proposal distribution $p_\theta$: if a sample is more likely to have been generated by $p_\theta$ than by $p_{data}$, then the model has yet to learn the corresponding bin. Multiplying $p_\theta$ into the numerator of Equation 6 gives Equation 5. This appears as adaptive weighting across the support of the data distribution, as shown in Figure 1(c).

Estimation of Importance Weights: We follow a likelihood-free approach (Grover et al., 2019; Turner et al., 2018) to estimate $w_\eta$ by computing the outputs of the discriminator $D_\eta$. Since the optimal discriminator satisfies $D_\eta = \frac{p_{data}}{p_{data} + p_\theta}$, we can rewrite $w_\eta$ in Equation 5 as $w_\eta = \frac{p_\theta}{p_{data}} = \frac{1 - D_\eta}{D_\eta}$. As $D_\eta$ is learnt while optimizing $L(y, x)$ and is computed at every training iteration, there is no additional computational cost in estimating the weights during training. The estimated importance weights are used for data duplication while training (Diesendruck et al., 2019), which is an equivalent alternative to optimizing weighted loss functions. We illustrate the weight update cycle in Algorithm 1.
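A minimal sketch of this weight estimation and duplication-based resampling, assuming $D_\eta$ outputs the probability that a sample is real; the function names and toy scores are illustrative, not from the released code:

```python
import numpy as np

def importance_weights(d_scores, eps=1e-6):
    """w_eta = p_theta / p_data ~= (1 - D) / D for a near-optimal discriminator."""
    d = np.clip(d_scores, eps, 1.0 - eps)   # guard against log-0 style blow-ups
    return (1.0 - d) / d

def sampler_probs(d_scores):
    """Normalized probabilities for a weighted sampler over the dataset."""
    w = importance_weights(d_scores)
    return w / w.sum()

# samples the discriminator confidently rejects (low D) are bins the model
# has yet to learn, so they are drawn more often in the next iteration
d_scores = np.array([0.9, 0.5, 0.2])
p = sampler_probs(d_scores)
rng = np.random.default_rng(0)
batch = rng.choice(len(p), size=8, p=p)   # duplication-based weighting
print(p.round(3))
```

Because these discriminator outputs are already computed for the adversarial loss, reusing them here adds no extra forward passes, matching the paper's no-extra-cost claim.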

Multimodal Multiscale Attention Block
To address the challenge of sub-word alignment, we take inspiration from recent work on self-attention (Vaswani et al., 2017) and cross-attention models (Tsai et al., 2019) to alleviate the need for explicit alignment between audio and language embeddings. Note that these modalities provide complementary information for gesture prediction: audio determines rhythm, pauses and speed of the gestures (i.e. beat gestures), while language can be helpful for iconic or metaphoric gestures (Cassell, 2001). A multimodal attention mechanism can make use of sub-word information from the audio to drive well-timed and meaningful gesture animation.
Consider a temporal sequence of audio embeddings $G^a_{enc}(X^a) = Z^a \in \mathbb{R}^{T \times h_a}$ and language embeddings $G^w_{enc}(X^w) = Z^w \in \mathbb{R}^{N \times h_w}$. We define the audio query as $Q_a = Z^a W^Q_a$, the language key as $K_w = Z^w W^K_w$ and the language values as $V_w = Z^w W^V_w$. Here $W^Q_a \in \mathbb{R}^{h_a \times h}$, $W^K_w \in \mathbb{R}^{h_w \times h}$ and $W^V_w \in \mathbb{R}^{h_w \times h}$ are trainable weights. Subword information from the audio is learnt via a cross-modal attention $CM(Q_a, K_w, V_w) = \mathrm{softmax}\left(\frac{Q_a K_w^\top}{\sqrt{h}}\right) V_w = Z^{aw}$. Following (Tsai et al., 2019), we precede cross-modal attention with a layer of self-attention (Vaswani et al., 2017), which learns correlations among the low-level language features before assessing sub-word information from the audio modality. After cross-modal attention, we add layer normalization (Ba et al., 2016) followed by a pointwise feedforward layer along with residual connections, as described in (Vaswani et al., 2017; Tsai et al., 2019; Devlin et al., 2018). $Z^{aw}$ is now at the same temporal scale as the audio input and hence is concatenated with $Z^a$. This completes the multimodal multiscale attention block $G_{attn}$.
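The cross-modal attention step can be written directly in the paper's notation. The dimensions below are toy values, and the random matrices stand in for the trainable weights; the self-attention, layer-norm and feedforward layers are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, h_a, h_w, h = 64, 12, 32, 24, 16     # toy dimensions

Z_a = rng.standard_normal((T, h_a))        # audio embeddings  G^a_enc(X^a)
Z_w = rng.standard_normal((N, h_w))        # language embeddings G^w_enc(X^w)
W_Q = rng.standard_normal((h_a, h))        # trainable in the real model
W_K = rng.standard_normal((h_w, h))
W_V = rng.standard_normal((h_w, h))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# cross-modal attention: each audio frame attends over all language tokens
Q_a, K_w, V_w = Z_a @ W_Q, Z_w @ W_K, Z_w @ W_V
A = softmax(Q_a @ K_w.T / np.sqrt(h))      # (T, N) audio-to-word alignment
Z_aw = A @ V_w                             # (T, h): language at the audio rate

out = np.concatenate([Z_aw, Z_a], axis=1)  # concatenated with Z^a for G_dec
print(A.shape, Z_aw.shape, out.shape)
```

Note how the attention matrix performs the alignment implicitly: each of the $T$ audio frames receives a convex combination of the $N$ word embeddings, so no explicit word-to-frame mapping is needed.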

Other Network Components
Decoder $G_{dec}$: The decoder $G_{dec}$ takes aligned multimodal representations from $G_{attn}$ to generate output pose sequences. We start with a 1D U-Net (Ronneberger et al., 2015), following (Ginosar et al., 2019), to get $Z = \text{U-Net}([Z^{aw}; Z^a])$. In addition, the distribution of gestures contains multiple modes. Hence, to prevent mode collapse, we use mixture-model-guided sub-generators (Ahuja et al., 2020; Hao et al., 2018; Arora et al., 2017; Hoang et al., 2018), where $\forall m$, $G_m$ is the sub-generator function and $\phi_m$ is the corresponding mixture model prior. While training, the true value of $\phi_m$ can be estimated based on which sub-distribution the pose belongs to. At inference time, we do not have the ground truth pose to make such an estimation. Instead, we train a classification network $H$ to estimate $\phi_m$ at inference time based on the input embedding $Z$. $H$ is optimized via a mode regularization loss $L_{mix} = \mathbb{E}_{\Phi, Z}\, \text{CCE}(\Phi, H(Z))$, where CCE is categorical cross-entropy and $\Phi = [\phi_1, \ldots, \phi_M]$.

Table 1: Human perceptual study comparing our model with prior work and strong baselines over four criteria measuring quality of co-speech gestures. We report the preference scores (higher is better) of a model as compared to the ground truth gestures. 90% confidence intervals around the mean, calculated by a bootstrapped t-test, are also reported.
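The mode-regularization loss is ordinary categorical cross-entropy between the mixture prior $\Phi$ and the classifier output $H(Z)$. A sketch with an assumed $M = 4$ sub-generators and made-up predictions:

```python
import numpy as np

def l_mix(phi_true, phi_pred, eps=1e-9):
    """Categorical cross-entropy CCE(Phi, H(Z)), averaged over a batch."""
    return -(phi_true * np.log(phi_pred + eps)).sum(axis=-1).mean()

M = 4                              # number of sub-generators (toy value)
phi_true = np.eye(M)[[2, 0]]       # mode of the ground-truth pose, one-hot
phi_pred = np.array([[0.1, 0.1, 0.7, 0.1],    # hypothetical H(Z) outputs
                     [0.6, 0.2, 0.1, 0.1]])
print(round(l_mix(phi_true, phi_pred), 3))
```

At inference, the predicted prior selects (or weights) the sub-generator, replacing the ground-truth mode assignment available only during training.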
Language Encoder $G^w_{enc}$: In order to utilize the semantic and contextual information of language, we fine-tune BERT (Devlin et al., 2018) for the task of gesture generation, using an existing implementation with pre-trained weights (Wolf et al., 2019). The contextual dependence allows the model to be exposed to semantic differences in the meaning of the same word. These embeddings model contextual dependence only at the word level, leaving sub-word level dynamics to the multimodal attention block $G_{attn}$.
Audio Encoder $G^a_{enc}$: For audio embeddings, we use a Temporal Convolutional Network (TCN), which has been shown to perform well in speech-conditioned pose generation tasks (Ginosar et al., 2019; Ahuja et al., 2019). In our experiments, the audio encoder consists of a convolution layer, followed by batch normalization (Ioffe and Szegedy, 2015) and ReLU (Nair and Hinton, 2010). We use a similar TCN network for the discriminator $D_\eta$.¹

¹We refer the readers to the appendix for exact implementation and hyperparameters.

Baseline Models: We compare against S2G (Ginosar et al., 2019) and Gesticulator (Kucherenko et al., 2020), the latter conditioning generation following (Perez et al., 2018). In addition to audio and text, features of the duration of each word (i.e. start, end, percentage completed and so on) are used as inputs. To align audio and text, each token is replicated to match its duration, hence performing an explicit alignment between text and audio.

Ablation Models: The components AISLe and $G_{attn}$ are removed from the model one at a time to measure their contribution to gesture generation for the first set of ablation models. Static Rebalancing (Equation 6), which is one step before AISLe, is also used as an ablation model. Finally, the top k% highest-velocity regions (or tails) are used as a sub-sampled dataset; this is a manual method of importance sampling high-velocity gestures.

Evaluation Metrics
Human Perceptual Study: We conduct a human perceptual study on Amazon Mechanical Turk (AMT) to measure human preference for the generated animations on four criteria: (1) naturalness, (2) expressivity, (3) timing and (4) relevance. We show a pair of videos with skeletal animations to the annotators. One of the animations is from the ground-truth set, while the other is a generation from our proposed model or a baseline. With unlimited time and for each criterion, users have to choose the video which they felt was better. We run this study with 20 randomly selected pairs of videos per model per speaker from the held-out set, giving a total of 1500 sample points for each model. We refer the readers to the appendix for more details of the setup.

Table 2: Quantitative comparison of our model as compared to existing work, and ablations with one component missing at a time. The comparisons show the impact of AISLe on coverage and the impact of $G_{attn}$ in our model on precision.

Figure 4: Precision-Coverage Trade-off for all models. Lighter areas represent high PCK and low FID, which is favourable for the model. Contour lines correspond to constant values of PCK/FID. We show the impacts of AISLe, $G_{attn}$ and dataset subsampling with dotted lines traversing the PCK-FID plane, with our model enjoying the best of both worlds.
Precision: To measure the accuracy of the generated gestures, we use two metrics. (1) Probability of Correct Keypoints (PCK) (Andriluka et al., 2014): values are averaged over α = 0.1, 0.2, as suggested in (Ginosar et al., 2019). (2) Mode Classification F1: if the generated pose $\hat{Y}^p$ lies in the same cluster as the ground truth, it was sampled from the correct mode; the F1 measure for this classification task is used to measure the correctness of gesture generation.

Coverage: To measure the coverage of the generated distribution, we use two metrics. (1) Fréchet Inception Distance (FID): the distance between the distributions of generated and ground truth poses (Heusel et al., 2017). (2) Wasserstein-1 distance (W1): the distance between the distributions of generated and ground truth average velocity; the same distance is also calculated for average acceleration.
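A sketch of the PCK computation for a single frame follows. Normalizing the threshold by the ground-truth bounding-box diagonal is an assumption here; the exact person-size convention varies between papers:

```python
import numpy as np

def pck(pred, true, alpha=0.1):
    """Fraction of predicted keypoints within alpha * person-size of the
    ground truth, for one frame of (J, 2) xy-coordinates."""
    scale = np.linalg.norm(true.max(axis=0) - true.min(axis=0))  # bbox diagonal
    dist = np.linalg.norm(pred - true, axis=-1)                  # per-joint error
    return float((dist <= alpha * scale).mean())

true = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
pred = true + np.array([[0.05, 0.0], [0.0, 0.05], [0.2, 0.1], [0.0, 0.0]])
print(pck(pred, true, 0.1), pck(pred, true, 0.2))  # averaged over alphas in the paper
```

One badly placed joint is caught at α = 0.1 but forgiven at α = 0.2, which is why averaging over both thresholds gives a more stable precision score.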

Pose, Audio, Transcripts and Style (PATS) dataset
We extend the Pose, Audio, Transcripts and Style (PATS) dataset (Ahuja et al., 2020) with automatically extracted transcripts for audio signals to study the effect of language and speech on co-speech gesture generation. It offers data for 25 speakers with diverse gestures and linguistic content (Ahuja et al., 2020;Ginosar et al., 2019). Specifically, it contains 15 talk show hosts, 5 lecturers, 3 YouTubers, and 2 televangelists, providing a total of 251 hours of video clips, with a mean of 10.7 seconds and a standard deviation of 13.5 seconds per clip.

Dataset Features
Aligned Transcriptions: As manual transcriptions are often unaligned and not readily available, we use Google Automatic Speech Recognition (Chiu et al., 2018) to collect subtitles and aligned timings for each spoken word. The average Word Error Rate of the transcriptions, calculated against the set of available manual transcriptions (i.e. subtitles) using the Wagner-Fischer algorithm (Navarro, 2001), is 0.29.
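Word Error Rate is the word-level edit distance between the reference and the ASR hypothesis, divided by the reference length. A minimal Wagner-Fischer implementation:

```python
import numpy as np

def wer(reference, hypothesis):
    """Word Error Rate via the Wagner-Fischer edit-distance algorithm."""
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)            # deletions from the reference
    d[0, :] = np.arange(len(h) + 1)            # insertions into the reference
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion, 6 words
```

So a reported average WER of 0.29 means roughly 29 word-level edits (substitutions, insertions or deletions) per 100 reference words.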
Pose: Each speaker's pose is represented via skeletal keypoints collected via OpenPose (Cao et al., 2018), following the approach in Ginosar et al. (2019). It consists of 52 coordinates of an individual's upper body.

Results and Discussions
First, we study the effect of different components of our model on coverage and precision. We follow this with the quantitative effects of dataset subsampling. Finally, we conclude with a discussion on the need for a precision-coverage trade-off in co-speech gesture generation. All models are trained separately for each of the 25 speakers in the PATS dataset, and we report scores averaged over all speakers for comparison.
Comparison with previous baselines: We focus first on the human perceptual study in Table 1, since it is arguably the most important metric. We see a significantly² larger preference for our model as compared to S2G and Gesticulator for all four criteria. Specifically, expressivity sees the largest jump, indicating improved coverage in the generated gestures. A similar trend is seen in the objective scores for coverage in Table 2, which indicates a possible correlation between high coverage and human-judged expressivity of gestures. Interestingly, the PCK score for S2G is not significantly different from ours, indicating that a simple accuracy metric may not be sufficient to judge performance in a co-speech gesture generation task.

²Significance refers to statistical significance inferred using a 90% confidence interval estimated by a 2-sided t-test.

Impact of AISLe on Coverage: Incorporating AISLe while training a generative model shows significant gains on coverage metrics in Table 3. We observe that using Static Rebalancing (Equation 6) instead, which is an extreme version of AISLe, is better than not resampling at all; however, it is unable to reach the performance of AISLe on coverage metrics. A similar trend can be seen in the perceptual study scores in Table 1, where the addition of AISLe makes the generations preferable for most criteria. We also note that, while AISLe generates significant gains on coverage metrics, it still maintains the same level of precision as Static Rebalancing.
Next, we visually compare the distributions of the generated gestures. We use the average velocity of the body as a statistic for motion (or energy (Pelachaud, 2009)), which is one of the key indicators of naturalistic gestures. In Figure 3, we observe that our model is able to (nearly) reproduce the velocity distribution of the ground truth. Models without AISLe shift the velocity of the generated distribution closer to zero, indicating that more gestures were generated with little or no motion, unlike the true data distribution.
Impact of $G_{attn}$ on precision: Removing the Multimodal Multiscale Attention Block ($G_{attn}$) from our model results in a significant dip in precision metrics in Table 2. The relevance of generated gestures to the corresponding spoken language also suffers a significant decrease without $G_{attn}$ in Table 1. These results support our hypothesis that a representation which explicitly learns subword attention between text and audio is a better predictor of the corresponding gestures.
Impact of a Sub-sampled dataset on Precision and Coverage: We find, in Table 3, that pruning the dataset to select samples with a high average velocity (Ours w/o AISLe w/ top x%) is a simple way of improving the support of the generated distribution. While this approach to resampling is a strong baseline for distribution coverage, it reduces the generalizability of the model (i.e. a sharp decrease in PCK and F1 scores), probably due to the missing low-velocity examples during training, which is undesirable.

Precision-Coverage Trade-off: We observe that models without AISLe may have PCK scores comparable to our model but have significantly worse coverage, and hence are not close to the true gesture distribution. Furthermore, models with static rebalancing have improved FID scores but fail to generalize on precision. In Figure 4, the lighter regions have better PCK and FID scores, indicating both high precision and high coverage of a given model. Evaluation would be more robust if we considered precision and coverage as a trade-off instead of two independent criteria. We observe that employing AISLe and $G_{attn}$ helps our model enjoy the best of both worlds by striking a balance between precision and coverage.

Conclusions
In this paper, we studied the relationship between spoken language and free-form gestures. First, we introduced Adversarial Importance Sampled Learning, which combines adversarial learning with importance sampling to strike a balance between precision and coverage at no extra computational cost. Second, this work introduced the use of transformers for gesture generation conditioned on spoken language. Third, we extended the PATS dataset (Ahuja et al., 2020) with automatically extracted transcripts for audio signals to study the effect of language on co-speech gesture generation. We substantiated the effectiveness of our approach through large-scale quantitative and user studies, showing significant improvements over previous state-of-the-art approaches on both precision and coverage.