Automatic identification of head movements in video-recorded conversations: can words help?

We present an approach in which an SVM classifier learns to classify head movements based on measurements of velocity, acceleration, and jerk (the third derivative of position with respect to time). The classifier is then used to annotate head movements in new video data. Evaluated against manual annotations of the same data, the automatic annotation achieves an accuracy of 68%, and the results show that including jerk improves accuracy. We then investigate the overlap between temporal sequences classified as either movement or non-movement and the speech stream of the person performing the gesture. The statistics derived from this analysis suggest that word features may help increase the accuracy of the model.

The method for automatic head movement annotation described in this paper is implemented as a plugin to the freely available multimodal annotation tool ANVIL (Kipp, 2004), using OpenCV (Bradski and Kaehler, 2008), combined with a command line script that performs a number of file transformations and invokes the LibSVM software (Chang and Lin, 2011) to train and test a support vector classifier. The script then produces a new annotation in ANVIL containing the learned head movements. The present method builds on Jongejan (2012) by adding jerk to the movement features and by applying machine learning. In this paper we also conduct a statistical analysis of the distribution of words in the annotated data to assess whether word features could be used to improve the learning model.
Research aimed at the automatic recognition of head movements, especially nods and shakes, has addressed the issue in essentially two ways. A number of studies use data in which the face, or a part of it, has been tracked via various devices, and typically train HMM models on these data (Kapoor and Picard, 2001; Tan and Rong, 2003; Wei et al., 2013). The accuracy reported in these studies is in the range of 75-89%.
Other studies instead try to identify head movements from raw video material using computer vision techniques (Zhao et al., 2012; Morency et al., 2005). Different results are obtained depending on a number of factors such as video quality, lighting conditions, and whether the movements are naturally occurring or rehearsed. The best results so far are probably those in Morency et al. (2007), where an LDCRF model achieves an accuracy of 65-75% at a false positive rate of 20-30%, outperforming earlier SVM and HMM models.
Our work belongs to the latter strand of research in that we also work with raw video data.

Movement features
Three time-related derivatives of the position of the face are used in this work as features for the identification of head movements: velocity, acceleration, and jerk. Velocity is the change of position per unit of time, acceleration the change of velocity per unit of time, and jerk the change of acceleration per unit of time. We expect that a sequence of frames in which jerk has a high value in the horizontal or vertical direction will correspond to the most effortful part of the head movement, often called the stroke (Kendon, 2004).
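The three derivatives can be approximated from per-frame head coordinates by repeated finite differencing. The following sketch is illustrative only (the function name, the tuple representation, and the frame rate are our assumptions, not part of the original implementation):

```python
# Hypothetical sketch: per-frame velocity, acceleration, and jerk
# approximated from tracked head positions via finite differences.
def derivatives(positions, fps=25):
    """positions: list of (x, y) head coordinates, one per frame.
    Returns lists of velocity, acceleration, and jerk vectors;
    each differencing step shortens the sequence by one frame."""
    dt = 1.0 / fps

    def diff(series):
        # component-wise difference between consecutive frames, per second
        return [((b[0] - a[0]) / dt, (b[1] - a[1]) / dt)
                for a, b in zip(series, series[1:])]

    velocity = diff(positions)      # 1st derivative of position
    acceleration = diff(velocity)   # 2nd derivative
    jerk = diff(acceleration)       # 3rd derivative
    return velocity, acceleration, jerk
```

For a head moving at constant speed, acceleration and jerk come out as zero vectors, as expected.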

Data, test setup, and results
The data come from the Danish NOMCO corpus (Paggio et al., 2010), a video-recorded corpus of conversational interactions with many different annotation layers (Paggio and Navarretta, 2016), including the type of head movement (nods, turns, etc.).
For this work, two videos sharing one participant were selected at random, and only the head movements performed by that participant are considered. One video is used for training, and the other for testing. In both videos, OpenCV is used to extract the x and y coordinates of the participant's head in each frame; from these coordinates, velocity, acceleration, and jerk measures are calculated for each frame and added to the video annotation. In the video used for training, a boolean feature indicating the presence or absence of head movement in the manual annotation is added to each frame.

A first inspection of the classification results showed that in several cases the classifier detected sequences of movement interrupted by empty frames where the manual annotation consisted of longer spans of uninterrupted movement. Therefore, in the subsequent experiments, all performed with SVM, empty spans (margins) of varying length were considered part of the movement annotation.

In all experiments, using all three movement features together yields the best results. The ratio of true positives to true negatives is maximal at margin = 2, but the maximum accuracy of 68% is reached at a much higher margin of 17 frames, or 0.68 seconds. For comparison, a baseline model that always selects non-movement would reach an accuracy of 64%. Counts of true and false movement and non-movement sequences detected by the classifier are shown in Table (1). Even though we do better than the baseline, the accuracy is still not adequate. Considering that the annotators who created the gold standard had access to the audio channel when they identified the head movements, it is worth considering whether word features could be used to train more sophisticated and accurate models.
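The margin post-processing described above can be sketched as a simple gap-filling pass over the per-frame predictions. This is our reconstruction of the idea, not the paper's actual code; the function name and boolean-list representation are assumptions:

```python
# Hypothetical sketch of the margin post-processing: gaps of up to
# `margin` non-movement frames between two predicted movement spans
# are merged into a single uninterrupted movement annotation.
def apply_margin(frames, margin):
    """frames: list of booleans (True = movement predicted).
    Returns a copy in which short internal gaps are filled."""
    out = list(frames)
    i = 0
    while i < len(out):
        if not out[i]:
            # find the extent of this non-movement gap
            j = i
            while j < len(out) and not out[j]:
                j += 1
            gap = j - i
            # fill only gaps bounded by movement on both sides
            if 0 < i and j < len(out) and gap <= margin:
                for k in range(i, j):
                    out[k] = True
            i = j
        else:
            i += 1
    return out
```

With margin = 2, a two-frame gap between two movement spans is merged, while a three-frame gap and leading or trailing non-movement stretches are left untouched.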

Head movements and words
The relation between head movements and words was investigated by looking at how different kinds of words are distributed over sequences of movement vs. non-movement. We considered distributions in which the word category comprises only real words (with filled pauses counted as non-words), words including filled pauses, only feedback words, and only stressed words. In all cases, we only look at the speech stream of the person performing the movement. The last two categories show the least interesting effects: feedback words have an almost equal, and very low, probability of occurring in movement and non-movement sequences, while stressed words are only slightly more likely to occur with movement than with non-movement (31% vs. 20%). If we look at the distribution of real words vs. non-words including filled pauses, words have a 58% probability of occurring with movement, as opposed to only a 36% probability of occurring with non-movement. Finally, if we take words including filled pauses against no words, the probability of word occurrence with movement is 75% vs. 56% with non-movement. Thus, distinguishing between real words and non-words including filled pauses has the best potential to differentiate between presence and absence of movement, since only in this case do the relative proportions of words and non-words go in opposite directions depending on the sequence type. The differences in this distribution are significant on a chi-square test in both movement and non-movement sequences. All the probabilities are summed up in Tables (2) and (3).
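The chi-square test used here can be computed directly from a 2x2 contingency table of word/non-word counts in movement and non-movement sequences. The sketch below is a generic implementation of the test statistic; the function name and the example counts are purely illustrative and do not come from the paper's data:

```python
# Hypothetical sketch: chi-square statistic of independence for a
# 2x2 contingency table (1 degree of freedom).
def chi_square_2x2(a, b, c, d):
    """Contingency table layout:
                      word   non-word
        movement       a        b
        non-movement   c        d
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    # expected count of each cell under independence:
    # row total * column total / grand total
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

At 1 degree of freedom, a statistic above 3.84 indicates a significant association between word presence and movement at the 5% level.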
To conclude, we have presented an approach in which an SVM classifier is trained to recognise movement sequences based on velocity, acceleration, and jerk. A preliminary investigation of the overlap between temporal sequences classified as either movement or non-movement and the speech stream of the person performing the gesture suggests that using word features may help increase the accuracy of the model, which currently stands at 68%.