Read my points: Effect of animation type when speech-reading from EMA data

Three popular vocal-tract animation paradigms were tested for intelligibility when displaying videos of pre-recorded Electromagnetic Articulography (EMA) data in an online experiment. EMA tracks the position of sensors attached to the tongue. The conditions were dots with tails (where only the coil location is presented), 2D animation (where the dots are connected to form 2D representations of the lips, tongue surface and chin), and a 3D model with coil locations driving facial and tongue rigs. The 2D animation (recorded in VisArtico) showed the highest identification of the prompts.


Introduction
Electromagnetic Articulography (EMA) is a popular vocal-tract motion capture technique, increasingly used for second language learning and speech therapy purposes. In this situation, an instructor helps the subject reach a targeted vocal tract configuration by showing them a live augmented visualization of the trajectories of (some of) the subject's articulators, alongside a targeted configuration.
Current research into how subjects respond to this training uses a variety of different visualizations: Katz et al. (2010) and Levitt et al. (2010) used a 'mouse-controlled drawing tool' to indicate target areas as circles on the screen, with the former displaying an 'image of [the] current tongue position' and the latter a 'tongue trace'. Suemitsu et al. (2013) displayed a mid-sagittal representation of the tongue surface as a spline between three sensors along the tongue, as well as a palate trace and lip coil positions and targets as circles. Katz and Mehta (2015) used a 3D avatar with a transparent face mesh and a pink tongue rig, including colored shapes that lit up when touched, serving as targets.
For audiovisual feedback scenarios, the optimal manner of presenting the stimuli has not yet been explicitly studied; rather, the experiments have reflected recent software developments. Meanwhile, different tools (Tiede, 2010; Ouni et al., 2012) have emerged as state-of-the-art software for offline processing and visualization. The claim that subjects make gains in tongue gesture awareness only after a practice period with the visualization (Ouni, 2011) underscores the need for research into how EMA visualizations can best be presented to subjects in speech therapy or L2-learning settings.
The main inspiration for this work is the finding of Badin et al. (2010) that showing normally-obscured articulators (as opposed to a full face, with and without the tongue) has a positive effect on the identification of VCV stimuli. An established body of research already focuses on quantifying the intelligibility benefit or realism of animated talking heads, ideally as compared to a video-realistic standard (Ouni et al., 2007; Cosker et al., 2005). However, as the articulators that researchers/teachers wish to present to their subjects in the aforementioned scenario are generally outside the line of sight, these evaluation methods cannot be directly applied to intra-oral visualizations. We aim to fill this gap by comparing commonly-used EMA visualizations to determine which is most intelligible, hoping this may guide future research into the presentation of EMA data in a visual feedback setting.

Method
In this experiment, animations of eighteen CVC English words were presented in silent conditions to participants with differing levels of familiarity with vocal tract animations in an online survey; subjects were asked to identify the word in a forced-choice paradigm (a minimal pair of the prompt could also be chosen) and later give qualitative feedback about their experience speech-reading from the different systems.

Participants
Participants were recruited through promotion on social media, university mailing lists, the internet forum Reddit and Language Log. In sum, 136 complete responses were collected, with three of these excluded for breaking the experiment over several days. We analyze the results of all 84 native English speakers. Participants had varying levels of previous exposure to vocal tract animations: of those analyzed, 43% had seen such animations before, 25% had no exposure, 25% had studied some linguistics but not seen such animations, and 6% considered themselves experts in the topic.

Stimuli
The prompts presented were nine minimal pairs of mono-syllabic CVC words spoken by a single British female speaker recorded for the study of Wieling et al. (2015).
Three of the pairs differed in the onset consonant, three in the vowel, and three in the coda consonant. Care was taken that the pairs had a significant difference in place or manner that would be visible in the EMA visualization.
In order to compare the animations, they were standardized as follows: a frontal view was presented on the left half of the screen, and a mid-sagittal view with the lips to the left on the right half. No waveform or labeling information was displayed. Lip coils were green, tongue coils red, and chin/incisor coils blue. Where surfaces were shown, lips were pink and tongues were red. A palate trace, made using each tool's internal construction method, was displayed in black. A white or light grey background was used. (The experimental design also collected data about whether subjects could perceive differences between the competing animation paradigms, for a separate research question.)

Animations
The animations were produced as follows: dots with tails were produced using functions from Mark Tiede's MVIEW package (Tiede, 2010), with an adapted video-production script for the standardizations mentioned above. 2D animations were produced with VisArtico (Ouni et al., 2012), using its internal video-production processes. 3D animations were produced using a simulated real-time animation of the data in Ematoblender (James, 2016), which manipulates an adapted facial rig from MakeHuman in the Blender Game Engine. See Figure 1 for examples of the three types of visualizations.

Procedure
This experiment was hosted on the platform SurveyGizmo. First, the EMA data was explained and participant background information was collected. This included information about previous exposure to linguistics studies and vocal tract visualizations. A brief training session followed, in which participants saw four prompts covering a wide range of onset and coda consonants in all three animation systems. They were free to play these animations as many times as they wished.
Subsequently, subjects were presented with two silent animations. The animations were either matching or non-matching (minimal pair) stimuli, which were displayed as HTML5 videos in web-friendly formats. The videos could be controlled only via separate 'Play' buttons below each one. For each of these animations the subject was presented with four multiple-choice options (one correct, one minimal pair, and one randomly chosen pair, with the items and order retained across both questions). They were also asked to indicate whether they believed the two stimuli to be the same word or not.
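As an illustration of this option layout, the sketch below assembles the four answer choices for one trial: the correct word, its minimal pair, and one randomly chosen other pair, shuffled once so the same items and order can be retained across both questions. The helper function and the word lists are hypothetical examples, not the study's actual prompts or code:

```python
import random

def build_options(correct, minimal_pair, distractor_pair, seed=0):
    """Assemble the four answer options for one trial: the correct word,
    its minimal pair, and both words of one randomly chosen other pair.
    Fixing the shuffle seed keeps the items and order identical across
    both questions of the trial, as in the experiment."""
    options = [correct, minimal_pair, *distractor_pair]
    random.Random(seed).shuffle(options)
    return options

# Hypothetical example words (not the study's actual minimal pairs)
opts_q1 = build_options("bat", "pat", ("seed", "seen"), seed=42)
opts_q2 = build_options("bat", "pat", ("seed", "seen"), seed=42)
assert opts_q1 == opts_q2  # same items and order for both questions
print(opts_q1)
```

The seeded `random.Random` instance stands in for whatever mechanism the survey platform used to keep the option order consistent within a trial.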
Upon submitting their answers, the subject was asked to view the videos again (as often as they liked) with sound, allowing them to check their answers and learn the mapping between animation and sound. The time that they spent viewing each prompt (for identification and after the answer was revealed) was also measured. After every three questions they were asked to rate their confidence at guessing the prompts' identities. Then, after twelve questions, they were asked to comment on their strategies. Finally, they could complete another six questions, or skip to the concluding qualitative questions.

Data Analysis
The prompt identification task yielded a binomial dataset based on the correctness of the identification. The random assignment of prompt pairs to system combinations led to an unbalanced dataset, which motivated the use of generalized linear mixed-effects regression models (GLMMs) for the analysis (Bates et al., 2015). Random intercepts and slopes were included if they improved the model in a model comparison procedure.
In order to take into account the variability in subject responses, random intercepts for subject were included. Similarly, random intercepts were included for each prompt. The prompt variability was quite extensive and is visualized in Figure 2.
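The analysis above fitted GLMMs with random effects for subject and prompt (Bates et al., 2015). As a minimal self-contained sketch of the fixed-effect part only, the following Python code fits a plain binomial GLM (logistic regression via iteratively reweighted least squares) to simulated identification data. The per-system accuracies and sample sizes are invented for illustration, and the random-effects structure of the actual analysis is deliberately omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated identification outcomes; these accuracies are made up for
# illustration and are not the paper's data.
systems = {"dots": 0.50, "2d": 0.70, "3d": 0.55}
n_per_system = 600

labels, y = [], []
for name, p in systems.items():
    for _ in range(n_per_system):
        labels.append(name)
        y.append(rng.random() < p)
y = np.array(y, dtype=float)

# Design matrix: intercept plus treatment dummies, with "3d" as baseline
X = np.column_stack([
    np.ones(len(labels)),
    [1.0 if s == "2d" else 0.0 for s in labels],
    [1.0 if s == "dots" else 0.0 for s in labels],
])

# Fit the logistic regression by iteratively reweighted least squares
beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))   # predicted P(correct)
    W = mu * (1.0 - mu)               # working weights
    z = eta + (y - mu) / W            # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print(dict(zip(["intercept(3d)", "2d_vs_3d", "dots_vs_3d"], beta.round(3))))
```

With these simulated accuracies, the `2d_vs_3d` coefficient comes out positive (2D identified better than 3D), mirroring the direction of the reported result; a full replication would additionally need the subject and prompt random intercepts, e.g. via lme4's `glmer` in R.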

Results
The resulting model for the identification data included random intercepts for the subject, random intercepts for the prompt (with a random slope for the matched/mismatched condition), and a fixed effect for the system, shown in Table 2. The 2D animation was identified significantly better than the 3D animation. The dots animation performed slightly (but not significantly) worse than the 3D animation.
Even within the most intelligible system (2D graphics), it is evident that there is much variability in how well participants are able to identify the various prompts (see Figure 2).A generalized logistic mixed-effects regression model was fitted to analyze the effects of onset and coda consonants and the nuclear vowel in the prompts.
When assessing the effect of the onset, coda or nucleus on how well people were able to detect the correct utterance, we found that the type of nucleus (i.e., the vowel) was most important. For example, whenever a stimulus contained the vowel /a/, its recognition was better than with a different vowel. In contrast, a stimulus with the vowel /i/ was much less well recognized. As vowels necessitate greater movements, especially of the lips, than consonants, it makes sense that the type of vowel is an important predictor. Given that we only had a limited number of stimuli, including the onset or coda together with the nucleus did not help predict the recognition of the stimulus.
The hypothesized effect on the identification score of question number and time spent watching the videos (a learning effect was expected) was not borne out in the results. Though many subjects improved over time, others worsened, which could be attributed to fatigue or boredom during the long experiment. Similarly, including the subjects' previous experience with linguistics and vocal tract visualizations did not significantly improve the model.

Identification strategies
The model's evidence for the ease of interpreting 2D animations was reflected in participants' comments about the strategies they used for speech-reading. The frequency with which these strategies were mentioned is shown in Table 3. One participant (ID 1233) summed up the particular difficulty of the 'dots with tails' system succinctly: "In the ones with lips and tongue, I spoke each of the possible answers myself and tried to envision how closely my own lips and tongue resembled the videos. In the one with just dots, I was purely guessing."

Pitfalls of the 3D animation
While it might seem somewhat surprising that the 3D animation did not result in (significantly) better recognition than the simplest representation (dots with tails), participants' comments highlight some possible causes.
Firstly, the colors of the lips and tongue were similar, which was especially problematic in the frontal view of this experiment. Though the color choices were made based on VisArtico's color scheme, the 2D animation avoids this problem by excluding the tongue from the frontal view.
Secondly, participants expressed that they would have liked to see teeth and a facial expression in the 3D animation. They also commented that they expected more lip-rolling movement. Indeed, a more realistic avatar with these crucial elements missing may have looked somewhat unnatural.
Some linguistically-experienced participants also indicated that they expected a detailed 3D avatar to indicate nasality, the place where the soft and hard palates meet, or 'what the throat is doing'. Unfortunately, this information is not available from EMA data.
Finally, many subjects commented that they found the 3D animation 'too noisy' and preferred the 'clean' and 'clearer' 2D option. Subjects' descriptions of their personal identification strategies indicate that they often used lip-reading strategies, and that this was easier in 2D, where the lip shape was clear and there was no difficulty with color contrast against the tongue. While the graphics quality of the 3D system was not as clear as that of the other systems, the setup is similar to the 3D state of the art as reported in Katz et al. (2014).

Additional observations
Though the speaker and all analyzed participants identified themselves as English native speakers, two American participants noted that they perceived the British speaker as having a foreign/German accent. Several participants mentioned that their main tactic was mimicking the speaker saying the answer options (and in doing so mimicking their interpretation of the speaker's accent), which they on occasion found difficult. This underlines the usefulness of using dialect-appropriate trajectories for the speech-reader.
In this experiment, all animations were based on EMA recordings from a single speaker in one recording session. In general usage, however, the differing coil placement for each subject and recording session may also affect identification ability. Other visualization methods (e.g., cineradiography or MRI) give a high-dimensional picture of the vocal tract and avoid these problems. However, these technologies are not practical for real-time speech training due to their health risk and cost, respectively. One strategy to compensate for this problem when creating the animations is to use photos of the coil placement during recording to manually specify the offset from the intended placement on the articulator. For example, VisArtico allows the user to specify whether the lip coils were placed close to or above/below the lip opening.

Conclusion
In sum, the simplicity and clarity of 2D graphical animations make them preferable for subjects identifying silent animations of EMA data. The features of the most successful animation paradigm suggest that future EMA animations should include indications of both lip and tongue surface shape. If 3D models are used, they should provide clear and clean demonstrations, in which the edges of the articulators (particularly in the frontal view) can easily be distinguished.

Table 1:
Prompt minimal pairs, by location of difference.

Table 3:
Identification strategy frequency by number of mentions over all participants.