Audio-Visual Understanding of Passenger Intents for In-Cabin Conversational Agents

Building multimodal dialogue understanding capabilities situated in the in-cabin context is crucial to enhancing passenger comfort in autonomous vehicle (AV) interaction systems. To this end, understanding passenger intents from spoken interactions and vehicle vision systems is an important building block for developing contextual and visually grounded conversational agents for AVs. Towards this goal, we explore AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling multimodal passenger-vehicle interactions. In this work, we discuss the benefits of multimodal understanding of in-cabin utterances by incorporating verbal/language input together with non-verbal/acoustic and visual input from inside and outside the vehicle. Our experimental results outperform text-only baselines: with the multimodal approach, we achieve improved performance for intent detection.


Introduction
Understanding passenger intents from spoken interactions and visual cues (both from inside and outside the vehicle) is an important building block towards developing contextual and scene-aware dialogue systems for autonomous vehicles. When passengers give instructions to the in-cabin agent AMIE, the agent should parse the commands properly, considering three modalities (i.e., verbal/language/text, vocal/audio, visual/video), and trigger the appropriate functionality of the AV system.
For in-cabin dialogue between car assistants and drivers/passengers, recent studies explore creating a public dataset using a WoZ approach (Eric et al., 2017) and improving ASR for passenger speech recognition (Fukui et al., 2018). Another recent work (Zheng et al., 2017) attempts to classify sentences as navigation-related or not using the CU-Move in-vehicle speech corpus (Hansen et al., 2001), a relatively old and large corpus focusing on route navigation.
We collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz (WoZ) scheme via a realistic scavenger hunt game. In previous work (Okur et al., 2019), we experimented with various RNN-based models to detect the utterance-level intents (i.e., set-destination, change-route, go-faster, go-slower, stop, park, pull-over, drop-off, open-door, other) along with the intent keywords and relevant slots (i.e., location, position/direction, object, gesture/gaze, time-guidance, person) associated with these intents.
In this work, we discuss the benefits of a multimodal understanding of in-cabin utterances by incorporating verbal/language input together with the non-verbal/acoustic and visual cues, both from inside and outside the vehicle (e.g., passenger gestures and gaze from the in-cabin video stream, referred objects outside of the vehicle from the road view camera stream).

Data
Our AMIE in-cabin dataset includes 30 hours of multimodal data collected from 30 passengers (15 female, 15 male) in a total of 20 sessions. In 10 sessions, a single passenger was present, whereas the remaining 10 sessions include two passengers interacting with the vehicle. Participants sit in the back of the vehicle, separated from the driver and the human acting as an agent at the front. The vehicle is modified to hide the operator and the WoZ AMIE agent from the passengers, using a variation of the WoZ approach (Wang et al., 2017). In each ride/session, which lasted about 1 hour or more, the participants played a realistic scavenger hunt game on the streets of Richmond, BC, Canada. Passengers treat the vehicle as an AV and communicate with the WoZ AMIE agent mainly via speech commands. Game objectives require passengers to interact naturally with the agent to go to certain destinations, update routes, give specific directions regarding where to pull over or park (sometimes with gestures), find landmarks (referring to outside objects), stop the vehicle, change speed, get in and out of the vehicle, etc. Further details of the data collection protocol and dataset statistics can be found in (Sherry et al., 2018; Okur et al., 2019). See Fig. 1 for the vehicle instrumentation enabling the multimodal data collection setup.

Dataset Statistics
The multimodal AMIE dataset consists of in-cabin conversations between the passengers and the AV agent, with 10590 utterances in total. 1331 of these utterances contain commands to the WoZ agent; hence, they are associated with passenger intents. Utterance-level intent and word-level slot annotations are obtained on the transcribed utterances by majority voting of 3 annotators. The annotation results for utterance-level intent types, slots, and intent keywords can be found in Table 1 and Table 2.

Methodology
We explored leveraging multimodality for the Natural Language Understanding (NLU) module in the Spoken Dialogue System (SDS) pipeline. As our AMIE in-cabin dataset has audio and video recordings, we investigated three modalities for the NLU: text, audio, and visual.
Our model (H-Joint-2) is a 2-level hierarchical joint learning model that first detects/extracts intent keywords & slots using sequence-to-sequence Bi-LSTMs (Level-1); then, only the words predicted as intent keywords & valid slots are fed into the Joint-2 model (Level-2), another sequence-to-sequence Bi-LSTM network for utterance-level intent detection, jointly trained with slots & intent keywords.
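To illustrate the data flow of this 2-level hierarchy, the sketch below replaces the trained Bi-LSTM taggers with hypothetical lexicon-based stubs (the word lists and intent labels are illustrative, not the trained models): Level-1 tags tokens, and only non-"O" tokens reach the Level-2 intent classifier.

```python
# Structural sketch of the 2-level hierarchical joint pipeline (H-Joint-2).
# The actual model uses seq2seq Bi-LSTMs at both levels; here, toy stub
# functions stand in for the trained networks to show the data flow only.

def level1_tag(tokens):
    """Level-1: tag each token as an intent keyword, a valid slot, or O."""
    # Hypothetical toy lexicons standing in for the Bi-LSTM tagger's outputs.
    keyword_tags = {"stop": "INTENT_KW", "pull": "INTENT_KW", "over": "INTENT_KW"}
    slot_tags = {"here": "B-position", "corner": "B-location"}
    return [(t, keyword_tags.get(t, slot_tags.get(t, "O"))) for t in tokens]

def level2_intent(filtered_tokens):
    """Level-2: utterance-level intent from the filtered tokens only."""
    # Stand-in for the second Bi-LSTM, jointly trained in the real model.
    return "pull-over" if "pull" in filtered_tokens else "other"

def h_joint_2(utterance):
    tokens = utterance.lower().split()
    tagged = level1_tag(tokens)
    # Only tokens predicted as intent keywords or valid slots reach Level-2.
    filtered = [t for t, tag in tagged if tag != "O"]
    return level2_intent(filtered)

print(h_joint_2("Could you pull over right here"))  # pull-over
```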
This architecture was chosen based on the best-performing uni-modal results presented in previous work (Okur et al., 2019) for utterance-level intent recognition and slot filling on our AMIE dataset. These initial uni-modal results were obtained on the transcribed text with pre-trained GloVe word embeddings (Pennington et al., 2014).
In this study, we explore the following multimodal features to better assess passenger intents for conversational agents in self-driving cars: word embeddings for text, speech embeddings and acoustic features for audio, and visual features for the video modality.

Word and Speech Embeddings
We incorporated pre-trained speech embeddings, called Speech2Vec, as additional audio-related features. These Speech2Vec embeddings (Chung and Glass, 2018) are trained on a corpus of 500 hours of speech from LibriSpeech. Speech2Vec can be considered a speech version of Word2Vec embeddings (Mikolov et al., 2013), where the idea is that learning representations directly from speech can capture information carried by speech that may not exist in plain text.
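A minimal sketch of how such speech embeddings can serve as additional per-token features alongside text embeddings, assuming simple vector concatenation with a zero-vector fallback for out-of-vocabulary tokens (the lookup tables and dimensions below are illustrative, not the pre-trained models):

```python
import numpy as np

# Sketch: fuse word-level text (GloVe-style) and Speech2Vec vectors per
# token by concatenation; dimensions are illustrative placeholders.
GLOVE_DIM, S2V_DIM = 100, 50
glove = {"stop": np.random.rand(GLOVE_DIM)}        # hypothetical lookup table
speech2vec = {"stop": np.random.rand(S2V_DIM)}     # hypothetical lookup table

def embed(token):
    g = glove.get(token, np.zeros(GLOVE_DIM))      # zero vector if OOV
    s = speech2vec.get(token, np.zeros(S2V_DIM))
    return np.concatenate([g, s])                  # 150-dim fused word vector

print(embed("stop").shape)  # (150,)
```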

Visual Features
Intermediate CNN features are extracted from each video clip segmented per utterance from the AMIE dataset. Using the feature extraction process described in (Kordopatis-Zilos et al., 2017), one frame per second is sampled from any given input video clip. Then, its visual descriptors are extracted from the activations of the intermediate convolution layers of a pre-trained CNN. We used the pre-trained Inception-ResNet-v2 model (Szegedy et al., 2016) and generated 4096-dim features for each sample (per utterance). We experimented with utilizing two sources of visual information: (i) cabin/passenger view from the back-driver RGB camera recordings, (ii) road/outside view from the dash-cam RGB video streams.
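The per-utterance visual pipeline can be sketched as follows; the CNN call is stubbed (a stand-in for intermediate Inception-ResNet-v2 activations), and the assumed camera frame rate and the average-pooling over sampled frames are our illustrative assumptions, not details stated above:

```python
import numpy as np

# Sketch of the per-utterance visual feature pipeline: sample one frame
# per second, extract per-frame CNN descriptors (stubbed), and pool them
# into a single 4096-dim utterance-level vector.
FPS, FEAT_DIM = 30, 4096  # FPS is an assumed camera rate; 4096-dim as above

def sample_one_per_second(frames, fps=FPS):
    return frames[::fps]  # keep every fps-th frame, i.e., one per second

def cnn_features(frame):
    # Stand-in for intermediate-layer activations of Inception-ResNet-v2.
    rng = np.random.default_rng(0)
    return rng.random(FEAT_DIM)

def utterance_visual_features(frames):
    sampled = sample_one_per_second(frames)
    descs = np.stack([cnn_features(f) for f in sampled])
    return descs.mean(axis=0)  # average-pool frames (pooling is assumed)

clip = [np.zeros((224, 224, 3)) for _ in range(90)]  # toy 3-second clip
print(utterance_visual_features(clip).shape)  # (4096,)
```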

Experimental Results
Performance results of the utterance-level intent recognition models with various modality and feature concatenations can be found in Table 3, using hierarchical joint learning (H-Joint-2). For the text and speech embedding experiments, we observe that Word2Vec and Speech2Vec representations achieve comparable F1-score performances, which are significantly below the GloVe embeddings performance. That was expected, as the pre-trained Speech2Vec vectors have lower vocabulary coverage than the GloVe vectors. On the other hand, we observe that concatenating GloVe + Speech2Vec embeddings, and further GloVe + Word2Vec + Speech2Vec, yields higher F1-scores for intent recognition. These results show that the speech embeddings can indeed capture useful semantic information carried by speech alone, which may not exist in plain text.
We also investigate incorporating the audio-visual features on top of the text-only and text + speech embedding models. Including openSMILE/IS10 acoustic features from audio as well as intermediate CNN/Inception-ResNet-v2 features from video brings slight improvements to our intent recognition models, achieving a 0.92 F1-score. These initial results may require further exploration for specific intents such as stop (e.g., audio intensity & loudness could have helped), or for relevant slots such as passenger gesture/gaze (e.g., cabin-view features) and outside objects (e.g., road-view features).
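The simple feature-level fusion described above amounts to concatenating utterance-level vectors per modality before the intent classifier; the sketch below assumes a 300-dim pooled text vector (illustrative) alongside the 1582-dim openSMILE IS10 acoustic set and the 4096-dim CNN visual features:

```python
import numpy as np

# Sketch of simple feature-level (early) fusion: concatenate utterance-level
# text, acoustic, and visual vectors into one input for the intent model.
text_vec = np.random.rand(300)    # e.g., pooled word embeddings (assumed dim)
audio_vec = np.random.rand(1582)  # openSMILE IS10 utterance-level feature set
video_vec = np.random.rand(4096)  # intermediate CNN (Inception-ResNet-v2)

fused = np.concatenate([text_vec, audio_vec, video_vec])
print(fused.shape)  # (5978,)
```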

Conclusion and Future Work
In this work, we briefly present our initial explorations towards the multimodal understanding of passenger utterances in autonomous vehicles. We show that our experimental results outperform the uni-modal text-only baseline results; with multimodality, we achieved improved performance for passenger intent detection in AVs. This ongoing research has the potential to explore real-world challenges of human-vehicle-scene interactions for autonomous driving support via spoken utterances.
There exist various exciting recent works on improved multimodal fusion techniques (Zadeh et al., 2018; Liang et al., 2019a; Pham et al., 2019; Baltrušaitis et al., 2019). In addition to the simplified feature and modality concatenations, we plan to explore some of these promising tensor-based multimodal fusion networks (Liu et al., 2018; Liang et al., 2019b; Tsai et al., 2019) for more robust intent classification on the AMIE dataset as future work.
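As a minimal sketch of one such family of methods, tensor-based fusion forms the outer product of per-modality vectors, each extended with a constant 1, so the fused tensor contains unimodal, bimodal, and trimodal interaction terms (dimensions below are tiny and illustrative; practical variants factorize this tensor for efficiency):

```python
import numpy as np

# Sketch of tensor fusion: extend each modality vector with a constant 1,
# then take the 3-way outer product. The resulting tensor contains all
# unimodal, pairwise, and triple interaction terms across modalities.
def tensor_fuse(text, audio, video):
    t = np.append(text, 1.0)
    a = np.append(audio, 1.0)
    v = np.append(video, 1.0)
    return np.einsum("i,j,k->ijk", t, a, v)

z = tensor_fuse(np.random.rand(3), np.random.rand(4), np.random.rand(5))
print(z.shape)  # (4, 5, 6)
```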

Figure 1: AMIE In-cabin Data Collection Setup
Table 1: AMIE In-cabin Dataset Statistics: Intents

Table 3: F1-scores of Intent Recognition with Multimodal Features