Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations

Spoken language understanding tasks usually rely on pipelines involving complex processing blocks such as voice activity detection, speaker diarization and automatic speech recognition (ASR). We propose a novel framework for predicting utterance-level labels directly from speech features, thus removing the dependency on first generating transcripts and enabling transcription-free behavioral coding. Our classifier uses a pretrained Speech-2-Vector encoder as a bottleneck to generate word-level representations from speech features. This pretrained encoder learns to encode speech features for a word using an objective similar to Word2Vec. Our proposed approach uses only speech features and word segmentation information to predict spoken utterance-level target labels. We show that our model achieves results competitive with state-of-the-art approaches that use transcribed text for the task of predicting psychotherapy-relevant behavior codes.


Introduction
Speech interfaces have seen widely growing adoption, which has brought increasing interest in advancing computational approaches to spoken language understanding (SLU) (Tur and De Mori, 2011; Xu and Sarikaya, 2014; Yao et al., 2013; Ravuri and Stolcke, 2015). SLU systems often rely on automatic speech recognition (ASR) to generate lexical features; the ASR output is then used for the target natural language understanding task. Furthermore, end-2-end systems for various speech applications, including speech synthesis (Oord et al., 2016), ASR (Amodei et al., 2016; Chan et al., 2016; Soltau et al., 2016) and speech-2-text translation (Chung et al., 2019), have shown promising results. Recently, (Haque et al., 2019) proposed a method for learning audio-linguistic embeddings, but it too depends on transcribed text. Due to the nature of the speech processing pipeline, natural language understanding tasks suffer from two major problems: 1) error propagation through ASR, leading to noisy lexical features, and 2) loss of rich information that supplements lexical features, such as prosodic and acoustic expressive speech patterns.
In this paper, we propose a framework to address the problem of predicting behavior codes directly from speech utterances. We focus on data from Motivational Interviewing (MI) sessions, a type of talk-based psychotherapy focused on behavior change. In psychology research and clinical practice, behavioral coding is often used to understand process mechanisms, therapy efficacy and outcomes. Behavior codes are annotated by an expert at an utterance level (or interaction level) by listening to the session. Examples of utterance-level behavior codes include whether the therapist gave a simple or complex reflection of their patient's previous utterance(s). Several approaches have been proposed for automatic prediction of behavior codes, mainly using lexical and/or linguistic features such as information from dependency trees (Xiao et al., 2016; Tanana et al., 2016; Pérez-Rosas et al., 2017; Cao et al., 2019). Recent works (Singla et al., 2018; Chen et al., 2019) reveal that using acoustic and prosodic features in addition to lexical features outperforms single-modality models.
Speech2Vec (Chung and Glass, 2018) has shown that high-quality word representations can be learned using only speech features. It learns word representations in an unsupervised manner using a sequence-to-sequence framework and an objective similar to the Skipgram objective of Word2Vec (Mikolov et al., 2013): a word representation should be predictive of its context words. However, Speech2Vec only aims to learn word representations, which are averaged spoken-word representations of each word in the corpus. Our proposed approach instead uses the speech-signal-to-word encoder, learned with an architecture similar to Speech2Vec, to produce lower-level dynamic word representations for the utterance classifier. Thus, our system never needs to know the identity of a word, only the word segmentation information. We hypothesize that word segmentation can be obtained with cheaper tools, e.g., a supervised word segmentation system (Tsiartas et al., 2009) or a heuristics-based system built on acoustic and prosodic cues (Junqua et al., 1994; Iwano and Hirose, 1999). We plan to investigate the effect of noise in word boundaries on encoder quality in the future.
Our end-2-end transcription-free approach is similar to, and partly motivated by, several previous works. Some works (Serdyuk et al., 2018; Lugosch et al., 2019) perform prediction tasks directly from speech signals but do not capture the underlying linguistic structure of a language (sentences break into words that carry semantics). We believe capturing important linguistic units (e.g., words) is important for spoken language understanding. (Qian et al., 2017) is most similar to our work in terms of overall architecture, as they also first obtain word-level representations and then use the encoder for utterance-level prediction. However, (Qian et al., 2017) uses transcribed words, whereas we use only word boundaries for ASR-free end-2-end spoken language understanding. As shown in Figure 1, most previous works follow the upper pipeline: they start with a transcript (manually generated or produced by an ASR), which is first segmented into utterances; they then use word embeddings for each word in the transcript before feeding it into a classifier to predict target behavior codes.
Our approach shows competitive results when compared to state-of-the-art models that use transcribed text. Our target application domain in this work is psychotherapy. While utterance-level behavior coding is a valuable resource for psychotherapy process research, it is also labor intensive to annotate manually. Our proposed method, which does not rely on transcripts, should enable cheaper and faster behavioral annotation. We believe this framework is a promising direction for performing classification tasks directly on spoken utterances.

Our Approach
We first learn a word-level speech-signal-to-word encoder using a sequence-to-sequence framework. This Speech-2-Vector encoder follows a learning objective similar to the Skipgram architecture of Word2Vec. We then use the pretrained encoder to predict behavior codes.

Speech signal to word encoder
Our speech-signal-to-word encoder (SSWE) is an adaptation of Speech2Vec (Chung and Glass, 2018), which in turn is motivated by Word2Vec's Skipgram architecture. The model learns to predict context words given a word; but unlike Word2Vec, in SSWE each word is represented by a sequence of speech frames. We adopt the widely known sequence-to-sequence architecture to generate context words given a spoken word. Our model generates speech features for the context words (Y_{n-4}, Y_{n-3}, ..., Y_{n+4}) given the speech features of a word X_n. As input for word X_n, it takes K x 13-dimensional MFCC features extracted from every 25 ms window of speech audio with a 10 ms frame shift, where K is the maximum number of frames a spoken word can have. This input is processed through a bidirectional LSTM layer (Hochreiter and Schmidhuber, 1997) to generate the context vector C. C is then used by a unidirectional LSTM decoder to generate the speech features for the words in context (Y_{n-4}, Y_{n-3}, ..., Y_{n+4}). We optimize the model by minimizing the mean squared loss between predicted and target outputs:

L = Σ_{1 ≤ |k| ≤ 4} ||X_{n+k} − Y_{n+k}||²

Following this approach, our system never uses any form of explicit transcription for learning the encoder, only the word boundaries. Figure 2 gives a pictorial description of this process.
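As a concrete illustration of this training objective (a minimal sketch, not the authors' implementation; the window size of 4 and MFCC shapes follow the description above), the reconstruction loss over a context window can be computed as:

```python
import numpy as np

def skipgram_mse_loss(pred_context, target_context):
    """Mean squared error between predicted and target MFCC sequences
    for all context words in the window around the center word X_n."""
    # pred_context / target_context: lists of (frames, 13) MFCC arrays,
    # one per context position (up to 4 words on each side).
    loss = 0.0
    for pred, target in zip(pred_context, target_context):
        loss += np.mean((pred - target) ** 2)
    return loss / len(pred_context)

# toy example: window of 8 context words, each 20 frames x 13 MFCCs
rng = np.random.default_rng(0)
targets = [rng.standard_normal((20, 13)) for _ in range(8)]
preds = [t + 0.1 for t in targets]  # predictions off by a constant
loss = skipgram_mse_loss(preds, targets)  # ~0.01
```

In the actual model the predictions come from the LSTM decoder conditioned on the context vector C; here they are stand-ins to show the loss computation.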
Our Speech-2-Vector encoder is trained using a speech corpus and word segmentation information. In our setup, we assume we have high-quality word segmentation information. For the purpose of our experiments, we obtain the word segmentation using a forced aligner (Ochshorn and Hawkins, 2016); it uses transcripts, but we only use it for word segmentation and plan to replace it with other tools. The forced aligner gives boundaries for the start and end of each word, which are then used to extract the speech features for that word. We hypothesize that learning word segmentation is a cheaper task than training a full-blown ASR.

Figure 3 shows a pictorial view of our utterance classifier. Given a word-segmented utterance, we first process the speech features of each word to obtain word-level representations W = (W_1, ..., W_n). We then learn a function c = f(W) that maps W to a behavior code c ∈ {1, 2, ..., C}, with C being the number of defined target code types. We use a parametric composition model to construct utterance-level embeddings from word-level embeddings. The word-level representations (W_1, ..., W_n) are fed into a bidirectional LSTM layer to contextualize the word embeddings. The contextualized word embeddings are then fed to a self-attention layer to get a sentence representation S, which is used to predict the behavior code for the utterance via a dense layer that projects it to C dimensions with a softmax operation. We use a self-attention mechanism similar to the one proposed in (Yang et al., 2016).
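The attention-pooling step above can be sketched as follows (following the additive self-attention of Yang et al., 2016; dimensions and names are illustrative, not the authors' code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, W_att, b, v):
    """Collapse contextualized word vectors H (n_words x d) into one
    utterance vector S via additive self-attention (Yang et al., 2016)."""
    U = np.tanh(H @ W_att + b)   # (n_words, d_att) hidden representation
    scores = U @ v               # (n_words,) alignment scores
    alpha = softmax(scores)      # attention weights, sum to 1
    S = alpha @ H                # (d,) weighted sum = utterance embedding
    return S, alpha

rng = np.random.default_rng(0)
n_words, d, d_att = 6, 200, 100  # 200-dim word vectors from the encoder
H = rng.standard_normal((n_words, d))
W_att = rng.standard_normal((d, d_att)) * 0.01
b = np.zeros(d_att)
v = rng.standard_normal(d_att) * 0.01
S, alpha = attention_pool(H, W_att, b, v)
```

The utterance vector S would then be projected to C logits by the dense layer and passed through a softmax to predict the behavior code.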

Dataset
We experiment with two datasets for training the Speech-2-Vector encoder: first, the LibriSpeech corpus (Panayotov et al., 2015) (a 500-hour subset of broadband speech produced by 1,252 speakers), and second, our classifier training data directly, which we describe below.
For classification, we use data from Motivational Interviewing sessions (a type of talk-based psychotherapy) for addiction treatment, presented in (Tanana et al., 2016; Pérez-Rosas et al., 2017). There are 337 transcribed sessions (approx. 160 hours of audio) coded by experts at the utterance level with behavioral labels following the Motivational Interviewing Skill Code (MISC) manual (Miller et al., 2003). Each human coder segmented talk turns into utterances (i.e., complete thoughts) and assigned one code per utterance for all utterances in a session. The majority of sessions were coded once by one of three expert coders.
In this paper, we use the strategy proposed by (Xiao et al., 2016), grouping all counselor codes into 8 categories (described in Table 1). We remove backchannels without timestamps, which cannot be aligned, and split the data into training and testing sets by session with a roughly 2:1 ratio. This split is consistent across all compared works.
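A session-level split of this kind (keeping whole sessions on one side of the split so that no session or speaker straddles train and test) might look like the following sketch (illustrative only; the actual split is fixed to match prior work):

```python
import random

def split_by_session(utterances, ratio=2/3, seed=0):
    """utterances: list of (session_id, utterance, code) tuples.
    Assigns whole sessions to train or test (roughly ratio : 1-ratio)."""
    sessions = sorted({s for s, _, _ in utterances})
    random.Random(seed).shuffle(sessions)
    cut = round(len(sessions) * ratio)
    train_ids = set(sessions[:cut])
    train = [u for u in utterances if u[0] in train_ids]
    test = [u for u in utterances if u[0] not in train_ids]
    return train, test

# toy: 6 sessions with 3 utterances each -> 4 train sessions, 2 test
data = [(s, f"utt{i}", "FN") for s in range(6) for i in range(3)]
train, test = split_by_session(data)
```

Splitting by session rather than by utterance avoids the speaker overlap discussed later in the end-2-end fine-tuning results.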

Training details
Speech-2-Vector Encoder: We implemented the model with PyTorch (Paszke et al., 2017). Similar to (Chung and Glass, 2018), we adopted an attention mechanism that enables the decoder to condition every decoding step on the last hidden state of the encoder (Subramanian et al., 2018). The window size was set to 4. We train the model using stochastic gradient descent (SGD) with a learning rate of 1e-3 and a batch size of 64 (spoken-word, context) pairs. We experimented with hyperparameter combinations: bidirectional vs. unidirectional RNNs, GRU vs. LSTM cells, number of LSTM hidden layers, and learning rates. We found no large difference in encoder output quality with higher dimensions. Therefore, we use a 50-dimensional LSTM cell; the resulting encoder output is 100 (bidirectional last hidden states) + 100 (cell states) = 200 dimensions.
Utterance Classifier: The batch size was 40 utterances. The LSTM hidden state dimension is 50. We use dropout at the embedding layer with drop probability 0.3. The dense layer has 100 dimensions. The model is trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and an exponential decay of 0.98 after 10K steps (1 step = 40 utterances). Similar to prior work, we weight each sample according to its normalized inverse frequency ratio.

Table 2: Using word embeddings learnt from speech features (Speech2Vec) vs. Word2Vec. * marks that the model was only fine-tuned on in-domain data. † marks that these classifiers were trained end-2-end.
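The inverse-frequency sample weighting mentioned above can be sketched as follows (one plausible reading of "normalized inverse frequency ratio"; the exact normalization is not specified in the text):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by the inverse of its frequency, normalized so
    the weights sum to the number of classes."""
    counts = Counter(labels)
    n = len(labels)
    inv = {c: n / k for c, k in counts.items()}  # inverse frequency
    z = sum(inv.values())
    return {c: len(counts) * w / z for c, w in inv.items()}

# toy: a frequent behavior code vs. a rare one (hypothetical labels)
weights = inverse_frequency_weights(["FN"] * 90 + ["CR"] * 10)
# the rare class receives 9x the weight of the frequent one
```

Such weights counteract the heavy class imbalance typical of behavior codes, where a few codes dominate the sessions.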

Experiments & Results
Speech2Vec vs. Word2Vec: Table 2 shows results comparing the system's performance with lexically derived word embeddings (Word2Vec) vs. speech-feature-derived word embeddings (Speech2Vec). If a word appears in a corpus n times, Speech2Vec encodes each occurrence with a system similar to our Speech-2-Vector encoder and averages the n representations to obtain a single embedding for that dictionary word. The results confirm two main observations: 1) it is better to learn or fine-tune the word embeddings on an in-domain dataset; 2) Speech2Vec, which learns word embeddings from the different spoken variations of a word, provides better results for behavior code prediction. This is consistent with findings from (Singla et al., 2018; Chen et al., 2019), which show that acoustic-prosodic information can provide complementary information for predicting behavior codes and hence produce better results. One challenge is that SSWE and Speech2Vec generally need large amounts of transcribed data to learn high-quality word embeddings. Therefore, we first train SSWE on a general speech corpus (here, LibriSpeech (Libre)) before fine-tuning it on our classifier training data (results with * show this experiment).

Transcriptions vs. No Transcriptions: The methods discussed above still rely on transcriptions to know the identity of each word. Our proposed method, in contrast, uses no explicit transcription, only the word segmentation information. Results in Table 3 show that using a pretrained Speech-2-Vector encoder as a building block for word representations yields results competitive with methods that rely heavily on first generating transcripts of the spoken utterance. Here we also compare our model to the multimodal approach proposed by (Singla et al., 2018; Chen et al., 2019), which uses word-level prosodic features along with lexical word embeddings; Prosodic and Word2Vec+Prosodic † show results for this system.
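The averaging step described above (Speech2Vec pools all spoken instances of a word into a single embedding) can be sketched as follows (hypothetical names; in the real pipeline the vectors would be encoder outputs for each spoken occurrence):

```python
import numpy as np
from collections import defaultdict

def average_word_embeddings(instances):
    """instances: list of (word, vector) pairs, one per spoken occurrence.
    Returns one embedding per dictionary word by averaging occurrences."""
    buckets = defaultdict(list)
    for word, vec in instances:
        buckets[word].append(vec)
    return {w: np.mean(vs, axis=0) for w, vs in buckets.items()}

# two spoken variants of "yes", one of "change" (2-d toy vectors)
emb = average_word_embeddings([
    ("yes", np.array([1.0, 0.0])),
    ("yes", np.array([0.0, 1.0])),
    ("change", np.array([2.0, 2.0])),
])
```

This is exactly where our approach differs: we keep the per-occurrence (dynamic) representations instead of the averaged ones, so the word identity is never needed.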
Table 3 also shows that end-2-end training (results with †), where our Speech-2-Vector encoder is also updated by the classifier loss, generates poor results. We hypothesize that this is because our behavior code prediction data was split to minimize speaker overlap; it thus becomes easy to overfit to speaker-related properties during fine-tuning instead of generalizing to the behavior code prediction task.

Conclusions
We show that comparable results can be achieved for behavior code prediction using only speech features, without any ASR or human transcriptions. Our approach still depends on word segmentation information; however, we believe obtaining word segmentation from speech is comparatively easier than building a high-quality ASR. The evaluation results show the practical significance of end-2-end speech-to-behavioral-coding for psychotherapy conversations. This allows building systems that do not involve explicit transcriptions, an attractive option for privacy reasons when the end goal (as determined by the behavior codes) is to characterize the overall quality of the clinical encounter for training or quality assurance.

Future work
The results still vary and are worse than those obtained using human transcriptions. We plan a detailed analysis along two lines: 1) whether the proposed modeling technique can help bridge the gap between predicted and human annotations, and 2) the effect of environment variables, e.g., background noise, speaker characteristics, and different languages. We believe our approach can benefit from straightforward architectural modifications, such as using convolutional neural networks, which have been shown to perform better at handling time-continuous data like speech.