Contextualized Representations for Low-resource Utterance Tagging

Utterance-level analysis of the speaker’s intentions and emotions is a core task in conversational understanding. Depending on the end objective of the conversational understanding task, different categorical dialog-act or affect labels are expertly designed to cover specific aspects of the speakers’ intentions or emotions respectively. Accurately annotating with these labels requires a high level of human expertise, and thus applying this process to a large conversation corpus or new domains is prohibitively expensive. The resulting paucity of data limits the use of sophisticated neural models. In this paper, we tackle these limitations by performing unsupervised training of utterance representations from a large corpus of spontaneous dialogue data. Models initialized with these representations achieve competitive performance on utterance-level dialogue-act recognition and emotion classification, especially in low-resource settings encountered when analyzing conversations in new domains.


Introduction
Spontaneous human conversations have been collected in different domains to support research in data-driven dialogue systems (Serban et al., 2015), affective computing (Zadeh et al., 2018; Busso et al., 2008; Park et al., 2014), clinical psychology (Althoff et al., 2016) and tutoring systems (Sinha et al., 2015). These conversations are analyzed by segmenting transcriptions into each speaker's utterances (Traum and Heeman, 1996), which are often labeled with different types of information. The exact type of label depends on the downstream task or the research questions to be answered, and thus the tagging paradigms are varied and numerous. For example, the speaker's intention can be specified using dialogue acts (DAs) or speech acts (Searle and Searle, 1969), which capture the pragmatic or semantic function of the utterance. Utterances may also be tagged with traits such as sentiment, emotion and valence labels (Busso et al., 2008; Zadeh et al., 2018), speaker persuasiveness (Park et al., 2014), speaker dominance (Busso et al., 2008) and other characteristics at the utterance and conversational level. While these labels vary greatly, one constant is that they are often ambiguous and context-dependent (Table 1), making it challenging for humans to annotate efficiently and accurately. Thus, curating large corpora is labor-intensive, and we are always faced with a paucity of data in new domains and labeling paradigms of interest.
Moreover, the label assigned to an utterance depends on the current state of the dialogue (Stone, 2005), and prediction of an utterance's label benefits from referring to other utterances in context and their labels (Jaiswal et al., 2019). Deep learning models like RNNs and CNNs have proven effective tools to encode neighbouring utterances (Chen et al., 2018; Liu et al., 2017; Blunsom and Kalchbrenner, 2013; Bothe et al., 2018; Kumar et al., 2017). However, such models rely on large annotated corpora that are prohibitively expensive to procure, especially for niche domains.
One recently popular method to overcome the dearth of supervised data in NLP is unsupervised pretraining over large unlabeled corpora. For example, Melamud et al. (2016), Peters et al. (2018) and Devlin et al. (2018) use language modeling as an unsupervised task to learn word embeddings in context, and demonstrate remarkable improvements on a number of downstream NLP tasks. However, these methods learn representations for individual words, whereas for dialogue analysis tasks we need representations for utterances in the context of the entire dialogue.
In this paper, we adapt the technique of learning contextualized representations using unsupervised pretraining to learn representations for utterances in the context of the dialogue. We first introduce a general model architecture consisting of a token, utterance, and conversation encoder. We then present a method to efficiently train this model by predicting the bag-of-word vectors of previous and next utterances over a large heterogeneous corpus of spoken dialogue transcripts. We quantify the effectiveness of the learnt contextual utterance representations on two downstream utterance-labeling tasks: DA tagging and emotion recognition. We obtain competitive performance on two popular DA tagging tasks (SwitchBoard and ICSI Meeting Recorder) and an emotion labeling task (IEMOCAP). In particular, we observe significant improvements over training complex utterance tagging models from scratch, both in simulated low-resource settings for these tasks and on considerably smaller DA datasets such as LEGO and Map Task.

Methodology
We consider a large collection of conversations, where each conversation $C$ is an ordered list of $N$ utterances, $C = \{u_1, u_2, ..., u_N\}$, and each utterance is a list of tokens, $u_i = \{w_1, w_2, ..., w_{|u_i|}\}$. Conversations may also have labels for every utterance: $Y = \{y_1, y_2, ..., y_N\}$, where each $y_i \in T$, a finite set of labels expertly defined for a domain.

Contextualized Utterance Representations
We adopt a hierarchical encoder model consisting of a token encoder, an utterance encoder and a conversation encoder, followed by an output layer. The token encoder layer $\mathrm{ENC}_{tok}$ encodes every token $w_j$ in utterance $u_i$ into a fixed-size embedding $e^{tok}_{w_j}$, while the utterance encoder $\mathrm{ENC}_{utt}$ encodes the token embeddings of an utterance $u_i$ into a fixed-size utterance representation $e^{utt}_{u_i}$. For our specific instantiation, we combine both encoders: we use the pretrained ELMo (Peters et al., 2018) model to encode the sequence of tokens in an utterance $u_i$ and take the final states of the forward and backward LSTMs (concatenated) as our utterance representation $e^{utt}_{u_i}$, $i = 1, 2, ..., N$. We specifically choose ELMo because it is a strong general-purpose encoder and its character-based representations may be more robust to noise and OOV words in spontaneous conversations. This is followed by a conversation encoder $\mathrm{ENC}_{conv}$, which further converts this sequence of context-independent utterance representations into a sequence of context-dependent utterance representations. For $\mathrm{ENC}_{conv}$, we use an architecture identical to the decoder variant of the Transformer (Vaswani et al., 2017) with 2 layers. We specifically choose the self-attentional Transformer for this purpose, as it is efficient to train, can easily capture long-distance dependencies over the entire conversation, and empirically outperformed other architectures such as LSTMs in preliminary experiments. The outputs $h_{u_i}$, $i = 1, 2, ..., N$, of this hierarchical encoder (Figure 1) can be used as contextualized representations for utterances.
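Predicting Utterance Bag-of-words. In order to learn contextualized representations, the hierarchical encoder is trained to predict the bag-of-words of the previous and next utterances in the conversation using these representations. This training is done in the forward and backward directions by allowing the self-attention layer of the Transformer to attend only to earlier positions and only to later positions in the utterance sequence, respectively (Figure 1). Hence, we learn contextual utterance embeddings in both directions: $\overrightarrow{h}_{u_i}$ and $\overleftarrow{h}_{u_i}$, $i = 1, 2, ..., N$. We use an MLP followed by a sigmoid function as the output layer over $\overrightarrow{h}_{u_i}$ and $\overleftarrow{h}_{u_i}$ to predict the set of words in the neighboring utterance: $u_{i-1}$ is reconstructed from $\overleftarrow{h}_{u_i}$ and $u_{i+1}$ from $\overrightarrow{h}_{u_i}$. We use binary cross entropy (BCE) loss, where the target is a vocabulary-sized binary vector with words present in the utterance marked 1 and others 0. Notably, this formulation reduces training time by relaxing word order in the reconstruction loss, unlike other methods that predict the words of surrounding utterances in order (Kiros et al., 2015). For utterances $u_{i-1}$ and $u_{i+1}$ with vocabulary vectors $U_{i-1}$ and $U_{i+1} \in \{0, 1\}^{|V|}$ respectively, the bag-of-word loss for utterance $u_i$ is given by:

$$\mathcal{L}_{bow}(u_i) = \mathrm{BCE}\left(U_{i-1}, \overleftarrow{p}_i\right) + \mathrm{BCE}\left(U_{i+1}, \overrightarrow{p}_i\right) \quad (1)$$

where

$$\overleftarrow{p}_i = \sigma\left(\mathrm{MLP}(\overleftarrow{h}_{u_i})\right), \quad \overrightarrow{p}_i = \sigma\left(\mathrm{MLP}(\overrightarrow{h}_{u_i})\right), \quad \mathrm{BCE}(U, p) = -\sum_{k=1}^{|V|} \left[ U_k \log p_k + (1 - U_k) \log (1 - p_k) \right] \quad (2)$$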
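The bag-of-words objective described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`bow_target`, `bce`, `direction_mask`, `bow_loss`) are hypothetical, the predicted probabilities are passed in directly instead of being produced by an MLP, and two details are assumptions not stated in the text, namely that each position may attend to itself and that a missing neighbor at a conversation boundary simply contributes no loss.

```python
import math

def bow_target(utterance, vocab):
    """Vocabulary-sized binary vector: 1.0 where the word occurs in the utterance."""
    present = set(utterance)
    return [1.0 if word in present else 0.0 for word in vocab]

def bce(target, probs, eps=1e-12):
    """Binary cross entropy summed over the vocabulary (probabilities clipped by eps)."""
    return -sum(
        t * math.log(max(p, eps)) + (1.0 - t) * math.log(max(1.0 - p, eps))
        for t, p in zip(target, probs)
    )

def direction_mask(n, forward=True):
    """mask[i][j] is True when utterance i may attend to utterance j.

    Forward: earlier positions (and, by assumption, i itself).
    Backward: later positions (and, by assumption, i itself).
    """
    if forward:
        return [[j <= i for j in range(n)] for i in range(n)]
    return [[j >= i for j in range(n)] for i in range(n)]

def bow_loss(i, conversation, vocab, fwd_probs, bwd_probs):
    """Loss for utterance i: the forward state predicts u_{i+1}, the backward state u_{i-1}.

    Assumption: a neighbor that does not exist (conversation boundary) is skipped.
    """
    loss = 0.0
    if i + 1 < len(conversation):
        loss += bce(bow_target(conversation[i + 1], vocab), fwd_probs)
    if i - 1 >= 0:
        loss += bce(bow_target(conversation[i - 1], vocab), bwd_probs)
    return loss
```

Because the target ignores word order, each neighboring utterance costs a single vocabulary-sized prediction rather than one prediction per token, which is the source of the training-time saving noted above.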

Utterance Tagging
Once we have learned contextualized utterance representations, we can use them to predict the sequence of labels $Y = \{y_1, y_2, ..., y_N\}$, such as dialogue acts, for the utterances in the conversation. In this work we use a linear-chain conditional random field (Lafferty et al., 2001), as used in previous state-of-the-art models for DA tagging (Kumar et al., 2017; Chen et al., 2018), to predict one of the $|T|$ tags for each $u_i$, where the utterance is represented as the concatenation of the forward and backward contextualized vectors:

$$h_{u_i} = \left[\overrightarrow{h}_{u_i}; \overleftarrow{h}_{u_i}\right] \quad (3)$$
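Decoding with a linear-chain CRF amounts to finding the highest-scoring tag sequence given per-utterance tag scores and tag-to-tag transition scores. The sketch below shows standard Viterbi decoding in pure Python; the function name `viterbi` and the score layout are illustrative assumptions, and in the model described here the emission scores would come from a layer over the concatenated utterance vectors.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence under a linear-chain CRF.

    emissions[i][t]   : score of tag t for utterance i
    transitions[s][t] : score of moving from tag s to tag t
    """
    n_tags = len(transitions)
    scores = list(emissions[0])      # best score of any path ending in each tag
    backpointers = []                # one list of best previous tags per step
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for t in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda s: scores[s] + transitions[s][t])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][t] + emit[t])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda t: scores[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

The transition scores let the label of one utterance influence its neighbors: a strong penalty on switching tags can override a weak local preference, which a per-utterance classifier cannot do.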

Experiments

Pretraining Datasets and Hyperparameters
We train contextualized utterance representations on transcriptions of spontaneous human-human conversation corpora (Serban et al., 2015). We choose the corpora presented in Table 2 for this work. A majority of the conversations are dialogues, and utterances across all corpora are 10 words long on average. However, the chosen corpora have conversations of widely varying lengths (number of utterances per conversation). For computation and memory efficiency, and also because more distant utterances likely have diminishing influence on discourse modeling, we divide each conversation into conversational snippets of length 64 (a tuned model hyperparameter) by moving a 64-length window over the conversation with stride 1, and train the bag-of-word loss on each snippet thus obtained. For the conversational encoder, we use 2 layers of the Transformer with 8 attention heads of 64 dimensions each. All feedforward networks use 2 layers with a hidden size of 512. For training and fine-tuning, we use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.0001.
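The windowing step can be sketched as follows. The function name `snippets` is hypothetical, and the treatment of conversations shorter than the window (kept whole as a single snippet) is an assumption, since the text does not say how they are handled.

```python
def snippets(conversation, window=64, stride=1):
    """Split a conversation (a list of utterances) into overlapping snippets.

    Assumption: a conversation no longer than the window is returned whole.
    """
    if len(conversation) <= window:
        return [list(conversation)]
    last_start = len(conversation) - window
    return [conversation[start:start + window]
            for start in range(0, last_start + 1, stride)]
```

With stride 1, a conversation of 100 utterances yields 100 - 64 + 1 = 37 snippets, so long conversations contribute many heavily overlapping training examples.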
Tasks. We evaluate the performance of our model on the following utterance-level tagging tasks:
SwDA, the Switchboard Dialogue Act Corpus, annotates 1,155 telephone conversations (224K utterances) with one of the 42 DAs in the DAMSL (Jurafsky, 1997) taxonomy.
LEGO, a subset (14K utterances) of the Let's Go bus-information dialogue system corpus (Raux et al., 2006), annotated with the ISO 24617-2 standard for conversation functions of task by Ribeiro et al. (2016).
To simulate low-resource settings for the larger datasets like SwDA and MRDA, we experiment with different sizes of the training datasets and evaluate on the standard test set for these. For LEGO and Map Task, we use 10-fold cross validation.
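Both evaluation protocols can be sketched with the standard library. The helper names `subsample` and `k_fold` are hypothetical, and two choices here are assumptions rather than details from the text: sampling is done over whole conversations with a fixed seed, and folds are formed by shuffling once and dealing items round-robin.

```python
import random

def subsample(conversations, fraction, seed=0):
    """Keep a random fraction of the training conversations (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(conversations)))
    return rng.sample(list(conversations), k)

def k_fold(items, k=10, seed=0):
    """Yield (train, test) splits for k-fold cross validation."""
    order = list(items)
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```

Each item appears in exactly one test fold, so averaging over the ten splits uses every labeled example for evaluation once, which matters for small datasets such as LEGO and Map Task.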
Experimental Settings. We use four different experimental settings to measure the efficacy of our pretrained utterance representations:
No Context: no conversational encoder is used (i.e., every utterance is encoded independently using ELMo).
Random Initialization: the conversational encoder is randomly initialized and trained only on the downstream tagging task.
Freeze Network: the conversational encoder is initialized using the model pretrained on our bag-of-word objective and kept fixed for the downstream task.
Pre-trained Initialization: the conversational encoder is initialized from the pretrained model and fine-tuned on the downstream task.
These settings are used to isolate the gains from (1) using contextualized representations, (2) pretraining them and then (3) fine-tuning them on the downstream task.
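The four settings differ along three switches, which a small configuration table makes explicit. The class and field names below are hypothetical and serve only to summarize the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Setting:
    use_conversation_encoder: bool   # is there any context beyond the utterance?
    pretrained_init: bool            # start from the bag-of-word pretrained weights?
    finetune_encoder: bool           # update the conversation encoder on the task?

SETTINGS = {
    "no_context": Setting(False, False, False),
    "random_init": Setting(True, False, True),
    "freeze_network": Setting(True, True, False),
    "pretrained_init": Setting(True, True, True),
}
```

Comparing `no_context` with `random_init` isolates the effect of context, `random_init` with `pretrained_init` the effect of pretraining, and `freeze_network` with `pretrained_init` the effect of fine-tuning.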

Results and Discussion
We observe that using pretrained utterance representations improves performance over random initialization and is competitive with existing state-of-the-art work by Kumar et al. (2017) for SwDA and MRDA, and Poria et al. (2017) for IEMOCAP, which use similar hierarchical architectures but are trained only on the task (Random Initialization setting). From Figure 2, we observe that, compared with the other experimental settings, the pretraining-based initialization is especially helpful when the amount of training data is significantly reduced for SwDA, MRDA and IEMOCAP. The improved performance of the Random Initialization setting over fixing the pretrained conversational encoder parameters underscores the need to fine-tune for downstream tasks. Our pretrained model also outperforms random initialization and the existing best results (Ribeiro et al., 2015; Sridhar et al., 2009) for truly low-resource datasets like LEGO and Map Task, as shown in Table 4. We also analyze the gain in accuracy by dialogue act category for the pretrained model over the other experimental settings. We find that the pretrained model shows improvements in the categories listed in Table 3.

Conclusion
We show that using large dialogue corpora to train contextualized utterance embeddings with a bag-of-word reconstruction loss is beneficial for utterance-level tagging in the low-resource setting, indicating that these embeddings learn useful and generalizable properties of conversational discourse. Future work involves incorporating speaker identity, utterance duration and speech/prosody features.

Figure 2: Performance by training data sizes. SOTA: comparable state-of-the-art model trained on tagging task for entire dataset.

Table 1: Snippets of conversation with dialogue act tags. "Yeah" is tagged differently in different contexts.

Table 2: List of dialogue corpora for pretraining contextualized utterance representations.

Table 3: SwDA DA categories that improve using pretrained utterance embeddings, with % improvements in accuracy over other experimental settings.

Table 4: Results on LEGO and Map Task.

Table 5: Dialogue examples from SwitchBoard with dialogue acts as labelled under different experimental settings. The pre-trained network performs better on categories like Summarizing and Collaborative Completion.