Reading Turn by Turn: Hierarchical Attention Architecture for Spoken Dialogue Comprehension

Comprehending multi-turn spoken conversations is an emerging research area, presenting challenges different from reading comprehension of passages due to the interactive nature of information exchange from at least two speakers. Unlike passages, where sentences are often the default semantic modeling unit, in multi-turn conversations, a turn is a topically coherent unit embodied with immediately relevant context, making it a linguistically intuitive segment for computationally modeling verbal interactions. Therefore, in this work, we propose a hierarchical attention neural network architecture, combining turn-level and word-level attention mechanisms, to improve spoken dialogue comprehension performance. Experiments are conducted on a multi-turn conversation dataset, where nurses inquire and discuss symptom information with patients. We empirically show that the proposed approach outperforms standard attention baselines, achieves more efficient learning outcomes, and is more robust to lengthy and out-of-distribution test samples.


Introduction
Reading comprehension has attracted much interest in the past couple of years, fueled by avid neural modeling investigations. Given a certain textual content, the goal is to answer a series of questions based on implicit semantic understanding. Previous work has focused on passages like Wikipedia (Rajpurkar et al., 2016) or news articles (Hermann et al., 2015). Recently, dialogue comprehension in the form of cloze tests and multi-choice questions has also started to spur research interest (Ma et al., 2018; Sun et al., 2019). Different from passages, human-to-human dialogues are a dynamic and interactive flow of information exchange, which are often informal, verbose and repetitive. This leads to lower information density and more topic diffusion, since the spoken content of a conversation is determined by two speakers, each with his/her own thought process and potentially distracting and parallel streams of thoughts.
To address such challenges, we propose to utilize a hierarchical attention mechanism for dialogue comprehension, which has been shown to be effective in various natural language processing tasks (Yang et al., 2016; Choi et al., 2017; Hsu et al., 2018). The hierarchical models successively capture contextual information at different levels of granularity, leveraging coarse-grained attention to reduce the potential distraction in finer-grained attention while exploiting finer-grained attention to distill key information for downstream tasks more precisely and efficiently.
While in document tasks sentences are the default semantic modeling unit at the coarse-grained level, utterances might be a more suitable counterpart in spoken dialogues, as dialogues often consist of incomplete sentences. However, a single utterance/sentence, which usually carries information from one speaker, is insufficient for grasping the full relevant context, as the interactive information from the interlocutor is often necessary. In multi-turn dialogues, each turn is one round of information exchange between speakers, making it a linguistically intuitive segment for modeling verbal communications. We therefore postulate that for spoken dialogue comprehension, it is more effective to model conversations turn by turn using a multi-granularity design.
In this work, we introduce a hierarchical neural attention architecture, integrating turn-level attention with word-level attention for multi-turn dialogue comprehension in a question-answering manner, where we evaluate performance on a corpus preserving linguistic features from real-world spoken conversation scenarios. In particular, we examine how our approach is able to address challenges from limited training data scenarios and from lengthy and out-of-distribution test samples.
Figure 1: Turn-based hierarchical architecture for dialogue comprehension: tokens in purple are the indicators of dialogue turns, and their indices are used to select question-aware hidden states (Green) for turn-level attention calculation. The turn with higher attentive score (Yellow) contributes more in scoring word-level attentions (Red).

Hierarchical Attention Architecture
The proposed architecture of modeling multi-level attention for dialogue comprehension is shown in Figure 1. The model design is based on extractive question answering, and consists of the following layers: a sequence encoding layer, a question-aware modeling layer, a turn-level attention layer, a word-level attention layer, and an answer pointer layer. We elaborate on the details below.

Sequence Encoding Layer
Given a t-length sequence of word embedding vectors S = {w_0, w_1, ..., w_t}, a bi-directional long short-term memory (Bi-LSTM) layer (Schuster and Paliwal, 1997) is used to encode S to a hidden representation H = {h_0, h_1, ..., h_t} ∈ R^{t×d}, where d is the hidden dimension. We obtain the content representation H^c by encoding the dialogue sequence and concatenating the forward and backward information, and we extract the last hidden state of the question encoding as the question representation h^q.

Question-Aware Modeling Layer
We concatenate each step of H^c with the question representation h^q, as in aspect modeling, then obtain the question-aware modeling H via a Bi-LSTM layer.

Turn-Level Attention Layer
We design the turn-level attention to score the dialogue turns explicitly, so that more salient turns obtain higher scores, similar to Hsu et al. (2018). However, instead of calculating the sentence-level attention using a separate recurrent component, we directly obtain the turn representations H^turn by collecting hidden states from H at the turn-level segment position indices t^turn_i, where m is the number of turns in the dialogue content. More specifically, in a two-party conversation, each continuous utterance span between the speakers is labeled as one turn segment, and t^turn_{i+1} − t^turn_i is the length of the i-th turn. The turn-level attentive score A^turn is then calculated via a dense layer and softmax normalization.
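The turn-level scoring step can be illustrated with a minimal sketch. It assumes that the dense layer is a single linear projection to a scalar; the names `turn_attention`, `weights` and `bias`, as well as the toy hidden states, are hypothetical and not taken from our implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def turn_attention(hidden, turn_indices, weights, bias=0.0):
    """Score turns from hidden states gathered at turn indicator positions.

    hidden: list of per-token hidden vectors (question-aware states).
    turn_indices: token positions of the turn indicators.
    weights, bias: parameters of an assumed single-output dense layer.
    """
    # Collect one hidden state per turn instead of running a separate RNN.
    turn_reprs = [hidden[i] for i in turn_indices]
    # Dense layer: one scalar logit per turn.
    logits = [sum(w * x for w, x in zip(weights, h)) + bias for h in turn_reprs]
    # Softmax normalization over the m turns.
    return softmax(logits)

# Toy example: 5 tokens with 2-dimensional hidden states and 3 turns.
hidden = [[0.1, 0.2], [0.0, 0.5], [0.9, 0.4], [0.3, 0.3], [0.2, 0.8]]
turn_indices = [0, 2, 4]
a_turn = turn_attention(hidden, turn_indices, weights=[1.0, -0.5])
```

Because the turn representations are gathered by index, this step adds no recurrent parameters beyond the dense layer.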

Word-Level Attention Layer
In our hierarchical architecture, to mitigate adverse effects of spurious word-level attention from words in less attended turns, we utilize turn-level salient scores to modulate word-level attentions. Thus, we broadcast each a^turn_i in A^turn over its turn length to obtain A at dialogue length, and then multiply H with A to obtain the contextual sequence C'. Then the word-level attention A^word is calculated on C' and multiplied with H to obtain the contextual sequence C''.
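The broadcast-and-modulate step can be sketched as follows. This is an illustration under stated assumptions: the multiplication is taken to be a per-token scalar scaling of the hidden vector, and the word-level attention is assumed to be a dense projection followed by softmax. The function names and `word_weights` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def broadcast_turn_scores(a_turn, turn_lengths):
    """Repeat each turn score over the tokens of its turn."""
    expanded = []
    for score, length in zip(a_turn, turn_lengths):
        expanded.extend([score] * length)
    return expanded

def hierarchical_attention(hidden, a_turn, turn_lengths, word_weights):
    """Return the turn-modulated sequence, word attention, and attended sequence."""
    a = broadcast_turn_scores(a_turn, turn_lengths)
    # First contextual sequence: hidden states scaled by their turn score.
    c_turn = [[a_j * x for x in h] for a_j, h in zip(a, hidden)]
    # Word-level attention computed on the turn-modulated sequence
    # (assumed here to be a dense projection followed by softmax).
    logits = [sum(w * x for w, x in zip(word_weights, c)) for c in c_turn]
    a_word = softmax(logits)
    # Second contextual sequence: hidden states scaled by word attention.
    c_word = [[w_j * x for x in h] for w_j, h in zip(a_word, hidden)]
    return c_turn, a_word, c_word

hidden = [[0.1, 0.2], [0.0, 0.5], [0.9, 0.4], [0.3, 0.3], [0.2, 0.8]]
a_full = broadcast_turn_scores([0.2, 0.7, 0.1], [2, 2, 1])
c_turn, a_word, c_word = hierarchical_attention(
    hidden, [0.2, 0.7, 0.1], [2, 2, 1], word_weights=[1.0, 1.0]
)
```

Tokens in weakly attended turns are scaled down before word-level scoring, which is how the coarse-grained scores suppress spurious fine-grained attention.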

Answer Pointer Layer
The two contextual sequences from the turn-modulated and word-level attention steps, C' and C'', and the question representation h^q are concatenated together and fed to an LSTM modeling layer. Then a dense layer with softmax normalization is applied for answer span prediction (Wang and Jiang, 2016), producing distributions p^s and p^e, where each p^s/p^e indicates the probability of being the start/end position of the answer span.

Loss function
Cross-entropy is used as the loss between the predicted distribution and the ground-truth label. The total loss L_total combines the loss from the answer span (Wang and Jiang, 2016) and the loss from turn-level attentive scoring, similar to Hsu et al. (2018), with a weight λ ∈ [0, 1].
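A minimal sketch of one way to combine the two loss terms is given below. The additive form (span loss plus λ times turn loss), the use of a single gold turn index, and all function names are assumptions made for illustration; they are not confirmed by the text above.

```python
import math

def cross_entropy(probs, target_index, eps=1e-12):
    """Negative log-likelihood of the gold index under a distribution."""
    return -math.log(probs[target_index] + eps)

def total_loss(p_start, p_end, a_turn, gold_start, gold_end, gold_turn, lam=1.0):
    """Answer-span loss plus weighted turn-level loss (assumed additive form)."""
    span_loss = cross_entropy(p_start, gold_start) + cross_entropy(p_end, gold_end)
    turn_loss = cross_entropy(a_turn, gold_turn)
    return span_loss + lam * turn_loss

# Toy distributions over 3 token positions and 2 turns.
p_start = [0.1, 0.7, 0.2]
p_end = [0.1, 0.2, 0.7]
a_turn = [0.25, 0.75]
loss_span_only = total_loss(p_start, p_end, a_turn, 1, 2, 1, lam=0.0)
loss_full = total_loss(p_start, p_end, a_turn, 1, 2, 1, lam=1.0)
```

Setting `lam=0.0` removes the turn-level supervision and leaves only the span objective.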

Corpus & Data Processing
Dialogue Dataset: We evaluated the proposed approach on a spoken dialogue comprehension dataset, consisting of nurse-to-patient symptom monitoring conversations. This corpus was inspired by real dialogues in the clinical setting where nurses inquire about symptoms of patients (Liu et al., 2019). Linguistic structures at the semantic, syntactic, discourse and pragmatic levels were abstracted from these conversations to construct templates for simulating multi-turn dialogues. The informal styles of expressions, including incomplete sentences, incorrect grammar and diffuse flow of topics were preserved. A team of linguistically trained personnel refined, substantiated, and corrected the automatically simulated dialogues by enriching verbal expressions through different English speaking populations in Asia, Europe and the U.S., validating logical correctness through checking if the conversations were natural, reasonable and not disobeying common sense, and verifying the clinical content by consulting certified and registered nurses. These conversations cover 9 topics/symptoms (e.g. headache, cough). For each conversation, the average word number is 255 and the average turn number is 15.5.
Figure 2: Examples of segmented turns in our corpus. The default segmented turn is an adjacency pair of utterances from two speakers (Yellow). To ensure a turn spans across semantically congruent utterances, neighboring utterances could be merged according to a set of rules derived from spoken features, like n-gram repetition (Green), back-channeling (Blue), self-pause (Red) and interlocutor interruption (Gray).
Turn Segmentation: In a smooth conversation, one turn is an adjacency pair of two utterances from two speakers (Sacks et al., 1974). However, in real scenarios, the conversation flow is often disrupted by verbal distractions such as interlocutor interruption, back-channeling, self-pause and repetition (Schlangen, 2006). We thus annotated these verbal features from transcripts of the real-world dialogues and integrated them in the templates, which are used to generate the simulated dialogue data. We subsequently merged the adjacent utterances from speakers considering the features and the intents to form turns (see Figure 2). This procedure ensures semantic congruence of each turn. Then the segment indices of turns were labeled for turn-level context collection.
Annotations for Question Answering: For the comprehension task, questions were raised to query different attributes of a specified symptom; e.g., How frequently did you experience headaches? Answer spans in the dialogues were labeled with start and end indices, and turns containing the answer span were annotated for turn-level attention training.

Baseline Models
We implemented the proposed turn-based hierarchical attention (HA) model, and compared it with several baselines:
Pointer LSTM: We implemented a Pointer network for QA (Vinyals et al., 2015). The content and question embedding are concatenated and fed to a two-layer Bi-LSTM, then the answer span is predicted as in Section 2.5.
Bi-DAF: We implemented the Bi-Directional Attention Flow network (Seo et al., 2017) as an established baseline, which fuses question-aware and context-aware attention.
R-Net: We implemented R-Net (Wang et al., 2017), another established baseline, which introduces self-attention to implicitly model multi-level contextual information.
Utterance-based HA: To evaluate the effectiveness of turn-level modeling, we implemented an utterance-based model as the control, by treating every utterance as a single segment.

Training Configuration
Pre-trained word embeddings from GloVe (Pennington et al., 2014) were utilized and fixed during training. Out-of-vocabulary words were replaced with the [unk] token. The hidden size and embedding dimension were set to 300. The Adam optimizer (Kingma and Ba, 2015) was used with a batch size of 64 and a learning rate of 0.001. For the modeling layers, the dropout rate was set to 0.2 (Srivastava et al., 2014). The weight λ in the loss function was set to 1.0. During training, a validation-based early stopping strategy was applied. During prediction, we selected answer spans using the maximum product of p^s and p^e, with a constraint such that 0 ≤ e − s ≤ 10.
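The constrained span selection at prediction time can be sketched as an exhaustive search over valid (s, e) pairs. The function name `best_span` and the toy probabilities are illustrative; only the product criterion and the constraint 0 ≤ e − s ≤ 10 come from the configuration above.

```python
def best_span(p_start, p_end, max_len=10):
    """Return the (start, end) pair maximizing p_start[s] * p_end[e]
    subject to 0 <= e - s <= max_len, together with its score."""
    best, best_score = (0, 0), -1.0
    for s, ps in enumerate(p_start):
        # Only end positions at or after s and within max_len tokens are valid.
        last = min(s + max_len, len(p_end) - 1)
        for e in range(s, last + 1):
            score = ps * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

p_start = [0.1, 0.6, 0.3]
p_end = [0.5, 0.1, 0.4]
span, score = best_span(p_start, p_end)
single_token_span, _ = best_span(p_start, p_end, max_len=0)
```

The search costs O(t · max_len) for a dialogue of t tokens, which is negligible next to the forward pass.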

Evaluation: Comparison with Baselines
Evaluation was conducted on the dialogue corpus described in Section 3.1, where the training, validation and test sets were 40k, 3k and 3k samples of multi-turn dialogues, respectively. We adopted Exact Match (EM) and F1 score in SQuAD as metrics (Rajpurkar et al., 2016). Results in Table 1 show that while the utterance-based HA network is on par with established baselines, the proposed turn-based HA model obtains more gains, achieving the best EM and F1 scores.
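For reference, the two metrics can be sketched as below. This is a simplified version that only lowercases and splits on whitespace; the official SQuAD evaluation additionally strips punctuation and articles, which is omitted here. The example answer strings are invented for illustration.

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the token sequences are identical after lowercasing, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())

def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("twice a week", "about twice a week")
f1 = f1_score("twice a week", "about twice a week")
```

EM rewards only exact spans, whereas F1 gives partial credit when the predicted span overlaps the annotated one.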

Evaluation in Low-Resource Scenarios
A limited amount of training data is a major pain point for dialogue-based tasks, as it is time-consuming and labor-intensive to collect and annotate natural dialogues at a large scale. We expect the hierarchical structure to result in more efficient learning capabilities. We conducted experiments on a range of training sizes (from 3k to 40k) with a fixed-size test set (3k samples). As shown in Figure 3, the turn-based HA model outperforms all other models significantly when the training set is smaller than 20k.

Lengthy Sample Evaluation
Spoken conversations are often verbose with low information density scattered with topics not central to the main dialogue theme, especially since speakers chit-chat and get distracted during task-oriented discussions. To evaluate such scenarios, we adopted model-independent ADDSENT (Jia and Liang, 2017), where we randomly extracted sentences from SQuAD and inserted them before or after topically coherent segments. The average length of the augmented test set (3k samples) increased from 255 to 900. As shown in Table 2, the proposed turn-based model compares favorably when modeling lengthy dialogues.
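The augmentation can be illustrated with a small sketch that inserts distractor sentences only at segment boundaries, so that no topically coherent segment is split. The function name, the fixed seed, and the placeholder strings are hypothetical; the actual distractors were sampled from SQuAD, and the sampling details are not reproduced here.

```python
import random

def insert_distractors(segments, distractors, n_insert, seed=0):
    """Insert n_insert distractor sentences before or after whole segments.

    segments: ordered list of topically coherent segments (strings).
    distractors: pool of out-of-topic sentences to sample from.
    """
    rng = random.Random(seed)
    augmented = list(segments)
    for _ in range(n_insert):
        # Positions index boundaries between items, so segments stay intact.
        position = rng.randint(0, len(augmented))
        augmented.insert(position, rng.choice(distractors))
    return augmented

segments = ["segment about headache", "segment about cough"]
distractors = ["distractor sentence one", "distractor sentence two"]
augmented = insert_distractors(segments, distractors, n_insert=3)
```

The answer span indices must be shifted after insertion so that they still point at the original answer tokens.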

Out-of-Distribution Evaluation
Another evaluation was performed on an augmented set of dialogue samples, by adding three out-of-distribution symptom entities (bleeding, cold/flu, and sweating) to the corresponding conversations (3k samples). This was conducted on the well-trained models in Section 3.4. As shown in Table 3, the proposed turn-based HA model is the most robust in answering questions related to unseen symptoms/topics while still performing well on in-domain symptoms, thus showing potential generalization capabilities.
In summary, our overall experimental results demonstrate that the proposed hierarchical method achieves higher learning efficiency with robust performance. Moreover, the turn-based model significantly outperforms the utterance-based one, empirically verifying that it is appropriate to use turns as the basic semantic unit in coarse-grained attention for modeling dialogues.

Related Work
Machine comprehension of passages has achieved rapid progress lately, benefiting from large-scale datasets (Rajpurkar et al., 2016; Kocisky et al., 2018), semantic vector representations (Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2019), and end-to-end neural modeling (Wang et al., 2017; Hu et al., 2018). The attention mechanism enables neural models to more flexibly focus on salient contextual segments (Luong et al., 2015; Vaswani et al., 2017), and is further improved by hierarchical designs for document processing tasks (Yang et al., 2016; Choi et al., 2017). Multi-level attention could be fused in hidden representations (Wang et al., 2017) or calculated explicitly (Hsu et al., 2018). There is an established body of work studying how humans take turns speaking during conversations to better understand when and how to generate more natural dialogue responses (Sacks et al., 1974; Wilson et al., 1984; Schlangen, 2006). Utterance-level attention has also been applied to context modeling for different dialogue tasks such as dialogue generation (Serban et al., 2016) and state tracking (Dhingra et al., 2017). Recently, there is emerging interest in machine comprehension of dialogue content (Ma et al., 2018; Sun et al., 2019). To the best of our knowledge, our work is the first in exploiting turn-level attention in neural dialogue comprehension.

Conclusion
We proposed to comprehend dialogues by exploiting a hierarchical neural architecture that incorporates explicit turn-level attention scoring to complement word-level mechanisms. We conducted experiments on a corpus embodying verbal distractors inspired by real-world spoken dialogues that interrupt the coherent flow of conversation topics. Our model compares favorably to established baselines, performs better when there is limited training data, and is capable of addressing challenges from low information density of spoken dialogues and out-of-distribution samples.