Modeling Local Contexts for Joint Dialogue Act Recognition and Sentiment Classification with Bi-channel Dynamic Convolutions

In this paper, we target improving the joint dialogue act recognition (DAR) and sentiment classification (SC) tasks by fully modeling the local contexts of utterances. First, we employ the dynamic convolution network (DCN) as the utterance encoder to capture the dialogue contexts. Further, we propose a novel context-aware dynamic convolution network (CDCN) to better leverage the local contexts when dynamically generating kernels. We extended our frameworks into bi-channel version (i.e., BDCN and BCDCN) under multi-task learning to achieve the joint DAR and SC. Two channels can learn their own feature representations for DAR and SC, respectively, but with latent interaction. Besides, we suggest enhancing the tasks by employing the DiaBERT language model. Our frameworks obtain state-of-the-art performances against all baselines on two benchmark datasets, demonstrating the importance of modeling the local contexts.


Introduction
Dialogue act recognition (DAR) aims to detect the speaker's intention (e.g., question, agreement or statement) in each utterance, which helps dialog systems produce appropriate responses (Inui et al., 2001). Recent studies have further revealed that simultaneously recognizing the dialogue act and detecting the sentiment in a dialog leads to a better grasp of the speaker's intention (Cerisara et al., 2018; Qin et al., 2020). The two tasks are closely related and mutually promote each other when performed jointly. On the one hand, DAR provides clues for sentiment classification (SC). On the other hand, sentiment transitions can benefit dialogue act prediction. Taking the utterances in Table 1 as examples, once the dialogue act Agreement is assigned, the utterance commonly expresses the same sentiment as the previous utterance. Meanwhile, when the speaker changes the sentiment from Negative to Neutral, the dialogue act tends to transition into Statement.

Table 1: Example dialogue with the columns Speaker, Utterance, Dialogue Act and Sentiment. Only the first utterance is recoverable: speaker A, "Does anyone ever feel anxious and empty at the same time?"

Prior works model the joint DAR and SC as a sequence labeling problem, all relying on recurrent-like neural models, e.g., Long Short-Term Memory networks (LSTM) (Chen et al., 2018; Raheja and Tetreault, 2019). However, one crucial drawback of these models is that they fail to fully incorporate the local contextual information in the dialog. Intuitively, for the current utterance, its nearer dialogue neighbors are more influential and informative than the remoter ones, due to the closer replying relationships between them. This is exemplified by the two cases mentioned above, where the mutual impacts between dialogue acts and sentiments more often occur within adjacent utterances. Therefore, in this paper we target fully capturing the local contexts to improve the joint tasks.

Figure 1: The shaded boxes are the source context for generating the corresponding kernels. Dynamic convolution (a) of each head at time-step i generates a kernel with only the current i-th input, while our proposed context-aware dynamic convolution (b) computes kernels within a wide context window.
Convolutional Neural Networks (CNN) are the preferred alternative for effectively extracting local features in discourse in the natural language processing (NLP) community (Kalchbrenner et al., 2014; Kim, 2014). A recent advanced CNN variant, dynamic convolutions (Wu et al., 2019), brings enhanced capabilities. As illustrated in Figure 1(a), it operates with multiple convolution heads, each sharing a dynamic kernel over its dimensions, which makes it possible to dynamically generate different convolution kernels for different input elements at each time-step. Compared with vanilla CNN, the dynamic convolution network is much more flexible in mining contextual features, while effectively reducing model parameters (Kaiser et al., 2018). In this work, we take advantage of the dynamic convolution network as the utterance context extractor for the joint tasks. However, dynamic convolutions generate kernels from the current input utterance alone and ignore its surroundings, which may be problematic when encoding the utterance feature representation. Therefore, we further adapt the dynamic convolutions by designing a context-aware dynamic convolution, as in Figure 1(b). Our context-aware dynamic convolutions calculate kernels under the guidance of wider utterance contexts, which allows the model to better filter or integrate dialogue act and sentiment information from various heads, and to yield more informative contextual representations in dialogue.
We take the dynamic convolution network (DCN) and our context-aware dynamic convolution network (CDCN) as the utterance context encoders, respectively. As shown in Figure 2(a), to support the joint DAR and SC tasks, we extend the models into bi-channel versions (i.e., BDCN or BCDCN) under a multi-task learning framework, with one channel for each subtask. The two channels separately learn their own feature representations but with latent interactions. First, the bi-directional LSTM (BiLSTM) layer encodes the input utterances into representations. Then, in the utterance encoder layer, the multi-layer BDCN (or BCDCN) captures the context representations for DAR and SC, respectively. Finally, after a linear transformation, our model makes the final predictions for the two tasks separately. In addition, we consider exploiting BERT pre-trained contextualized representations, which have been shown to bring nearly 10% improvements for the joint DAR and SC tasks on Mastodon data (Qin et al., 2020). However, an entire dialog often consists of far more than two utterances, while BERT restricts the input to at most two sentence pieces, which limits its utility. We thus adopt DiaBERT (Liu and Lapata, 2019) to yield enhanced utterance representations at the dialogue level.
We experiment on two benchmarks, Mastodon (Cerisara et al., 2018) and DailyDialog (Li et al., 2017). The results show that both our BDCN and BCDCN systems beat the current best baseline by large margins on the joint DAR and SC tasks. In particular, our BCDCN model achieves state-of-the-art performance. With DiaBERT, the performance can be significantly boosted when data is sufficient (e.g., in DailyDialog). Further analysis shows the necessity of capturing the local contexts for the joint tasks. In summary, we make the following contributions. 1) We are the first to propose improving joint dialogue act recognition and sentiment classification by fully capturing the local contexts of utterances. 2) We employ the dynamic convolution network to better encode the local contexts of utterances, based on which we further propose a novel context-aware dynamic convolution network for enhancement. 3) Our proposed frameworks achieve state-of-the-art performance against the current best baselines on two tasks in two datasets. 4) We obtain significantly reinforced results by employing the DiaBERT and BERT language models.

Related Work
Dialogue Act Recognition (DAR) plays an important role in building intelligent and interactive conversation systems and producing appropriate responses. Existing research on DAR falls into two categories. Initial works make a prediction for each dialogue utterance independently, using traditional statistical classifiers with hand-crafted discrete features (Stolcke et al., 2000; Keizer et al., 2002; Surendran and Levow, 2006; Lendvai and Geertzen, 2007; Tavafi et al., 2013). Later, researchers treat DAR as a sequence labeling problem and apply neural network methods to reach significant improvements (Kalchbrenner and Blunsom, 2013; Chen et al., 2018; Kumar et al., 2018; Raheja and Tetreault, 2019; Colombo et al., 2020). On the other hand, capturing the sentiment polarity or opinions expressed in texts, i.e., sentiment classification, has long been an active research topic in the NLP community (Ren et al., 2016; Amplayo et al., 2018; Fei et al., 2019; Fei et al., 2020b). It has been revealed recently that dialogue act recognition and sentiment classification are closely related, and that jointly learning the two tasks is more beneficial (Cerisara et al., 2018; Qin et al., 2020). In this work, we follow these works by employing the joint scheme of the two tasks. Unlike them, we aim to better capture the local contexts of dialogue utterances to improve the tasks.
This work also relates to CNN models. CNN is prominent in retrieving local features from texts or discourses (Kim, 2014; Zhang et al., 2015; Lei et al., 2015), functioning by capturing the n-gram patterns within the input (Kim, 2014). A large body of studies retrofits CNN to yield better task performance. For example, Kaiser et al. (2018) apply depth-wise separable convolutions to neural machine translation, which perform convolutions independently over every channel dimension. Based on the work of Kaiser et al. (2018), Wu et al. (2019) introduce dynamic convolutions. They define a number of convolution heads, where each head shares a kernel over every dimension, computed at each time step. In this work, we take advantage of CNN-like models to sufficiently mine the local context information in dialogue utterances. Based on the dynamic convolutions, we propose context-aware dynamic convolutions, which are expected to give an enhanced capability of local feature extraction for our task.

Framework
We cast both DAR and SC as sequence labeling problems. Given a dialogue C with a sequence of utterances $U = \{u_1, u_2, \ldots, u_T\}$, where T is the length of the sequence, the objective of DAR is to predict the corresponding dialogue act labels $Y^d = \{y^d_1, y^d_2, \ldots, y^d_T\}$. In SC, the goal is to map the utterance sequence to the corresponding sentiment labels $Y^s = \{y^s_1, y^s_2, \ldots, y^s_T\}$. Our framework is built on either the dynamic convolution network (DCN) or the context-aware dynamic convolution network (CDCN). We achieve the joint tasks of DAR and SC by extending these networks into bi-channel versions (i.e., BDCN or BCDCN), where each subtask takes one channel but with latent interactions. The two tasks can be jointly trained under the multi-task learning framework.
As illustrated in Figure 2(a), the overall architecture of our framework consists of three tiers. At the input layer, the BiLSTM encodes the input utterance texts into representations. Then at the utterance encoder layer, the stacked multi-layer BDCN (or BCDCN) captures the context representations for DAR and SC, respectively. Finally, at the output layer, the model predicts labels for two tasks separately.

Input Layer
We denote the word sequence in the t-th utterance $u_t$ as $X_t = \{w_{t,1}, w_{t,2}, \ldots, w_{t,r}\}$, where r is the length of the sequence. We first map the surface words $X_t$ into vectorial representations via a look-up table. We then adopt a BiLSTM to encode the embeddings and obtain $h_t$, the desired input utterance representation for $u_t$. The order information of the utterance sequence is quite crucial to our task. We thus add the sinusoidal position embedding (Vaswani et al., 2017) to encode the absolute position of each utterance. For each utterance, we concatenate the input utterance representation with this position embedding, where PE(·) denotes the representation produced by the positional encoding function.

Figure 2: (a) Joint model for DAR and SC. (b) The structure of the DiaBERT language model.
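As a minimal sketch of the position step described above, the following plain-Python snippet concatenates each utterance vector with a sinusoidal encoding in the style of Vaswani et al. (2017). The function names, the 0-based utterance positions, the toy vectors and the encoding dimension are assumptions made for illustration and are not taken from the paper.

```python
import math

def sinusoidal_pe(position, dim):
    """Sinusoidal encoding of one absolute position (Vaswani et al., 2017)."""
    pe = []
    for i in range(dim):
        # Dimensions 2j and 2j+1 share the frequency 1 / 10000^(2j/dim).
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_position(utterance_vectors, pe_dim):
    """Concatenate each utterance vector h_t with PE(t); t starts at 0 (assumption)."""
    return [h + sinusoidal_pe(t, pe_dim) for t, h in enumerate(utterance_vectors)]

# Toy dialogue: three utterances with made-up 2-dimensional BiLSTM outputs.
H = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
H_pos = add_position(H, pe_dim=4)
```

Concatenation (rather than the element-wise addition used in the original Transformer) keeps the BiLSTM features untouched and lets later layers decide how much to rely on position.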

Utterance Encoder Layer
As described earlier, the nearer neighbors of an utterance $u_t$ can be more informative than farther ones, so we fully model the local contexts via convolution networks. We use either the dynamic convolution network (DCN) or the context-aware dynamic convolution network (CDCN) as our utterance encoder. In what follows, we first elaborate the technical details of DCN and CDCN separately. Then we show how to integrate DAR and SC jointly with two channels.
Dynamic convolution network. Compared with vanilla text convolutions (Kim, 2014), dynamic convolutions generate varying convolution kernels for different input elements at each time-step, and are thus more flexible in mining contextual features. Dynamic convolution is based on the depth-wise separable convolution (Kaiser et al., 2018), where a convolution is performed independently over each input dimension. For the t-th utterance, the input of the convolution with window width k (i.e., a kernel size of 2k + 1) is denoted as $C_t = [h_{t-k}, \cdots, h_{t+k}]$. The depth-wise separable convolution (SConv) then maps $C_t$ to an output representation $o_t$ using the kernel $W_c$ (i.e., the convolution weight), where d is the dimension of the hidden state and $c_{t,i}$ is dimension i of $C_t$. In dynamic convolutions (DConv), each head is based on the depth-wise separable convolution, while the convolution weights are tied over the dimensions within each of the N heads and learned dynamically over time. This mechanism is similar in spirit to multi-head self-attention (Vaswani et al., 2017), where multiple heads jointly capture latent features from different representation subspaces. In the formulation of dynamic convolutions, ⊕ denotes the concatenation operation, $W_n$ are learnable parameters, and $W_{DConv}$ is the kernel of the n-th convolution head, dynamically generated by the kernel generation function KerGen(·) in Eq. 6.
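The mechanism described above can be sketched in plain Python as follows. This is an illustrative sketch following the general recipe of Wu et al. (2019), not the paper's exact equations: the form of `ker_gen` (a linear map followed by a softmax over the window), the zero padding at dialogue boundaries, and all function and variable names are assumptions. The hidden size d is assumed to be divisible by the number of heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def window(H, t, k):
    """Return C_t = [h_{t-k}, ..., h_{t+k}], zero-padded outside the dialogue."""
    d = len(H[0])
    return [H[j] if 0 <= j < len(H) else [0.0] * d
            for j in range(t - k, t + k + 1)]

def ker_gen(h_t, W_n):
    """Assumed KerGen: linear map of h_t, softmax-normalised over 2k+1 positions."""
    return softmax([sum(w * x for w, x in zip(row, h_t)) for row in W_n])

def dynamic_conv(H, t, k, head_weights):
    """One DConv step at utterance t; head_weights[n] has 2k+1 rows of length d."""
    d, n_heads = len(H[0]), len(head_weights)
    per_head = d // n_heads          # dimensions that share one kernel
    C_t = window(H, t, k)
    out = []
    for n, W_n in enumerate(head_weights):
        kernel = ker_gen(H[t], W_n)  # depends on the current utterance only
        for c in range(n * per_head, (n + 1) * per_head):
            # Depth-wise step: dimension c is convolved on its own.
            out.append(sum(kernel[j] * C_t[j][c] for j in range(2 * k + 1)))
    return out

# Toy dialogue: 3 utterances, d = 4, N = 2 heads, window width k = 1.
H = [[1.0, 2.0, 3.0, 4.0], [4.0, 5.0, 6.0, 7.0], [7.0, 8.0, 9.0, 10.0]]
zero_heads = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
o_1 = dynamic_conv(H, t=1, k=1, head_weights=zero_heads)
```

With all-zero generation weights every kernel is uniform, so `o_1` is simply the average of the three utterance vectors in the window; trained weights would instead produce a different kernel for each utterance and head.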
Context-aware dynamic convolution network. Instead of using only the current t-th utterance representation $h_t$ to produce the kernels, as in dynamic convolution, we propose the context-aware dynamic convolution (CDConv), which generates the kernels (i.e., in Eq. 6) from wider contexts.
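To make the contrast concrete, the sketch below compares a kernel generated from the current utterance alone with one generated from a context window of width m. The way the context is summarised here (mean pooling over the window) is an assumption chosen for simplicity, not the paper's formulation; the function names and toy values are likewise hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, W):
    """Linear map producing one logit per kernel position."""
    return [sum(w * x for w, x in zip(row, vec)) for row in W]

def ker_gen_current(H, t, W):
    """DConv-style kernel: generated from the current utterance h_t only."""
    return softmax(project(H[t], W))

def ker_gen_context(H, t, W, m):
    """CDConv-style kernel (assumed form): generated from the mean of h_{t-m..t+m}."""
    lo, hi = max(0, t - m), min(len(H), t + m + 1)
    d = len(H[0])
    ctx = [sum(H[j][c] for j in range(lo, hi)) / (hi - lo) for c in range(d)]
    return softmax(project(ctx, W))

# Toy dialogue in which the middle utterance carries no signal of its own.
H = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # 3 kernel positions, d = 2
k_cur = ker_gen_current(H, 1, W)       # uniform: h_1 is all zeros
k_ctx = ker_gen_context(H, 1, W, m=1)  # neighbours shape the kernel
```

In this toy case the current-input kernel is uninformative, whereas the context-aware kernel puts more weight on the two neighbouring positions, which is the behaviour the wider kernel-generation window is meant to enable.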
We can abstract all the above convolution operations and encapsulate them into a layer, ConvLayer, whose output $\hat{h}_t$ is the representation for the t-th utterance. We can take either DConv or CDConv as the underlying convolution in ConvLayer, referred to as DCN or CDCN, respectively.
Multi-layer bi-channel convolution network for joint DAR and SC. In practice, we first stack the DCN or CDCN into multiple layers, so as to expand the receptive field and fully extract context features at the dialogue level. At the l-th layer, the representation of the t-th utterance, $h^{(l)}_t$, is updated from the previous layer's representation by the convolution layer together with LayerNorm(·), a layer normalization operation (Ba et al., 2016). Next, inspired by the multi-channel CNN (Kim, 2014), we use two identical but separate convolution networks as two channels to model DAR and SC in parallel (namely, BDCN and BCDCN), as shown in Figure 2(a). The two channels learn separately but with latent interactions, such that the two tasks propagate information from different aspects and aggregate it with dynamic convolution operations. To achieve this, we let the input representation of each channel at the l-th layer be the concatenation of the output representations of both channels at the previous layer, i.e., $h^{(l-1),d}_t \oplus h^{(l-1),s}_t$ (Eq. 11). Throughout, the superscript * ∈ {d, s} denotes either the channel for DAR (d) or the channel for SC (s).
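The cross-channel wiring can be sketched as below. The per-layer functions here are deliberately trivial stand-ins (hypothetical `add_halves` and `subtract_halves`) rather than real convolution layers with layer normalization, and feeding both channels the same input at the first layer is an assumption; the point is only to show how each channel's input at every layer is built from both channels' previous outputs.

```python
def bi_channel_stack(H, layers_d, layers_s):
    """Run stacked layers for the DAR (d) and SC (s) channels with shared inputs."""
    h_d, h_s = H, H  # assumption: both channels start from the same representations
    for layer_d, layer_s in zip(layers_d, layers_s):
        # Input of both channels: h^{(l-1),d}_t (+) h^{(l-1),s}_t for every utterance t.
        joint = [a + b for a, b in zip(h_d, h_s)]
        h_d, h_s = layer_d(joint), layer_s(joint)
    return h_d, h_s

def add_halves(joint):
    """Toy stand-in for a DAR-channel layer: sums the two halves of each vector."""
    half = len(joint[0]) // 2
    return [[v[i] + v[i + half] for i in range(half)] for v in joint]

def subtract_halves(joint):
    """Toy stand-in for an SC-channel layer: first half minus second half."""
    half = len(joint[0]) // 2
    return [[v[i] - v[i + half] for i in range(half)] for v in joint]

H = [[1.0, 2.0], [3.0, 4.0]]
h_d, h_s = bi_channel_stack(H, [add_halves] * 2, [subtract_halves] * 2)
```

After the first layer the SC channel holds only zeros in this toy setup, yet after the second layer it is non-zero again because its input included the DAR channel's output, which is the latent interaction the design relies on.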

Output Layer
At the last layer (i.e., the L-th layer), we separately apply two linear layers followed by softmax functions to calculate the output probabilities for the two tasks: $y^*_t = \mathrm{Softmax}(\mathrm{Linear}(h^{L,*}_t))$.

Learning
We optimize the model by minimizing the cross-entropy between the predictions $y^*$ and the gold labels $\hat{y}^*$ for both DAR and SC, which yields the losses $L_d$ and $L_s$. We perform joint learning of the two tasks by combining these two losses. Specifically, instead of using constant coupling coefficients, we employ the homoscedastic uncertainty strategy (Kendall et al., 2018) to learn the weights automatically during training, where $\sigma_d$ and $\sigma_s$ are the variances of the DAR loss and the SC loss over training instances, respectively.
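A minimal sketch of such an uncertainty-weighted objective is given below. It uses one common parameterisation of the Kendall et al. (2018) weighting for classification losses, with a learnable log-variance s = log σ² per task; the exact form used in this paper is not recoverable from the text, so the formula, the function names and the toy numbers are assumptions.

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-probability of the gold label for one utterance."""
    return -math.log(probs[gold_index])

def uncertainty_weighted_loss(loss_d, loss_s, log_var_d, log_var_s):
    """Combine the DAR and SC losses with learnable log-variances s = log(sigma^2).

    Assumed form: exp(-s_d) * L_d + exp(-s_s) * L_s + 0.5 * (s_d + s_s).
    """
    return (math.exp(-log_var_d) * loss_d
            + math.exp(-log_var_s) * loss_s
            + 0.5 * (log_var_d + log_var_s))

# With both log-variances at 0 the objective reduces to the plain sum L_d + L_s.
plain = uncertainty_weighted_loss(1.0, 2.0, 0.0, 0.0)
# A larger variance for SC down-weights its loss but pays a log-variance penalty.
damped = uncertainty_weighted_loss(1.0, 2.0, 0.0, math.log(4.0))
```

In training, the two log-variances would be optimised together with the model parameters, so that the relative weight of each task adapts instead of being fixed by hand.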

DiaBERT: Pre-trained Contextualized Utterance Representation
In §3.1 we use a look-up table to obtain the initial word embeddings. Pre-trained contextualized word representations from the BERT language model (Devlin et al., 2019) have brought great benefits to a wide range of downstream NLP tasks (Jia et al., 2019; Fei et al., 2020a). Very recent work (Qin et al., 2020) demonstrates that using BERT brings large improvements for DAR and SC. In this work we intend to borrow these advances from BERT as well. Nevertheless, the original BERT limits the input to at most two sentence pieces, while one dialog can often comprise far more than two utterances. Directly using BERT can therefore lead to incoherent discourse information.
To this end, we leverage a dialogue-level (discourse-level) BERT-based encoder, DiaBERT (Liu and Lapata, 2019), which takes the whole dialogue and outputs the complete representation. In BERT there can be only one [CLS] token, for at most a pair of utterance sentences, while DiaBERT treats each utterance as a segment, each with a [CLS] token at its start to distinguish different utterances, as illustrated in Figure 2(b). DiaBERT yields token-level and dialog-level output features, where $h^{DB}$ denotes the output for input utterance X. We can thus either make use of the word representations from DiaBERT, or take the one from each [CLS] token as the corresponding utterance representation.
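The input layout described above can be illustrated with a small sketch. Only the per-utterance [CLS] token comes from the text; the whitespace tokenisation, the trailing [SEP] after each utterance and the alternating segment ids are assumptions borrowed from the general design of Liu and Lapata (2019), and the helper name and example utterances are made up.

```python
def build_dialogue_input(utterances):
    """Flatten a dialogue into one token sequence with a [CLS] per utterance."""
    tokens, cls_positions, segment_ids = [], [], []
    for idx, utt in enumerate(utterances):
        cls_positions.append(len(tokens))           # where this utterance's [CLS] sits
        piece = ["[CLS]"] + utt.split() + ["[SEP]"]  # whitespace split is a stand-in
        tokens.extend(piece)
        segment_ids.extend([idx % 2] * len(piece))   # alternate 0/1 (assumption)
    return tokens, cls_positions, segment_ids

dialogue = ["Does anyone feel anxious ?", "I do ."]
tokens, cls_positions, segment_ids = build_dialogue_input(dialogue)
```

The encoder output at each index in `cls_positions` would then serve as the utterance-level representation, while the outputs at the remaining positions give the token-level features.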

Dataset and Evaluation
We evaluate our methods on two benchmark datasets, Mastodon and DailyDialog. Table 2 shows the detailed statistics. Since Mastodon has no development set, we randomly hold out 10% of the training-set dialogues for this purpose. Following Qin et al. (2020), we use macro-averaged Precision (P), Recall (R) and F1 as the major metrics for the two tasks on DailyDialog. On Mastodon, following Cerisara et al. (2018), we ignore the neutral sentiment label in SC, and for DAR we adopt the average of the F1 scores weighted by the prevalence of each dialogue act. For all experiments, we pick the model that performs best on the development set, and all reported test-set results are averaged over 20 runs.
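The difference between the two averaging schemes mentioned above can be seen in a short sketch. The helper names and the toy labels are hypothetical, and macro F1 is computed here as the unweighted mean of per-label F1 scores, which is one common convention and may differ in detail from the evaluation scripts actually used.

```python
def per_label_f1(gold, pred):
    """Return {label: (f1, support)} for every label that occurs in gold."""
    scores = {}
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (f1, tp + fn)
    return scores

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1: every label counts equally."""
    scores = per_label_f1(gold, pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def weighted_f1(gold, pred):
    """Mean of per-label F1 weighted by how often each label occurs in gold."""
    scores = per_label_f1(gold, pred)
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total

gold = ["Question", "Question", "Question", "Statement"]
pred = ["Question", "Question", "Statement", "Statement"]
macro, weighted = macro_f1(gold, pred), weighted_f1(gold, pred)
```

Because Question is three times as frequent as Statement in this toy example, the weighted score sits closer to the Question F1 than the macro score does.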

Hyper-parameter and Resource
The BDCN and BCDCN are set to 3 layers for best performance according to preliminary experiments. We adopt Adam as the optimizer with an initial learning rate of 5e-4 and a weight decay of 1e-5. The minibatch size is 16. To alleviate overfitting, we use a dropout rate of 0.5 on the input layer and the output layer. For fair comparison, we randomly initialize the word embeddings without pre-training, following Qin et al. (2020), and the dimension is selected from [100, 200, 300, 400]. Our DiaBERT shares the same architecture as the official BERT (Base version) and is pre-trained in the same way.

Baseline System
We divide the state-of-the-art baseline models into three categories. 1) The models for DAR, including HEC (Kumar et al., 2018), CRF-ASN (Chen et al., 2018) and CASA (Raheja and Tetreault, 2019).

Developing Experiment
We first conduct development experiments for our BDCN and BCDCN encoders on the development sets of the two datasets, to find the best hyper-parameters k and m. Figure 3 plots the varying results.

Table 3: Main performances. Avg. represents the averaged F1 score over SC and DAR on the dataset. All the scores of baseline models are reprinted from Qin et al. (2020) without any modification.

BDCN and BCDCN perform the best when the kernel sizes k are both set as [3,5,7] on both the Mastodon and DailyDialog datasets. Meanwhile, keeping the best k as [3,5,7] for BCDCN, the results are best when the context window m of kernel generation is set as [3,5,7] on Mastodon, and [3,3,3] on DailyDialog. We keep these configurations for BDCN and BCDCN in the following experiments.
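As a rough illustration of what the per-layer kernel sizes [3,5,7] imply for the stacked encoder, the sketch below computes the receptive field over utterances. It assumes stride-1, non-dilated convolutions and ignores any additional context contributed by the kernel-generation window m; the function name is hypothetical.

```python
def receptive_field(kernel_sizes):
    """Number of input utterances that can influence one top-layer output."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by (kernel size - 1)
    return rf

span = receptive_field([3, 5, 7])  # three stacked layers as selected above
```

Under these assumptions the three layers together cover 13 utterances, i.e., up to six neighbours on each side of the current utterance.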

Main Result
We compare the performances by all systems under 1) the separate learning of two tasks of SC and DAR, and 2) the joint learning of two tasks, respectively. In Table 3 we show the main results on Mastodon and DailyDialog datasets, respectively, from which we can have the following observations.
First of all, on both datasets, the models under joint task learning consistently outperform those under separate learning of the two single tasks, as shown by comparing the results of DCN/CDCN with those of BDCN/BCDCN. This verifies the necessity of jointly training DAR and SC. Among the models with separate task learning, our DCN and CDCN perform much better than the baselines, demonstrating the effectiveness of capturing the local contexts for the tasks.
Second, among the models with joint learning, our BDCN consistently shows better results than the current state-of-the-art model (i.e., DCR-Net). In particular, our BCDCN achieves the best F1 scores, with an average of 52.7% on Mastodon and 64.5% on DailyDialog. This again supports the benefit of local feature extraction for DAR and SC. Besides, we find that the improvements of our BCDCN over the best baseline are overall higher on DailyDialog (2.2% = 64.5 - 62.3) than on Mastodon (0.8% = 52.7 - 51.9).
Third, we see that with the help of pre-trained contextual language models (i.e., BERT and DiaBERT), the performance under both separate task learning and joint learning can be greatly boosted. Notably, the enhancements are much more significant on Mastodon than on DailyDialog. For example, with BERT, our BCDCN gains 12.4% (65.1 - 52.7) F1 on Mastodon and 3.4% (67.5 - 64.1) on DailyDialog. This is largely because the small size of the Mastodon data gives limited supervision signals for training, in which case external information is greatly beneficial. Also, with language models, BCDCN gains much higher F1 scores than DCR-Net. Last but not least, similar to BERT, DiaBERT brings great improvements for the tasks; however, its help is more significant on DailyDialog than on Mastodon. This largely lies in the difference in the data scale available for fine-tuning.

Ablation Study
We now take a further step and study our model under various configurations. We conduct ablation experiments for the BCDCN model on DailyDialog data, as shown in Table 4. First, without the position embeddings at the input layer, we see slight performance drops. Next, we focus on the convolution part of our model. We cancel the latent interactions between DAR and SC by taking $h^{(l-1),*}_t$ instead of $h^{(l-1),d}_t \oplus h^{(l-1),s}_t$ (in Eq. 11) as the input of channel * at the l-th layer. Removing the dynamic kernel of the BCDCN model also results in minor drops. When replacing the context-aware dynamic convolution with 1) vanilla text convolution, 2) depth-wise convolution and 3) dynamic convolution, respectively, we observe varying degrees of performance decrease. In particular, by employing advanced convolutions rather than the vanilla text convolution, the number of model parameters can be greatly reduced. Significant reductions for both tasks can consequently be witnessed. Finally, using fixed coupling coefficients for the multi-task training loss (i.e., $0.5 L_d + 0.5 L_s$), rather than the adopted homoscedastic uncertainty strategy, results in suboptimal learning performance.

The Impact of Data Scale for Fine-tuning Language Model
In Table 3 we notice that DiaBERT outperforms BERT only where comparatively larger data for fine-tuning is available (i.e., DailyDialog). Otherwise, its improvements are inferior to those of BERT (i.e., Mastodon). Here we explore how DiaBERT is affected by data scarcity. Based on DailyDialog data, we vary the size of the training set used for fine-tuning DiaBERT and BERT, and test the F1 scores for DAR and SC by BCDCN. From the trends in Figure 4, we see that the performance with BERT is better than that with DiaBERT when less than 20% of the training data is used. In addition, DiaBERT gains more improvements than BERT when training signals are abundant.

Figure 5 (caption, partially recovered): The current input for the models is the 3rd utterance. Each x label (e.g., s→d) indicates the attentive interaction (by Eq. 6) of one channel (e.g., SC) to another channel (e.g., DAR).

Visualization on Bi-channel Dynamic Convolution Mechanism
Finally, we empirically visualize the convolution weights to better understand the mechanism of the bi-channel dynamic convolution in the BDCN and BCDCN encoders. Specifically, we illustrate the dynamic kernel weights with the 3rd utterance as the current time step, under the attentive interaction (by Eq. 6) of one channel to another channel, as in Figure 5. First, we see that the dynamic kernels help to retrieve informative clues for the corresponding tasks, as indicated by the highlighted weights in different heads and channels of the convolutions in both encoders. Each channel pays more attention to its own task. More importantly, the dynamic convolutions in one channel can work collaboratively with the other channel. For example, consider the highly weighted kernel of the heads for the utterance in the DAR channel, highlighted by the red box in Figure 5(a) (right), where the kernel attends to both the SC and DAR channels (i.e., d→s). Given the previous utterance's dialogue act Question and the current sentiment Negative, the model is informed to transition the dialogue act from Question to Answer. This indicates that our proposed interactive bi-channel mechanism is effective for coordinating the joint tasks with latent interaction. Besides, we see that there are more highlighted weights in BCDCN than in BDCN. With wider local contexts modeled by the dynamic kernel generation mechanism, BCDCN is more effective at mining informative clues for the joint DAR and SC tasks.

Conclusion
In this work, we proposed to improve the joint dialogue act recognition (DAR) and sentiment classification (SC) by fully modeling the local contexts of utterances. Based on the dynamic convolution network, we proposed a context-aware dynamic convolution network for better leveraging the local dialogue contexts when generating convolution kernels. We extended our frameworks into bi-channel version under multi-task learning for the joint DAR and SC. Our models obtained state-of-the-art performances on two benchmark datasets, demonstrating the significance of modeling the local contexts for the joint task.