Modeling Temporality of Human Intentions by Domain Adaptation

Categorizing patients' intentions in conversational assessment can support decision making in clinical treatment. Many conversation corpora span a series of time stages. However, it is not clear how theme shifts within a conversation affect the performance of human intention categorization (e.g., patients might behave differently at the beginning versus the end of a session). This paper proposes a method that models the temporal factor by applying domain adaptation to a clinical dialogue corpus of Motivational Interviewing (MI). We jointly deploy a Bi-LSTM and a topic model to learn how language usage changes across time stages. Experiments on the MI corpus show a promising improvement when temporality is considered in the classification task.


Introduction
Motivational Interviewing (MI) (Miller and Rollnick, 2012) is a collaborative communication style used to address a variety of health problems such as alcohol and drug use. Accurately understanding the patient's intention to change from his/her speech during the session could greatly enhance the efficacy of MI. The Motivational Interviewing Skill Code (MISC) is a coding system that captures client language, specifically change talk (CT) and sustain talk (ST) (Miller et al., 2008). However, reliable MISC coding is labor-intensive and requires domain expertise. Computational annotation methods have recently been proposed to automatically classify patients' behaviors within MI (Xiao et al., 2016; Pérez-Rosas et al., 2017; Gibson et al., 2017). To this end, Recurrent Neural Networks (RNNs), which capture sequential information, have been applied to the classification of patient behavior.
Recent research shows that themes and words within a conversation change over time (Dufour et al., 2016). Similarly, within MI, topics and the patient's attitude towards change might shift. In this work, we investigate how shifts in themes over time affect the performance of intention classification for the dialogue.
Specifically, we focus on the patient intent classification task and propose a method that incorporates the temporal factor through domain adaptation to improve classifier performance. We evaluate our approach on a dataset of college alcoholism (Carey et al., 2009; Borsari et al., 2012) containing transcripts of MI conducted with U.S. college students. We first explore the theme shift and give a brief analysis using topic modeling (Blei et al., 2003). We then use a Bi-directional Long Short-Term Memory (Bi-LSTM) network (Graves and Schmidhuber, 2005) to encode utterances from both word and topic embeddings. Next, we concatenate the contextual information with the encoded utterance representations. Finally, we jointly train a unified utterance representation through domain adversarial training and patient intent classification. We show that this approach can lead to improvements in classification performance.

Dataset
We conduct our experiments on a clinical dataset of college student alcoholism (Carey et al., 2009; Borsari et al., 2012), from which we obtain 193 MI transcripts with a total of 83677 utterances. Each MI session lasted between 60 and 90 minutes. Each client utterance was coded using the MISC. In this paper, we focus on classifying patient behavior at the utterance level. Specifically, we classify patient behavior on MISC annotation codes collapsed into three categories: "CT": change talk indicates utterances that reflect motivating factors related to change; "ST": sustain talk indicates that the patient has no intention to change; "FN": follow neutral means there is no indication of patient inclination. An example conversation snippet highlighting all three sources of information is provided in Table 1. The intention labels (o+3, o-3) are available only for patient utterances; '+' and '-' refer to change vs. sustain talk (CT vs. ST), and the number measures the "strength of client language," which represents a subjective assessment by human annotators. The codes 'quo' and 'quc' refer to "open question" and "closed question" and apply only to the interventionist (see (Borsari et al., 2015) for details regarding the coding strategy). While the MISC codes for client utterances are more complex and comprise other types of annotations, we focus on human intention modeling (i.e., CT vs. ST vs. FN) only.
How does the theme of the dialogue shift over time? We qualitatively examined how the distribution of content changes across different time stages. To measure the distribution of content, we trained a topic model with 10 topics using Gensim (Řehůřek and Sojka, 2010) with default parameters. The data does not have associated timestamps, so we empirically split each MI transcript into three time stages (stage 1, stage 2, and stage 3) with equal numbers of patient utterances. We calculated the proportion of each topic within each time stage by averaging over all transcripts. We then normalized the topic distributions and visualized the extent to which the distributions of the 10 topics vary over time.
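The stage-wise averaging described above can be sketched as follows. This is a minimal illustration, not the original analysis code: it assumes that per-utterance topic probabilities have already been inferred by a topic model, that each transcript has at least three patient utterances, and the function names `split_into_stages` and `stage_topic_proportions` are hypothetical.

```python
import numpy as np

def split_into_stages(items, n_stages=3):
    """Split a sequence into n_stages contiguous parts of near-equal length."""
    bounds = np.linspace(0, len(items), n_stages + 1).astype(int)
    return [items[bounds[i]:bounds[i + 1]] for i in range(n_stages)]

def stage_topic_proportions(transcripts, n_stages=3):
    """Average topic proportions per time stage over all transcripts.

    transcripts: list of arrays of shape (n_utterances, n_topics), holding
    per-utterance topic probabilities (assumed to be precomputed).
    Returns an (n_stages, n_topics) array whose rows sum to 1.
    """
    per_transcript = []
    for doc in transcripts:
        stages = split_into_stages(np.asarray(doc, dtype=float), n_stages)
        # Mean topic distribution of the utterances inside each stage.
        per_transcript.append([stage.mean(axis=0) for stage in stages])
    mean = np.mean(per_transcript, axis=0)          # average over transcripts
    return mean / mean.sum(axis=1, keepdims=True)   # normalize each stage

# Toy usage with random topic distributions (invented data).
rng = np.random.default_rng(0)
docs = [rng.dirichlet(np.ones(10), size=n) for n in (9, 12, 7)]
props = stage_topic_proportions(docs)
```

Each row of `props` is the normalized topic distribution of one stage, which is the quantity that would be plotted to compare stages.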
We observe varied topic distributions across the different stages of the conversations, where the topic distributions are plotted from bottom to top. Some topics vary more, such as topic 4, while others are very stable, such as topic 1. Recent research shows that the performance of classification tasks might be affected by the temporal character of language (Huang and Paul, 2018). Thus, it might be desirable to model temporality in the computational classifiers.

Model
The architecture of the proposed model is shown in Figure 2. We feed four types of information to the model: topic- and word-level data of the utterance (content), preceding interventionist verbal behavior (context), and prior MISC annotations of utterances (MISC). In particular, we empirically extracted the previous 5 utterances as context and the previous 10 codes as MISC, where we set "unk" as the default.
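The extraction of context and prior codes can be sketched as below. This is only an illustration under stated assumptions: the tuple layout of a turn, the left-padding, and the function name `build_inputs` are hypothetical, and the sketch takes the previous utterances regardless of speaker, whereas a speaker filter could be applied to keep interventionist turns only. The toy dialogue and the "fn" code are invented.

```python
def build_inputs(turns, idx, n_context=5, n_codes=10, pad="unk"):
    """Collect the context and prior MISC codes for the turn at position idx.

    turns: list of (speaker, text, code) tuples in conversation order
    (a hypothetical layout). Missing history is filled with the pad value.
    """
    history = turns[:idx]
    context = [text for _, text, _ in history][-n_context:]
    codes = [code for _, _, code in history][-n_codes:]
    # Left-pad so the most recent items always sit at the end (an assumption).
    context = [pad] * (n_context - len(context)) + context
    codes = [pad] * (n_codes - len(codes)) + codes
    return context, codes

# Toy dialogue (invented): T = interventionist, P = patient.
turns = [
    ("T", "how are you", "quo"),
    ("P", "fine i guess", "fn"),
    ("T", "tell me about drinking", "quo"),
    ("P", "i drink on weekends", "o-3"),
]
context, codes = build_inputs(turns, 3)
```

For the first turn of a session there is no history, so both returned lists consist entirely of "unk".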
Embeddings. We built two types of embeddings: word embeddings and topic embeddings. We created word embeddings from Google's pre-trained Word2Vec (Mikolov et al., 2013) and topic embeddings from an LDA model (Blei et al., 2003) trained on the corpus. For the MISC codes, we treated each MISC code sequence as one document and trained an embedding model.
We apply a Bidirectional LSTM (Bi-LSTM) (Graves and Schmidhuber, 2005) to the inputs. Dropout (Srivastava et al., 2014) is applied to the outputs of the Bi-LSTM. We merge the outputs by concatenation and feed them to a dense layer to learn a unified representation of the utterance.
Joint Learning. We apply domain adversarial training (Ganin et al., 2016) only to the topic inputs from the learned topic representations. Our intuition is that the topic distributions across different stages of the MI session could track the variations in patients' intents. We empirically split the conversation into three time stages: Stage 1-3 (i.e., beginning, middle, and end). We cast the domain adversarial objective as a time stage prediction task, which aims to differentiate topic themes both locally and globally. We used one-hot encoding to represent the labels of the prediction tasks. We deploy softmax functions for both time stage and intention predictions. We use categorical cross entropy to jointly optimize the two classification tasks: domain classification and patient intent classification.
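To make the joint objective concrete, the following sketch computes a weighted sum of two categorical cross-entropy losses on softmax outputs. It shows only the forward computation; the gradient reversal used in domain adversarial training (Ganin et al., 2016) is not shown, and combining the two losses by a weighted sum is an assumption rather than a confirmed detail of the model. The default `stage_weight=0.05` follows the loss weight reported in the Experiments section; all function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, one_hot):
    """Mean categorical cross entropy between predictions and one-hot labels."""
    return float(-np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1)))

def joint_loss(intent_logits, intent_labels, stage_logits, stage_labels,
               stage_weight=0.05):
    """Intent loss plus a down-weighted time stage (domain) loss."""
    intent = categorical_cross_entropy(softmax(intent_logits), intent_labels)
    stage = categorical_cross_entropy(softmax(stage_logits), stage_labels)
    return intent + stage_weight * stage, intent, stage

# Toy batch (invented): 4 utterances, 3 intent classes, 3 time stages.
logits = np.zeros((4, 3))
labels = np.eye(3)[[0, 1, 2, 0]]   # one-hot labels
total, intent, stage = joint_loss(logits, labels, logits, labels)
```

With uniform predictions over three classes, each individual loss equals log 3, so the small stage weight keeps intent classification as the dominant term.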

Experiments
Each utterance is lowercased and tokenized by NLTK (Bird et al., 2009). We filter out the utterances that are shorter than 5 tokens and then remove punctuation. Finally, we obtain 22432 patient utterances. The dataset is stratified and split into a training set (80%), a validation set (10%), and a test set (10%), as shown in Table 2. We train our models on the training set and run a grid search on the validation set to find the optimal parameters according to the weighted F1 score. The optimized parameters are as follows. The models were trained for 15 epochs with a batch size of 64. Each utterance and its context are padded to 50 words. The utterance's previous MISC codes are padded to 10. We pad the sequences with an "unknown" token. The size of LSTMs was tuned in the range of  (Hinton et al., 2012) or Adam (Kingma and Ba, 2014) with a fixed learning rate of 0.001. Finally, we empirically set the loss weight of the domain adversarial training to 0.05.
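The preprocessing and padding steps can be sketched as follows. This is a simplified stand-in rather than the original pipeline: whitespace splitting replaces the NLTK tokenizer, and right-padding with truncation to the target length is an assumption. The helper names `preprocess` and `pad` are hypothetical.

```python
import string

def preprocess(utterance, min_tokens=5):
    """Lowercase, tokenize, drop short utterances, then strip punctuation.

    Whitespace splitting is used here as a stand-in for the NLTK tokenizer.
    Returns None when the utterance has fewer than min_tokens tokens.
    """
    tokens = utterance.lower().split()
    if len(tokens) < min_tokens:
        return None
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [t for t in tokens if t]  # drop tokens that were only punctuation

def pad(tokens, length=50, pad_token="unknown"):
    """Truncate or right-pad a token list to a fixed length (an assumption)."""
    return tokens[:length] + [pad_token] * (length - len(tokens))

# Toy usage (invented utterance).
tokens = preprocess("I don't really want to stop drinking, honestly.")
padded = pad(tokens)
```

The same `pad` helper could be applied to the MISC code sequences with `length=10`.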
We trained the topic model on the MI corpus using Gensim (Řehůřek and Sojka, 2010). The number of topics was selected among 5, 10, and 20 topics according to coherence scores. We used Google's pre-trained word embeddings with 300 dimensions (Mikolov et al., 2013). We obtained 50-dimensional code embeddings for the MISC codes with Word2vec (Mikolov et al., 2013), where each sequence of MISC codes was treated as a document.
We select three different approaches as our baselines with the inputs: content, context, MISC, and topic.  (Mikolov et al., 2013) in the recent past. We experiment with feeding the classifier with word vectors while keeping the same parameter settings as the Perez2017 lin baseline. We concatenate word embeddings to build utterance representations; this baseline is denoted as "Vec-con".
• (Xiao et al., 2016) (denoted as Xiao2016): Their approach applies a bi-directional RNN to encode each utterance from both the utterance itself and its preceding one. There are two major differences between their method and ours: first, they did not consider temporality in their model; second, they did not use the previous MISC sequences as inputs. They used the Gated Recurrent Unit (GRU) (Chung et al., 2014) as the RNN cell.
We use "Co", "Ct", and "MISC" to denote the utterance (content), preceding interventionist verbal behavior (context), and prior MISC annotations of utterances (MISC), respectively, and "All" to denote all of the inputs. (The baselines did not use one or more of these inputs, namely the context and MISC, in their original publications; we used different input combinations for a fair comparison.) We use "T" to denote the temporal shifts proposed in our paper. We balance training weights for the classification labels. We use metrics from scikit-learn (Buitinck et al., 2013) to evaluate classification performance by precision, recall, and weighted F1 on the intention labels. The results of our experiments are summarized in Table 3. The findings indicate that our proposed approach leads to a small performance boost after using the topic embeddings. Thus, our simple feature augmentation approach has the potential to make classifiers more robust. In addition, the contextual information ("Ct") is quite useful for identifying the patients' current intentions, and the sequential information through time stages is strongly indicative of human intentions.
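For reference, the weighted F1 score can be written out by hand as below. This sketch is intended to mirror the support-weighted average of per-class F1 scores that scikit-learn reports with `average="weighted"`, but it is an independent illustration rather than the evaluation code used in the experiments; the function name and the toy labels are invented.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n * f1  # weight the class F1 by its number of true instances
    return total / len(y_true)

# Toy predictions (invented) over the three intention labels.
y_true = ["CT", "CT", "ST", "FN"]
y_pred = ["CT", "ST", "ST", "FN"]
score = weighted_f1(y_true, y_pred)
```

Because the weights are the class supports, frequent labels dominate the score, which is why balancing the training weights of the labels matters for the rarer classes.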

Significance Analysis
We conducted a significance analysis to compare Xiao2016 and our proposed method. Because Xiao2016 only used content and context inputs, we train our method with the same inputs (Co+Ct) in this analysis. We followed the bootstrap sampling method (Berg-Kirkpatrick et al., 2012) to create 50 pairs of training and test datasets by sampling with replacement, keeping the sizes the same as in Table 2. We keep the same experimental steps and use the parameters that achieved the best performance in Table 3 to train the models.
To compare the two approaches, we conduct a paired t-test on the F1 scores achieved by both models. We used a two-tailed test instead of the one-tailed test used in the paper due to its increased rigor and lack of prior assumptions (Dror et al., 2018). The test reveals a significant result with t(85) = 3.084 and p = 0.00275. The result shows that we can reject the null hypothesis that our proposed method is not better than Xiao2016.
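The paired t statistic over matched F1 scores can be sketched as follows. This is a generic textbook computation, not the original analysis script: the function name is hypothetical, the scores are invented, and the sketch assumes the paired differences are not all identical (otherwise the variance is zero).

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, n - 1  # a paired test over n pairs has n - 1 degrees of freedom

# Toy F1 scores (invented) from four matched bootstrap runs.
ours = [0.70, 0.72, 0.68, 0.75]
baseline = [0.68, 0.69, 0.67, 0.71]
t_value, dof = paired_t_statistic(ours, baseline)
```

A two-tailed p-value can then be read from the t distribution with `dof` degrees of freedom, for example with `scipy.stats.ttest_rel`, which performs the same paired test.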

Conclusion
In this paper, we focus on the temporal characteristics of the MI corpus and propose a simple method that models the temporal factor within a single MI session. We jointly learn the utterance representation via time stage and intention predictions, and the proposed model improves the performance of the classification task. The identified intent of clients could help therapists adjust their treatment strategy. In future work, we will investigate other external sources of knowledge, such as acoustic cues and videos, to further improve the performance of the model.

Acknowledgements
We thank the anonymous reviewers for their constructive comments. Part of this work was done while the first author was a summer intern at ICT, USC. The idea of modeling the temporal factor was inspired by the paper (Huang and Paul, 2018), co-authored with Michael J. Paul. This work was supported by National Institute on Alcohol Abuse and Alcoholism grants R01 AA015518 and R01 AA017427 to B. Borsari, and R01 AA012518 to K. Carey. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Alcohol Abuse and Alcoholism, the National Institutes of Health, the Department of Veterans Affairs, or the United States Government. The authors would like to thank the students and therapists who allowed their audiotapes to be utilized for this study.