Dialogue Act Classification in Team Communication for Robot Assisted Disaster Response

We present the results we obtained on the classification of dialogue acts in a corpus of human-human team communication in the domain of robot-assisted disaster response. We annotated dialogue acts according to the ISO 24617-2 standard scheme and carried out experiments using the FastText linear classifier as well as several neural architectures, including feed-forward, recurrent and convolutional neural models with different types of embeddings, context and attention mechanisms. The best performance was achieved with the "Divide & Merge" architecture presented in this paper, using trainable GloVe embeddings and a structured dialogue history. This model learns from the current utterance and the preceding context separately and then combines the two generated representations. The average accuracy over 10-fold cross-validation is 79.8%, with an F-score of 71.8%.


Introduction
Disaster response teams operate in high-risk situations and face critical decisions despite partial and uncertain information. First responders increasingly deploy mobile robotic systems to mitigate risk and increase operational capability. In order for robotic systems to provide optimal support for mission execution, they need mission knowledge, i.e., run-time awareness and understanding of the mission goals, the team composition, the tasks of the team(s), how and by whom they are being carried out, the state of their execution, etc. Since first responders typically operate under high cognitive load and time pressure, it is paramount to keep the burden of entering mission knowledge into the system at a minimum. The goal of our research is thus to develop methods for extracting run-time mission knowledge from the verbal communication in the response team. The acquired mission knowledge can also be used to assist the first responders during or after the mission, for example by supporting the real-time coordination of human and robot actions or by generating mission documentation (Willms et al., 2019).
In this paper we address one particular subproblem: dialogue act (DA) recognition. DAs are needed for a better understanding of the team communication and of how the mission tasks are being executed. For example, Requests communicate task assignments and thus allow us to distinguish task assignments from other task-relevant information exchange; Informs often report task progress; and Questions indicate what was unclear and required additional explanation. These distinctions are also useful for providing assistance, including compiling mission documentation.
We use the corpus of human-human team communication in robot-assisted disaster response collected in the TRADR project (Kruijff-Korbayová et al., 2015). The TRADR team communication is task-oriented, focused on the collaborative execution of a mission by a structured team using mobile robots to remotely gather situation awareness in a complex, dynamic, unknown physical environment. In this respect the communication differs from that in well-known existing corpora annotated with DAs.
We annotated our corpus with DAs following the ISO 24617-2 scheme (Bunt et al., 2012, 2017) and experimented with several machine learning approaches to DA classification. We explored various models, including different ways of taking dialogue context into account.
We give an overview of previous work on DA classification and existing corpora with DA annotations in §2. We present our corpus in §3 and provide statistics on the DA and speaker role distributions. In §4 we describe the classification models tested in our experiments and report the evaluation results. We conclude with a discussion and future plans in §5.


Related Work
There has been very little work on dialogue processing in this domain so far. In the pioneering TRIPS project, a decision-support dialogue system was developed for planning an island evacuation in the event of a natural disaster, with a focus on semantic parsing and task-specific interpretation. This approach was further developed to handle the various, more complex emergency tasks covered in the Monroe corpus (Stent, 2000). This work focused on mission planning (not execution), the data was collected in the lab (not in a real disaster environment), and the participants were students (not actual first responders). DAs were annotated using the DAMSL scheme (Core and Allen, 1997).
Some works on human-robot collaboration for disaster response address the interpretation of verbal commands to robots (Kruijff et al., 2014; Yazdani et al., 2018), but not the overall team communication.
In (Martin and Foltz, 2004), automatic analysis of the semantic content of team communication and automatic verbal behavior labeling were used to assess team performance in a command and control task with an unmanned aerial vehicle in a simulated environment. A corresponding synthetic team-member agent is described in (Cooke et al., 2016). Since the corpus is not available and the publications do not provide details on the task and communication complexity, a closer comparison to our work is not possible. Communication analysis was also used in (Burke et al.). They designed and manually applied a team communication coding scheme in order to examine robot operator situation awareness and technical search team interaction during a high-fidelity disaster response drill with teleoperated robots. DAs are reflected in their annotation of the forms and functions of communication contributions.
Corpora with DA annotations also include well-known human-human dialogue corpora, such as MapTask (Anderson et al., 1991; Carletta et al., 1997), TRAINS (Allen, 1991), Switchboard (Godfrey et al., 1992), the Meeting Recorder Dialogue Act corpus (Shriberg et al., 2004) and the AMI Meeting Corpus (Carletta et al., 2005), as well as recent large corpora, e.g., Maluuba Frames (Schulz et al., 2017) and MultiWOZ (Budzianowski et al., 2018). These corpora cover different domains, and the goals the participants pursue in their interactions are quite different from those in the team communication in our domain.
Despite the differences, it would be interesting to see how DA classification models developed on other existing corpora perform on our corpus. The challenge of such an endeavor is, however, that different and sometimes very task-specific schemes have been applied to annotate DAs. For instance, some of the DAs in the Maluuba Frames corpus include domain-specific labels such as Canthelp and No result as well as Thankyou and Moreinfo.
The ISO 24617-2 standard for DA annotation, introduced in (Bunt et al., 2012) and further defined in (Bunt et al., 2017), was proposed to overcome this. To date, several corpora have been annotated accordingly and made available through the DialogBank (Bunt et al., 2016). Although the mapping of DA labels from other annotations to the ISO standard is quite straightforward in some cases (e.g., for Inform or Request), in other cases the specificity of the domain prevents further generalization, as discussed in (Chowdhury et al., 2016). These issues led us to postpone transfer learning to future work and to start, traditionally, with experiments on our own corpus.
Only a few works to date have systematically tested different kinds of context for DA classification. Several experiments on the Switchboard corpus are described in (Ribeiro et al., 2015), which tested untagged and index-tagged n-grams as well as context presented in the form of dialogue act annotations for the previous segments. Index-tagged n-grams (n-grams tagged with the distance to the current segment) improved accuracy significantly, from 70.6% to 75.1%, and the DA annotations for the preceding segments even to 76.4%. (Liu et al., 2017) tested different kinds of context for DA classification using deep neural models. They present hierarchical models based on convolutional neural networks (CNN) for sentence representations, which they combine with dialogue history. They encode context as previous DA labels and as probabilities from system predictions, and experiment with dialogue histories of varying length. Including context information in their models, evaluated on the Switchboard corpus, resulted in a significant increase in accuracy from 77% to almost 80%. These results indicate that context should be taken into account when processing structured conversations.

The Corpus
We use the corpus of robot-assisted disaster-response team communication collected during joint exercises with first responders in the TRADR project (Kruijff-Korbayová et al., 2015). The TRADR corpus contains audio recordings and transcriptions of the speech communication in a team of firefighters using robots in the aftermath of an incident, e.g., an explosion at an industrial site. The team members have various roles: mission commander (MC), team leader (TL), and operators (OP) of multiple ground (UGV) and aerial (UAV) robots. They explore the site, searching for persons, hazard sources, fires and other relevant points of interest. The MC and the TL lead the mission. They request situation information from the OPs, who report back with updates and can also share photos taken by the robot cameras (see the example in Appendix A).
The recordings were collected during several field tests in 2015, 2016 and 2017. They amount to approximately 10 hours and contain almost 3k speech turns (see Table 1 for details). The 2015 and 2016 recordings are in German, the 2017 ones in English. For the experiments presented in this paper we used the original English data as well as English translations of the German data. We started with English because of the available resources.
Before annotating DAs following the ISO 24617-2 scheme (Bunt et al., 2012, 2017), we segmented the data into utterances: we split and merged some turns to obtain appropriate spans for assigning DAs. This resulted in 2469 utterances.
The ISO scheme defines several dimensions and, for each of them, a hierarchy of communicative functions (a.k.a. DAs). The first author and another annotator independently annotated each utterance with one of the dimensions and a corresponding DA. Inter-annotator agreement was κ = .77 for dimension assignment and κ = .55 (weighted κ = .66) for the generic communicative functions in the Task dimension. For the experiments in this paper we used the first author's annotations as the gold reference. We focused on the classification of DAs from the dimensions Task, Feedback and Turn Management (see Table 2 for the labels used).
We annotated the corpus in full compliance with the ISO scheme. Since some DAs had too few occurrences in the corpus, we used a simplified set of DA labels in the experiments (see §4.1). The simplified labels result from a direct mapping of the ISO scheme labels (see Table 2), making it easy to compare DA classification results. In most cases the simplified labels can be seen as generalized ISO DAs, selected based on their utility for the disaster response domain.
The mission interactions consist of threads, which are dialogue sequences in which two (occasionally more) team members talk about a task or a situation update, e.g., the TL talks to an OP, as illustrated in the example in Appendix A. A new thread is initiated by establishing contact following the standard radio communication protocol. Threads are a good candidate for dialogue context, and we used thread history in some experiments, as described in the next sections.


Experiments

Pre-processing
Before running the experiments we pre-processed the data as follows.
First, we collapsed DA labels which had very low frequency in the corpus with more frequent ones. For instance, there were only 2 cases of AddressSuggestion and 9 cases of AcceptOffer in total. Low-frequency labels would introduce noise and prevent the classifier from learning reliable patterns. Moreover, there were some ambiguous cases with several possible annotations (e.g., Inform and Promise for "I'll send it over to you"), where we decided to retain the most frequent label to reduce the perplexity. Table 2 shows the mapping of the manually annotated ISO scheme labels to the DA labels used for automatic classification. The resulting distribution of DA labels is shown in Table 3.
Second, we removed all punctuation. Although punctuation can be a good clue for some DAs (e.g., "?" usually indicates a Question), we removed it because the ASR software often does not provide punctuation reliably. We also transformed all texts to lower case and padded sequences when using neural networks. For 10-fold cross-validation we split the 2469 utterances into 2222 for training and 247 for testing in each fold partition.
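The pre-processing steps above can be sketched as follows (an illustrative helper, not our actual code; the padding token and maximum length are assumptions):

```python
import string

def preprocess(utterance, max_len=10, pad_token="<pad>"):
    """Lowercase, strip punctuation, tokenize and pad an utterance
    to a fixed length (truncating longer ones)."""
    text = utterance.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))

print(preprocess("Do you copy?", max_len=5))
# ['do', 'you', 'copy', '<pad>', '<pad>']
```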

Baselines
We implemented three baselines. The majority baseline assigned each utterance the most frequent label for the given role, i.e., all MC/TL utterances were annotated as Contact and all OP utterances as Inform. This resulted in an accuracy of 34.8%. The fact that all TL utterances were classified as Contact was an obvious drawback. We therefore tried a relative-frequency baseline as an alternative, using the relative frequencies of each DA over the complete corpus (cf. Table 3). Each utterance was assigned a random class based on the relative frequencies. This baseline had an accuracy of 24.7%. The majority baseline, which used solely the role, was thus substantially better than the frequency-based random baseline.
The third, mixed baseline was based on the assumption that all instances of Contact are identified correctly, while for all other utterances we used the majority baseline. It therefore assigned Request to all MC/TL utterances and Inform to all OP utterances which were not labeled as Contact. This baseline had an accuracy of 47.2%. Since all three baselines performed so poorly, we considered the results of the FastText classifier as the baseline for evaluating the performance of the neural models.
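A minimal sketch of the role-based majority baseline on toy data (the utterances, roles and labels below are invented for illustration):

```python
from collections import Counter

def majority_baseline(train, test):
    """Assign each test utterance the most frequent DA label
    observed for its speaker role in the training data."""
    by_role = {}
    for role, _, label in train:
        by_role.setdefault(role, Counter())[label] += 1
    majority = {role: counts.most_common(1)[0][0] for role, counts in by_role.items()}
    preds = [majority[role] for role, _, _ in test]
    gold = [label for _, _, label in test]
    acc = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return preds, acc

# Toy training and test items: (role, utterance, DA label).
train = [("TL", "do you copy", "Contact"), ("TL", "go ahead", "Contact"),
         ("TL", "check the hall", "Request"), ("OP", "roger", "Confirm"),
         ("OP", "i see smoke", "Inform"), ("OP", "victim found", "Inform")]
test = [("TL", "come in", "Contact"), ("OP", "moving in", "Inform")]
preds, acc = majority_baseline(train, test)
print(preds, acc)  # ['Contact', 'Inform'] 1.0
```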

FastText
As the first model for DA classification we tested FastText, an open-source library for text classification and representation using supervised learning with multinomial logistic regression. Although it can represent input text in the form of embeddings, it belongs to the family of linear classifiers. We ran FastText with the parameters recommended for a small training set (10 dimensions, learning rate 0.5, 20 epochs). The average accuracy over 10-fold cross-validation was 74.0%, and it was consistent across the folds (see Table 4). Because of the strong correlation between the speaker role and the DA distribution, shown in Table 3, we also experimented with including the role as a special token at the beginning of each utterance. This additional information improved the average accuracy to 75.6% and also the accuracy in most folds (see Table 4). Finally, we tested the effect of adding the dialogue thread context: we appended the corresponding thread history to each utterance and trained FastText on this extended input. Accuracy dropped for all folds, to 64.0% on average, as shown in Table 4.
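Conceptually, FastText's supervised mode averages word embeddings and applies multinomial logistic regression on top. The NumPy sketch below mimics that idea on toy data; it is not the fastText library (in particular, the embeddings stay fixed here, whereas fastText also trains them), and the vocabulary and labels are invented. The speaker role is prepended as a special token, as in our role experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"do": 0, "you": 1, "copy": 2, "roger": 3, "i": 4,
         "see": 5, "smoke": 6, "tl": 7, "op": 8}
labels = {"Contact": 0, "Inform": 1}
dim, n_classes = 10, len(labels)
E = rng.normal(0, 0.1, (len(vocab), dim))   # fixed word embeddings
W = np.zeros((dim, n_classes))              # softmax weights (trained below)

def encode(tokens):
    # Average the embeddings of all tokens (role token included).
    return E[[vocab[t] for t in tokens]].mean(axis=0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy training set with the role prepended as a special token.
data = [(["tl", "do", "you", "copy"], "Contact"),
        (["op", "i", "see", "smoke"], "Inform")]
for _ in range(200):                        # plain SGD, learning rate 0.5
    for tokens, y in data:
        x = encode(tokens)
        p = softmax(W.T @ x)
        W -= 0.5 * np.outer(x, p - np.eye(n_classes)[labels[y]])

pred = int(np.argmax(softmax(W.T @ encode(["tl", "do", "you", "copy"]))))
print(pred)  # 0, i.e. "Contact"
```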

Neural Networks
Neural networks have already shown great potential in tackling various NLP tasks, including DA classification (Chen et al., 2018; Liu et al., 2017). We therefore also tested various neural architectures to classify DAs in our corpus: Feed-Forward Neural Networks (FFNN); Recurrent Neural Networks (RNN), in particular Long Short-Term Memory (LSTM) and bidirectional LSTM models; and Convolutional Neural Networks (CNN). We experimented with attention and different kinds of embeddings (including Word2Vec, GloVe and FastText). We also tested the effect of dialogue context in the form of the preceding thread history concatenated with the current utterance. We present the models and the DA classification results in the next sections.

Feed-Forward Neural Networks
We implemented a simple FFNN using the Keras library, with one Embedding layer (we experimented with 100, 200 and 300 dimensions), and applied global average pooling to average the embeddings of all words in the utterance before sending them through the Dense layer. The architecture is shown in Figure 1.
We set the minibatch size to 8, trained the network for 5 epochs and used Adam as the optimizer. We trained several models using the Embedding layer provided by Keras as well as pre-trained GloVe embeddings obtained from the Stanford NLP group website, which were trained on data from Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocabulary). We also experimented with both frozen and trainable embeddings. The results were consistently better with trainable embeddings than with the frozen version. Table 5 shows the evaluation results with accuracy scores averaged across 10 folds.
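The forward pass of this architecture (Embedding, global average pooling, Dense softmax) can be sketched in NumPy as follows; the shapes are illustrative and random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim, n_classes, max_len = 50, 100, 9, 12
E = rng.normal(size=(vocab_size, dim))      # Embedding layer weights
W = rng.normal(size=(dim, n_classes))       # Dense layer weights
b = np.zeros(n_classes)

def forward(token_ids):
    emb = E[token_ids]                      # (max_len, dim)
    pooled = emb.mean(axis=0)               # global average pooling over time
    logits = pooled @ W + b
    p = np.exp(logits - logits.max())
    return p / p.sum()                      # softmax over DA classes

probs = forward(rng.integers(0, vocab_size, size=max_len))
print(probs.shape)  # (9,)
```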

Convolutional Neural Networks
Inspired by the results on DA classification with CNNs in (Liu et al., 2017), we also tested CNNs with a varying number of convolutional layers and filter sizes on our data. Figure 2 shows a sample architecture with two convolutions and 128 filters of size 5. We also tested CNNs with different embeddings. The best performance (average accuracy 72.1%) was achieved by the model with one convolutional layer, filter size 10 and embeddings with dimensionality 100.
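The convolutional component can be sketched as a single 1-D convolution with filter size 10 over the embedded utterance, followed by ReLU and max pooling over time (random weights and illustrative shapes, not our Keras configuration verbatim):

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, dim, n_filters, width = 20, 100, 128, 10
X = rng.normal(size=(seq_len, dim))             # embedded utterance
F = rng.normal(size=(n_filters, width, dim))    # convolution filters

def conv1d_maxpool(X, F):
    """Valid 1-D convolution over time, ReLU, then max pooling over time."""
    n_pos = X.shape[0] - F.shape[1] + 1
    feats = np.empty((n_pos, F.shape[0]))
    for i in range(n_pos):
        window = X[i:i + F.shape[1]]            # (width, dim)
        feats[i] = np.maximum(
            np.tensordot(F, window, axes=([1, 2], [0, 1])), 0.0)  # ReLU
    return feats.max(axis=0)                    # (n_filters,)

features = conv1d_maxpool(X, F)
print(features.shape)  # (128,)
```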

Recurrent Neural Networks
We tested RNNs with Long Short-Term Memory (LSTM) cells, both unidirectional and bidirectional. We also applied an attention mechanism and experimented with various embeddings and regularization parameters. In some experiments we concatenated all previous utterances from the same thread with the current utterance in order to give more context to the classifier, inserting a #START# symbol between the current utterance and the thread text as a separator.
Figure 3 shows the RNN architecture with a bidirectional LSTM and attention mechanism. The attention layer follows the idea proposed in (Raffel and Ellis, 2015). We passed the generated word vectors through the bidirectional LSTM and multiplied the input with the attention vector at each time step. The result was passed through a Dense layer with ReLU as the activation function. Dropout of 0.25 was applied to its output before the final Dense layer. We tested this model with single utterances as well as with utterances concatenated with their corresponding thread history, with and without attention. The results of the different RNN architectures are in Table 7. The best accuracy of 78.0% was achieved by the model which used the thread history and pre-trained GloVe embeddings with trainable weights, without attention.
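The attention step can be sketched in the style of (Raffel and Ellis, 2015): each BiLSTM output state receives a scalar score from a small learned function, the scores are normalized with a softmax over time, and the states are summed with those weights. A NumPy sketch with random values standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
T, hidden = 15, 64                  # time steps, BiLSTM output size
H = rng.normal(size=(T, hidden))    # BiLSTM outputs (illustrative)
w = rng.normal(size=hidden)         # attention parameters

def attend(H, w):
    scores = np.tanh(H @ w)         # e_t = a(h_t), one scalar per time step
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()            # softmax over time
    return alpha @ H                # weighted sum -> fixed-size context vector

c = attend(H, w)
print(c.shape)  # (64,)
```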
In the experiments described above we noticed that simply concatenating the thread history with the current utterance as plain text does not always help. We therefore designed the "Divide & Merge" (D&M) architecture, which encodes the current utterance and the thread history separately and merges the two representations only at the final stage. The results are shown in Tables 8 and 9.
Table 8 shows the results for various experimental settings. First, we report the accuracy scores obtained by the D&M model without LSTM, the D&M model which uses LSTMs for encoding both turn and thread utterances, and the D&M model which uses an LSTM only for turns while the thread information is encoded using one Embedding layer and global average pooling, as shown in Figure 4. The model with the turn-only LSTM achieved the best accuracy, 79.8%. Second, we also compared different word embeddings (GloVe, Word2Vec and FastText) and found that pre-trained GloVe embeddings with 200 dimensions work best on our data.
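The best-performing configuration can be sketched as follows: an LSTM encodes the current turn, the thread history is encoded by embedding plus global average pooling, and the two representations are concatenated before the output layer. Below is a minimal NumPy forward pass with a hand-rolled LSTM cell and random weights; the dimensions and names are illustrative, not our exact Keras implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
dim, hidden, n_classes = 200, 32, 9
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# LSTM parameters for the turn branch (gates stacked as i, f, o, g).
Wx = rng.normal(0, 0.1, (4 * hidden, dim))
Wh = rng.normal(0, 0.1, (4 * hidden, hidden))
b = np.zeros(4 * hidden)

def lstm_last_state(X):
    """Run a single-layer LSTM over X (time, dim); return the final hidden state."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in X:
        z = Wx @ x + Wh @ h + b
        i = sigmoid(z[:hidden])
        f = sigmoid(z[hidden:2 * hidden])
        o = sigmoid(z[2 * hidden:3 * hidden])
        g = np.tanh(z[3 * hidden:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

turn = rng.normal(size=(8, dim))        # embedded current utterance
thread = rng.normal(size=(40, dim))     # embedded thread history
# "Divide": encode the two inputs separately; "Merge": concatenate.
merged = np.concatenate([lstm_last_state(turn), thread.mean(axis=0)])
W_out = rng.normal(0, 0.1, (n_classes, merged.size))
logits = W_out @ merged
p = np.exp(logits - logits.max()); p /= p.sum()
print(merged.shape, p.shape)  # (232,) (9,)
```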

Discussion
To compare the performance of the D&M model (accuracy 79.8%) against that of the FastText classifier (accuracy 75.6%) we applied a randomized test with 10,000 trials. The resulting p-value of 0.0001 indicates a significant difference. The accuracy of both FastText and D&M is also significantly better than that of the baselines (24.7% for the relative-frequency baseline, 34.8% for the majority baseline and 47.2% for the mixed baseline). We also compared the performance of the D&M model with threads to the same model without thread information. The results are in Table 11. Note that Tables 10 and 11 show average precision, recall and F1 score for two cases: with and without the category Negative, which turned out to be very difficult to classify. Table 11 shows that the F1 score increases when the thread information is provided as an additional input to the model. For all DAs except Disconfirm and Negative we observe an improvement in terms of precision, recall and F1 score. The poor performance of the D&M model on the categories Negative and Disconfirm could be due to the fact that some threads are interconnected and Negative is often a response to the previous thread. For instance, in one thread the OP says "I will put snapshots in ..." and in the next thread the TL says "I don't have snapshots", which should be interpreted as Negative with respect to the previous statement. However, D&M classifies the utterance as Inform because it does not see the connection between the two threads. Overall, the D&M model makes better use of the thread history than FastText and seems to offer a better model for structured conversations.
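The significance test can be sketched as an approximate-randomization test over paired per-utterance correctness indicators (an illustrative implementation; the exact procedure we used may differ in detail):

```python
import random

def randomization_test(correct_a, correct_b, trials=10000, seed=0):
    """Approximate-randomization test: randomly swap the paired correctness
    indicators of the two systems and count how often the accuracy gap is at
    least as large as the observed one. Returns a smoothed p-value."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    hits = 0
    for _ in range(trials):
        a = b = 0
        for ca, cb in zip(correct_a, correct_b):
            if rng.random() < 0.5:      # swap this pair with probability 0.5
                ca, cb = cb, ca
            a += ca
            b += cb
        if abs(a - b) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Toy example: system A correct on 10 of 12 items, system B on 2 of 12.
p = randomization_test([1] * 10 + [0] * 2, [1] * 2 + [0] * 10, trials=2000)
print(p < 0.05)
```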
In general, the independence assumption made by FastText impairs the classification performance. However, adding the thread history resulted in an accuracy drop from 75.6% to 64.0% (cf. §4.3). This means that it is not only the thread information that matters for correct classification but also the way this information is encoded and processed by the classifier. Whereas FastText treats the current utterance and the thread history in a bag-of-words fashion, the D&M model treats them as two independent inputs which are processed by two different parts of the network, and their representations are concatenated only at the final stage.
We also tested several models on the part of the Switchboard corpus available in the DialogBank (Bunt et al., 2016). After pre-processing similar to that applied to our corpus we had 443 utterances, which we split into 333 (75%) for training and 110 (25%) for testing. FastText achieved an accuracy of 60%. Among the neural models, a simple FFNN using the Embedding layer initialized with pre-trained 100-dimensional GloVe embeddings achieved the best accuracy, 73.6%. The D&M model could not be applied to the DialogBank Switchboard data because there are no clearly delimited threads. It would be interesting to test the D&M approach on other corpora with dialogues structured into threads similarly to ours.

Conclusions
We presented the results of dialogue act classification in robot-assisted disaster response team communication. We experimented with a FastText classifier and various neural models using FFNNs, RNNs and CNNs with different types of embeddings and context information, with and without attention. We found that including the speaker role is beneficial, whereas adding the previous sentence as dialogue context leads to lower accuracy. This might be due to the fact that dialogues in our corpus consist of threads, and concatenating an utterance with a preceding one from a different thread causes erroneous predictions. We then designed the Divide & Merge model, where we added the thread history in a separate layer and concatenated not the texts but their vector representations. This resulted in a significant improvement, with an average accuracy of 79.8%. Using LSTM cells was beneficial for utterance encodings, but the thread history was better encoded using the Embedding layer and global average pooling. Pre-trained GloVe embeddings with dimensionality 200 performed best on our data, and the results were slightly better with trainable embeddings. This could be due to the fact that in our corpus some words have non-standard interpretations based on the communication protocol (e.g., "roger that"), which are learned from the corpus when we use trainable embeddings.
Incorporating thread information significantly improved DA classification. In the future we wish to investigate further the nature and importance of threads in team communication, e.g., whether to model threads implicitly (as we did) or explicitly; how to best segment them; how important it is to represent intertwined threads; and whether information throughout a thread is used for interpretation or the influence is more local, at the thread boundary.
In future work we will also apply the models presented here to the German data in the TRADR corpus; test their performance on the outputs of ASR without any editing by human annotators; and look for ways to further improve performance, e.g., by enlarging the corpus with relevant dialogues from other corpora. We will develop models for the recognition of mission tasks and for distinguishing task requests and commitments by the team members from other task mentions. We will then combine dialogue act and task recognition in a single model. We will release the corpus with the ISO dialogue act annotations later this year.
The models we develop are being integrated as part of the speech processing pipeline in a mission-support system that provides process assistance and facilitates the creation of mission documentation (Willms et al., 2019). It will be evaluated in practice with and by first responders.

Further manual checking of the classification results confirmed that the D&M model could better handle DAs which depend on the context. Table 12 illustrates this: In Thread 1 FastText almost always picked Inform as the most likely label, whereas D&M assigned more DAs correctly. In Thread 2 FastText assigned Contact for "Yeah, o... I am driving closer now". Although there were some instances of Contact in the training corpus starting with "yeah", Contact is not a good candidate in this case given that the previous utterance was labeled as Request. This shows that thread history has an impact on the output of the D&M model.

Table 2 :
Mapping of ISO annotation labels to labels for automatic classification

Table 3 :
Dialogue act distribution

Table 6 :
DA classification results for CNN models trained on our data.

An overview of the results obtained with various CNN architectures is in Table 6. Interestingly, more complex models resulted in worse scores. Convolutions appear not very useful for the relatively short texts of dialogue utterances.

Table 9 :
Divide&Merge 10-fold cross-validation

Table 10 contains the results for precision, recall and F-score per DA.

Table 10 :
FastText and D&M results per DA

Table 11 :
D&M results with and without threads

Table 12 :
Sample DA classification results by FastText and D&M. Correctly assigned DAs are typeset in bold.