Adversarial Training for Multi-task and Multi-lingual Joint Modeling of Utterance Intent Classification

This paper proposes an adversarial training method for the multi-task and multi-lingual joint modeling needed for utterance intent classification. In joint modeling, common knowledge can be efficiently utilized among multiple tasks or multiple languages. This is achieved by introducing both language-specific networks shared among different tasks and task-specific networks shared among different languages. However, the shared networks are often specialized in majority tasks or languages, so performance degradation must be expected for some minor data sets. In order to improve the invariance of shared networks, the proposed method introduces both language-specific task adversarial networks and task-specific language adversarial networks; both are leveraged for purging the task or language dependencies of the shared networks. The effectiveness of the adversarial training proposal is demonstrated using Japanese and English data sets for three different utterance intent classification tasks.


Introduction
In natural language processing fields, fully neural network based methods are suitable for joint modeling because they can simultaneously utilize multiple task data sets or multiple language data sets to improve the performance achieved for individual tasks or languages (Collobert and Weston, 2008). Joint modeling is also known to mitigate the data scarcity problem.
Key natural language processing technologies for spoken dialogue systems include utterance intent classification, which detects intent labels such as dialogue act (Stolcke et al., 2000; Khanpour et al., 2016), domain (Xu and Sarikaya, 2014), and question type (Wu et al., 2005) from input utterances (Stolcke, 2015a,b, 2016). One problem is that the training data are often limited or unbalanced among different tasks or different languages. Our motivation is therefore to leverage both multi-task joint modeling and multi-lingual joint modeling to enhance utterance intent classification.
Multi-task and multi-lingual joint modeling can be composed by introducing both task-specific networks, which are shared among different languages, and language-specific networks, which are shared among different tasks (Masumura et al., 2018; Lin et al., 2018). Although joint modeling is mainly intended to improve classification performance in resource-poor tasks or languages, its classification performance is degraded in some minor data sets. This is because the language-specific networks often depend on majority tasks, while the task-specific networks often depend on majority languages. What is needed are task-specific networks that are invariant to languages, and language-specific networks that are invariant to tasks.
In order to explicitly improve the invariance of language-specific and task-specific networks, this paper introduces adversarial training (Goodfellow et al., 2014). Our idea is to train language-specific networks so as to be insensitive to the target task, while training task-specific networks to be insensitive to language. To this end, we introduce multiple domain adversarial networks (Ganin et al., 2016), namely language-specific task adversarial networks and task-specific language adversarial networks, into state-of-the-art fully neural network based joint modeling; we adopt bidirectional long short-term memory recurrent neural networks (BLSTM-RNNs) with an attention mechanism (Zhou et al., 2016). To the best of our knowledge, this paper is the first study to employ adversarial training for multi-input and multi-output joint modeling.
Experiments on Japanese and English data sets demonstrate the effectiveness of the adversarial training proposal. To support spoken dialogue systems, three different utterance intent classification tasks are examined: dialogue act, topic type, and question type classification.

Related Work
Joint Modeling: In natural language processing research, joint modeling is usually split into multi-task joint modeling and multi-lingual joint modeling. Multi-task joint modeling has been shown to effectively improve individual tasks (Collobert and Weston, 2008; Liu et al., 2016a,b; Zhang and Weng, 2016; Liu et al., 2016c). Multi-lingual joint modeling is achieved by learning common semantic representations among different languages (Guo et al., 2016; Duong et al., 2016; Zhang et al., 2017b). In addition, a few studies have examined multi-task and multi-lingual joint modeling (Masumura et al., 2018; Lin et al., 2018). Unlike this previous work, our novelty is to introduce adversarial training for multi-task and multi-lingual joint modeling.
Adversarial Training: The concept of adversarial training was first proposed by Goodfellow et al. (2014), and many studies in the machine learning field have focused on adversarial training. Adversarial training has been widely used in text classification (Ganin et al., 2016; Chen et al., 2016; Liu et al., 2017; Miyato et al., 2017; Chen and Cardie, 2018). Most natural language processing papers adopted either the language invariant approach (Chen et al., 2016; Zhang et al., 2017a) or the task invariant approach (Ganin et al., 2016; Liu et al., 2017; Chen and Cardie, 2018). This paper aims to fully utilize both task adversarial training and language adversarial training. To this end, we simultaneously introduce language-specific task adversarial networks and task-specific language adversarial networks.

Proposed Method
This section details our adversarial training method for multi-task and multi-lingual joint modeling of utterance intent classification.
In the j-th task utterance intent classification for the i-th language input utterance, intent label $l^{(j)} \in \{1, \cdots, K^{(j)}\}$ is estimated from input utterance $W^{(i)}$, where $i \in \{1, \cdots, I\}$ and $j \in \{1, \cdots, J\}$. Utterance intent classification is carried out by estimating the probabilities of each intent label given the input utterance, $P(l^{(j)} | W^{(i)}, \Theta^{(i,j)})$, where $\Theta^{(i,j)}$ is the trainable model parameter for the combination of the i-th language and the j-th task. In multi-task and multi-lingual joint modeling, $\{\Theta^{(1,1)}, \cdots, \Theta^{(I,J)}\}$ are jointly trained from I language and J task data sets.
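The final decision step is not spelled out above. Assuming the usual choice of the most probable label, and writing $\hat{l}^{(j)}$ for the predicted label (a symbol introduced here only for illustration), it could be expressed as:

```latex
\hat{l}^{(j)} = \mathop{\mathrm{arg\,max}}_{l^{(j)} \in \{1, \cdots, K^{(j)}\}}
    P\left(l^{(j)} \mid W^{(i)}, \Theta^{(i,j)}\right)
```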

Main Joint Network
The proposed method is founded on a fully neural network that employs I language-specific networks, J task-specific networks, and J classification networks, as in Masumura et al. (2018).
The language-specific network is shared between multiple tasks and converts the words in the input utterance into language-specific hidden representations. Each word in the i-th language input utterance $W^{(i)}$ is first converted into a continuous representation. Next, each word representation is converted into a hidden representation by BLSTM-RNNs, which take neighboring word context information into account. The t-th language-specific hidden representation for the i-th language is given by: [...]
[...] The predicted probabilities of the j-th task intent labels, $o^{(j)} \in \mathbb{R}^{K^{(j)}}$, are given by: [...] where ATTENSUM() is a weighted sum function with self-attention, SOFTMAX() is a transformational function with softmax activation, and $\theta_o^{(j)}$ is the trainable parameter for the j-th classification network. In the main joint networks of the proposal, [...]
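The sharing pattern described in this section can be sketched in a few lines of Python. This is only an illustration of the wiring, not the authors' implementation: the component names are hypothetical placeholders, and the forward order (language-specific network, then task-specific network, then classification network) is an assumption based on the description above.

```python
from itertools import product


def build_shared_components(num_languages, num_tasks):
    """Create one placeholder per shared network, keyed by language or task index."""
    # Placeholder strings stand in for real networks (hypothetical names).
    language_nets = {i: f"language_net_{i}" for i in range(1, num_languages + 1)}
    task_nets = {j: f"task_net_{j}" for j in range(1, num_tasks + 1)}
    classifiers = {j: f"classifier_{j}" for j in range(1, num_tasks + 1)}
    return language_nets, task_nets, classifiers


def route(i, j, language_nets, task_nets, classifiers):
    """Return the components used for the i-th language and the j-th task."""
    # Assumed forward order: language-specific -> task-specific -> classification.
    return [language_nets[i], task_nets[j], classifiers[j]]


I, J = 2, 3  # two languages and three tasks, as in the experiments
language_nets, task_nets, classifiers = build_shared_components(I, J)

# Every (language, task) pair reuses the same shared components.
paths = {
    (i, j): route(i, j, language_nets, task_nets, classifiers)
    for i, j in product(range(1, I + 1), range(1, J + 1))
}

# I + 2J shared components serve all I * J language-task combinations.
num_shared_components = len(language_nets) + len(task_nets) + len(classifiers)
```

With two languages and three tasks, eight shared components cover six language-task combinations, which is what lets a small data set benefit from the larger ones.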

Adversarial Networks
The proposed method combines a language-specific task adversarial network with a task-specific language adversarial network. The task adversarial network is used for training the language-specific networks to be insensitive to target task labels, and the language adversarial network is used for training the task-specific networks to be insensitive to target language labels. In order to efficiently use stochastic gradient descent based training for optimizing the adversarial networks, we use gradient reversal layers, which pass the input vectors through unchanged during forward propagation and invert the sign of the gradients during back propagation (Ganin et al., 2016).
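The behavior of a gradient reversal layer can be illustrated with a minimal, framework-free Python sketch. The class name, the method names, and the `scale` argument are hypothetical. Under this sketch, `scale` would play the role of a weight on the reversed gradient, and a real implementation would hook into the automatic differentiation of a deep learning toolkit instead.

```python
class GradientReversalLayer:
    """Identity in the forward pass; flips and scales gradients in the backward pass."""

    def __init__(self, scale=1.0):
        # Hypothetical weight applied to the reversed gradient.
        self.scale = scale

    def forward(self, inputs):
        # Forward propagation: the input vector passes through unchanged.
        return list(inputs)

    def backward(self, upstream_gradients):
        # Back propagation: the sign of every gradient component is inverted.
        return [-self.scale * g for g in upstream_gradients]


grl = GradientReversalLayer(scale=0.5)
hidden = [0.2, -1.0, 3.0]
forward_out = grl.forward(hidden)              # same values as `hidden`
backward_out = grl.backward([1.0, -2.0, 4.0])  # signs flipped, scaled by 0.5
```

Because the layers below the reversal receive the negated gradient, they are pushed to make the adversarial classifier worse while that classifier itself is still trained normally.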
The i-th language-specific task adversarial network estimates task labels from the i-th language-specific hidden representations. The predicted probabilities of task labels, $x^{(i)} \in \mathbb{R}^{J}$, are given by: [...] where GRL() represents the gradient reversal layer, and $\theta_x^{(i)}$ is the trainable parameter. The j-th task-specific language adversarial network estimates language labels from the j-th task-specific hidden representations. The predicted probabilities of language labels, $y^{(j)} \in \mathbb{R}^{I}$, are given by: [...] where $\theta_y^{(j)}$ is the trainable parameter.
The proposed network structure shown in Figure 1 includes both joint networks and adversarial networks for two tasks and two languages. The red components are language-specific networks, the orange components are task-specific networks, and the purple components are classification networks. In addition, the green components are language-specific task adversarial networks, and the blue components are task-specific language adversarial networks.
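To make the two adversarial objectives concrete, the sketch below builds the reference distributions that each adversarial network would be trained against for a single utterance. It assumes one-hot references and 1-based indices; the function and key names are hypothetical.

```python
def one_hot(index, size):
    """Return a one-hot list with 1.0 at the 1-based position `index`."""
    return [1.0 if k == index else 0.0 for k in range(1, size + 1)]


def adversarial_targets(i, j, num_languages, num_tasks):
    """Reference distributions for an utterance from the i-th language and j-th task."""
    return {
        # Target of the i-th language-specific task adversarial network:
        # which of the J tasks the utterance belongs to.
        "task": one_hot(j, num_tasks),
        # Target of the j-th task-specific language adversarial network:
        # which of the I languages the utterance is written in.
        "language": one_hot(i, num_languages),
    }


# An utterance from the 2nd language and the 3rd task (I = 2, J = 3).
targets = adversarial_targets(i=2, j=3, num_languages=2, num_tasks=3)
```

The adversarial networks try to recover these labels, while the gradient reversal layers push the shared networks toward representations from which they cannot be recovered.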

Training
Our adversarial training proposal jointly optimizes all parameters in both the main joint networks and the adversarial networks by using all training data sets $\{D^{(1,1)}, \cdots, D^{(I,J)}\}$, where $D^{(i,j)}$ represents the set of input utterances and their reference labels. The cross-entropy loss functions of each network are defined as: [...] where $L_o$, $L_x$, and $L_y$ are the cross-entropy loss terms for the classification networks, the task adversarial networks, and the language adversarial networks, respectively. $\hat{o}_{n,k}$, $\hat{x}_{n,j'}$, and $\hat{y}_{n,i'}$ are the reference probabilities, and $o_{n,k}$, $x_{n,j'}$, and $y_{n,i'}$ are the estimated probabilities of the k-th label in the j-th task classification network, the j'-th task in the i-th language-specific task adversarial network, and the i'-th language in the j-th task-specific language adversarial network for $W_n$, respectively. Due to the use of gradient reversal layers, individual parameters are gradually updated as follows: [...] where α and β are hyper-parameters of the parameter update, and [...] is the learning rate. Note that adversarial training is suppressed by setting α and β to 0.0.
In training, we prepared an optimizer for each data set. Each learning rate is decreased when the validation loss of the target classification network increases.
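As a rough illustration of how a gradient reversal layer changes the update, the following sketch applies the generic rule of Ganin et al. (2016) to a single scalar parameter. It is not the paper's exact update equations: the function names, the scalar setting, and the numeric gradients are assumptions made for illustration only.

```python
import math


def cross_entropy(reference, estimated):
    """Cross-entropy between a reference and an estimated distribution."""
    return -sum(r * math.log(e) for r, e in zip(reference, estimated) if r > 0.0)


def update_shared(theta, grad_main, grad_adv, alpha, learning_rate):
    """Shared parameter: descend the main loss, ascend the adversarial loss."""
    # The adversarial gradient arrives with its sign inverted, weighted by alpha.
    return theta - learning_rate * (grad_main - alpha * grad_adv)


def update_adversary(theta, grad_adv, learning_rate):
    """Adversarial network parameter: ordinary descent on its own loss."""
    return theta - learning_rate * grad_adv


loss = cross_entropy([0.0, 1.0], [0.25, 0.75])

# alpha = 1.0 applies the reversed adversarial gradient to the shared parameter.
with_adv = update_shared(1.0, grad_main=0.5, grad_adv=0.25, alpha=1.0, learning_rate=0.1)
# alpha = 0.0 removes the adversarial term from the shared-parameter update.
without_adv = update_shared(1.0, grad_main=0.5, grad_adv=0.25, alpha=0.0, learning_rate=0.1)
adversary = update_adversary(1.0, grad_adv=0.25, learning_rate=0.1)
```

Setting the weight to 0.0 leaves only the main-loss term in the shared update, which matches the note above that adversarial training is suppressed in that case.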

Experiments
Our experiments employed Japanese and English data sets created for three different utterance intent classification tasks. The tasks, dialogue act (DA) classification, topic type (TT) classification, and question type (QT) classification, are intended to support spoken dialogue systems. For example, the task of English DA classification is to obtain a DA label from an input utterance. We used natural language texts as the input utterances, and individual label sets were unified between Japanese and English. The data sets employed in the experiments were corpora made for constructing spoken dialogue systems (Masumura et al., 2018). Each of the data sets was divided into training (Train), validation (Valid), and test (Test) sets. Table 1 shows the number of utterances in individual data sets, where #labels represents the number of labels. Table 2 shows English utterances and label examples for individual tasks.

Setups
We examined single-task and mono-lingual modeling, multi-task joint modeling, multi-lingual joint modeling, and multi-task and multi-lingual joint modeling with or without adversarial training. We unified network configurations as follows. Word representation size was set to 128, BLSTM-RNN unit size was set to 400, and sentence representation size was set to 400. Dropout was used for EMBED() and BLSTM(), and the dropout rate was set to 0.5. Words that appeared only once in the training data sets were treated as unknown words. We used mini-batch stochastic gradient descent, in which the initial learning rate was set to 0.1. We optimized the hyper-parameters of adversarial training (α and β) on the validation sets by varying them from 0.001 to 1.0. Other hyper-parameters were also optimized on the validation sets.
Table 3 shows the results in terms of utterance classification accuracy. For each setup, we constructed five models by varying the initial parameters and evaluated the average accuracy. Line (1) shows baseline results: single-task and mono-lingual modeling. Lines (2) and (3) show results with only multi-task joint modeling, and lines (4) and (5) show results with only multi-lingual joint modeling. Note that lines (3) and (5) show the results achieved with adversarial training. Line (6) shows multi-task and multi-lingual joint modeling results in which adversarial training was suppressed by setting both α and β to 0.0. Lines (7)-(9) show the results achieved with adversarial training. Note that the settings with bold values achieved the highest performance in our evaluation.
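The network sizes and the unknown-word rule from the setup above can be written down as a small Python sketch. The numeric values come from the text; the dictionary keys, the `<unk>` symbol, the function names, and the toy utterances are hypothetical.

```python
from collections import Counter

# Values stated in the setup; the key names are hypothetical.
CONFIG = {
    "word_representation_size": 128,
    "blstm_unit_size": 400,
    "sentence_representation_size": 400,
    "dropout_rate": 0.5,
    "initial_learning_rate": 0.1,
}

UNK = "<unk>"  # hypothetical unknown-word symbol


def build_vocabulary(training_utterances):
    """Keep words that appear at least twice in the training data."""
    counts = Counter(word for utterance in training_utterances for word in utterance)
    return {word for word, count in counts.items() if count > 1}


def replace_rare_words(utterance, vocabulary):
    """Map every word outside the vocabulary to the unknown-word symbol."""
    return [word if word in vocabulary else UNK for word in utterance]


# Toy training data: only "what" and "is" occur more than once.
train = [["what", "time", "is", "it"], ["what", "is", "this"]]
vocabulary = build_vocabulary(train)
converted = replace_rare_words(["what", "is", "that"], vocabulary)
```

Treating singleton words as unknown gives the model training examples for the unknown-word symbol, so unseen test words have a learned representation.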

Results
First, in lines (2) and (4), the classification performance deteriorated in some cases, while performance improvements were achieved in other cases. On the other hand, in lines (3) and (5), classification performance in each data set was improved by introducing adversarial training. This indicates that adversarial training was effective in improving the performance of joint modeling.
Next, line (6) shows that, relative to line (1), multi-task and multi-lingual joint modeling can improve the classification performance for Japanese TT, Japanese QT, and English TT, but classification performance was degraded for English DA and English QT. This indicates that it is difficult to simultaneously improve the classification performance for all data sets because joint modeling often depends on majority tasks or majority languages. In addition, lines (7) and (8) show that introducing either task adversarial networks or language adversarial networks yielded better performance than line (6) for all data sets. This indicates that adversarial training was effective in improving the performance of multi-task and multi-lingual joint modeling. The best results were achieved by using both language-specific task adversarial networks and task-specific language adversarial networks, line (9). These results confirm that task adversarial networks and language adversarial networks complement each other well. Of particular benefit, the proposed method demonstrated greater classification performance improvements when the number of training utterances per label was small.

Conclusions
We have proposed an adversarial training method for the multi-task and multi-lingual joint modeling needed to enhance utterance intent classification. Our adversarial training proposal utilizes both task adversarial networks and language adversarial networks for improving task-invariance in language-specific networks and language-invariance in task-specific networks. Experiments showed that the adversarial training proposal realized the benefits of joint modeling in all data sets.