Gated Multi-Task Network for Text Classification

Multi-task learning with Convolutional Neural Networks (CNNs) has shown great success in many Natural Language Processing (NLP) tasks. This success can largely be attributed to feature sharing by fusing some layers among tasks. However, most existing approaches share features fully or proportionally without distinguishing how helpful they are. As a result, the network can be confused by unhelpful or even harmful features, generating undesired interference between tasks. In this paper, we introduce a gate mechanism into multi-task CNNs and propose a new Gated Sharing Unit, which filters the feature flows between tasks and greatly reduces the interference. Experiments on 9 text classification datasets show that our approach learns selection rules automatically and gains a substantial improvement over strong baselines.


Introduction
The combination of multi-task learning and neural networks has shown its advantages in many tasks, ranging from computer vision (Misra et al., 2016; Ruder et al., 2017) to natural language processing (Collobert and Weston, 2008). Multi-task learning (MTL) can share knowledge among jointly trained tasks, which implicitly increases the training material (Caruana, 1997). The shared knowledge helps the network learn a more universal representation of the inputs. Inspired by this, many DNN-based approaches (Liu et al., 2015) utilize multi-task learning to improve their performance.
The scheme for information sharing is the linchpin of an elaborate multi-task network. Most existing work attempts to find an appropriate proportion for sharing layers between tasks, whether by entirely reusing the shallow layers (Liu et al., 2015; Caruana, 1993) or by adding the layers up at a fixed ratio (Fang et al., 2017). The latter scheme has recently shown its advantage in controlling the relational intensity among tasks and has become prevailing; several models adopt this idea to enhance performance (Liu et al., 2015, 2016). However, under the scheme of proportional addition (Ruder et al., 2017; Misra et al., 2016), all features are shared with the same weight between every pair of tasks. Unhelpful or harmful features may be transported between tasks with the same importance as helpful ones; in other words, interference is generated. This burdens the network with distinguishing the helpful features and can even mislead its predictions.
To solve the above problem, we propose a new CNN-based architecture for multi-task learning that shares features selectively. Our model allocates a private subnet to each task and transports features between the subnets through a well-designed module, the Gated Sharing Unit. It filters features with a gate mechanism (Chung et al., 2014; Srivastava et al., 2015) and selects the helpful ones to benefit the task at hand, which expands the feature space and provides more evidence for correct predictions. Our model is end-to-end, and the proposed Gated Sharing Unit is easy to train.
We conduct extensive experiments on 9 benchmark datasets for text classification. The results show that our model greatly improves performance, surpassing single-task models and other competitors.

Gated Multi-Task Network
To make full use of multiple datasets while avoiding interference, we introduce a new structure for multi-task learning in this section. The structure is designed in a separative way: every task owns a private subnet. To share features across the subnets, a gate mechanism selectively allows features to be exchanged. Our model can be trained end-to-end, needing no extra supervision or handcrafted hyperparameters, and it can easily be transferred to other networks such as DNNs, RNNs, LSTMs, etc. Figure 1 illustrates the model structure and other details.

Model Architecture
A multi-task model with deeper layers shared can augment deeper knowledge and greatly increase the feature space. But undesirable interference inevitably comes with these benefits, especially between less related tasks, burdening the models with the overhead of distinguishing helpful features.
To overcome this problem, we assign each task a private subnet as illustrated in Figure 1. Tasks are relatively separate and can borrow useful information from one another through a bridge, the Gated Sharing Unit (GSU). The weight of each feature in this unit is learned automatically from previous layers, needing no extra supervision, so there is more selectivity across the tasks. By filtering out useless features, tasks receive less interference from each other.

Figure 2: Illustration of the Gated Sharing Unit.

Gated Sharing Unit
To reduce interference, it is important to filter the information flows among the tasks. Hence, in this section, we introduce the gate mechanism, which originates from the cells of recurrent neural networks such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Chung et al., 2014). In existing studies, the gate mechanism not only shows its convenience for training (Srivastava et al., 2015) but also acts as a tool to route information (He et al., 2016). Inspired by the gate mechanism, we propose a new module, the GSU, to control the information flows and selectively share features among tasks. The details of this module are shown in Figure 2. For notation, we refer to C as the collection of N tasks, C = {1, 2, ..., N}. For a sample from an arbitrary task j, a series of feature maps is generated in the subnets. When task j borrows features from task k, a gate g is inserted to select the helpful ones, which is calculated from the prior layer by

g^l_{jk} = σ(W^l_{jk} F^l_k + b^l_{jk}),    (1)

where l denotes the level of the layer, W^l_{jk} and b^l_{jk} are learnable parameters, and σ denotes the sigmoid activation, which guarantees the values of g lie in [0, 1]. Note that the gate g^l_{jk} is a vector: each component controls the passage of a corresponding feature, moving between pass and interception, or settling on a middle ground if needed.
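As a minimal sketch (NumPy, with hypothetical weight shapes and values; the function names are ours), the gate computation amounts to a per-feature sigmoid whose output scales each borrowed feature:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(F_k, W, b):
    # Gate vector for borrowing task k's features into task j:
    # each component lies in (0, 1) -- near 1 passes the feature,
    # near 0 intercepts it, intermediate values scale it down.
    return sigmoid(W @ F_k + b)

# Toy example: a 4-dimensional feature map from task k.
F_k = np.array([0.5, -1.0, 2.0, 0.1])
W = 3.0 * np.eye(4)          # hypothetical learned weights
b = np.zeros(4)
g = gate(F_k, W, b)
filtered = g * F_k           # features task j actually receives
```

Because the gate is a vector rather than a scalar, each feature can be passed or blocked independently.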
For task j, the output F^{l+1}_j is calculated by fusing the lower-layer features F^l from all the tasks:

F^{l+1}_j = F^l_j + Σ_{k∈C, k≠j} g^l_{jk} ⊙ F^l_k,    (2)

where ⊙ denotes element-wise multiplication. To represent the output for all tasks in C, we can stack Eq. (2) in matrix form:

F^{l+1} = G̃^l ⊙ F^l,    (3)

where F^l = [F^l_1, ..., F^l_N]^T stacks the feature maps of all tasks, and G̃^l is the gate matrix whose (j, k) entry is g^l_{jk} for j ≠ k, with the all-ones vector on the diagonal; the matrix-vector product is carried out with ⊙ as the elementary multiplication. From Eq. (2) and (3), we see that, in the GSU, the feature map of the current task passes directly into the next layer, while the features from other tasks are merged into the current task only after selection by the gates. In this way, the shared features tend to be pure and helpful for the current task, which avoids the harmful interference present in conventional models.
For comparison, here we briefly describe the methods that share features by proportional addition (Misra et al., 2016; Ruder et al., 2017; Fang et al., 2017). They can be constructed by inserting a scalar weight α^l_{jk} between every pair of tasks j, k. α^l_{jk} is updated by back-propagation and reflects the degree of association between the tasks, but it does not select features. In this paper, we refer to this kind of model as PA-CNN.
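The difference can be made concrete: in a proportional-addition scheme, a single scalar per task pair scales every borrowed feature equally, so a harmful feature crosses with the same strength as a helpful one (illustrative sketch, not the cited authors' code):

```python
import numpy as np

def pa_fuse(F_j, F_k, alpha):
    # PA-CNN-style sharing: one scalar alpha per task pair --
    # every feature from task k is added with the same weight.
    return F_j + alpha * F_k

F_j = np.array([1.0, 1.0])
F_k = np.array([4.0, -4.0])   # suppose the second feature is harmful
shared = pa_fuse(F_j, F_k, alpha=0.5)
# The harmful feature is transported as strongly as the helpful one;
# a vector gate could instead suppress it individually.
```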

Output Layer and Loss
In the last layer of task j, the vector representation F̂_j of the input sequence is fed into a corresponding softmax layer matching the number of classes, which emits the predicted probability distribution for task j:

ŷ_j = softmax(W_j F̂_j + b_j),    (4)

where ŷ_j is the predictive result, W_j is the weight of the fully connected layer, and b_j is the bias term. Given the predictions of all tasks, a global loss function forces the model to minimize the cross-entropy between the predicted and true distributions over all the tasks:

L = Σ_{j=1}^{N} λ_j · CE(ŷ_j, y_j),    (5)

where y_j is the true distribution for task j, CE(·, ·) denotes cross-entropy, and λ_j is the weight for task j. In this paper, we set λ_j to 1/N for all N tasks to keep a balance.
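A sketch of the output layer and the balanced global loss, assuming one-hot labels and λ_j = 1/N (the function names and shapes are ours, not the authors'):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def task_loss(F_hat, W, b, y):
    # Softmax output for one task and its cross-entropy
    # against the one-hot label vector y.
    y_hat = softmax(W @ F_hat + b)
    return -np.sum(y * np.log(y_hat))

def global_loss(per_task_losses):
    # Weighted sum over tasks with lambda_j = 1/N.
    N = len(per_task_losses)
    return sum(l / N for l in per_task_losses)

# Toy check: a 2-class head with zero parameters predicts
# a uniform distribution, giving a loss of -log(0.5) per task.
F_hat = np.zeros(3)
W = np.zeros((2, 3)); b = np.zeros(2)
y = np.array([1.0, 0.0])
l = task_loss(F_hat, W, b, y)
total = global_loss([l, l])
```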

Experiments
In this section, we demonstrate the empirical performance of our model on 9 related benchmark tasks for text classification and compare the results with state-of-the-art models.

Datasets
As Table 1 shows, we select 9 related benchmark datasets for text classification. The first 6 are product-review datasets, comprising Amazon product reviews from 6 domains, including books, DVDs, cameras, etc. These corpora are classified according to positive or negative sentiment. They are collected from the raw data published by Blitzer et al. (2007).
The remaining 3 datasets are RN, SUBJ, and TREC. RN is a news topic classification dataset collected from Reuters Newswire and published by Velasco et al. (1994); SUBJ is a subjectivity dataset whose task is to classify a sentence-level text as subjective or objective (Pang et al., 2004); the TREC dataset has the task of classifying a question into 6 types (the questions concern location, person, numeric information, etc.) (Li and Roth, 2002).

Hyperparameters and Training
For all the experiments, we employ Word2Vec (Mikolov et al., 2013) to initialize the word vectors; the vectors are trained on Google News, about 100 billion words, have a dimensionality of 300, and use the continuous bag-of-words architecture. All other parameters are initialized with random values drawn uniformly from [-0.1, 0.1]. For every subnet we use: rectified linear units, filter windows of 3, 4, 5 with 100 feature maps each, a mini-batch size of 50, a dropout rate of 0.5, an l2 constraint of 3, and a learning rate of 10^-3. All hyperparameters are chosen via a small grid search on the dev set. For datasets without a standard dev set, we randomly select 10% of the training data as the dev set.

Table 2: Accuracies of our model against other state-of-the-art methods. The Single-Task column shows the results of plain DCNN (Kalchbrenner et al., 2014), LSTM (Jozefowicz et al., 2015), and BiLSTM. The first 3 models in the Multi-Task column show the results of multi-task models: MT-DCNN (Liu et al., 2015), MT-RNN, and MT-CNN (Collobert and Weston, 2008). The remaining PA-CNN and GMT-CNN columns show the performance of proportional addition and the gate mechanism, respectively. The number in round brackets denotes the average improvement over BiLSTM.
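For reference, the hyperparameter settings described above can be consolidated into a single configuration; the key names below are hypothetical, not taken from the original implementation:

```python
# Hypothetical consolidation of the experimental hyperparameters;
# key names are illustrative, values are as reported in the text.
CONFIG = {
    "embedding_dim": 300,          # Word2Vec, Google News
    "init_range": (-0.1, 0.1),     # uniform init for other parameters
    "activation": "relu",
    "filter_windows": [3, 4, 5],
    "feature_maps_per_window": 100,
    "batch_size": 50,
    "dropout": 0.5,
    "l2_constraint": 3.0,
    "learning_rate": 1e-3,
    "optimizer": "adadelta",
    "dev_split": 0.1,              # when no standard dev set exists
}
```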
The whole network is trained by stochastic gradient descent with the Adadelta update rule (Zeiler, 2012).

Table 2 compares the accuracies. All results for the multi-task learning models are achieved by training simultaneously on the 9 datasets. From the table, we can see that the models employing multi-task learning improve performance on most tasks beyond the single-task models, among which our model achieves the highest accuracies. Specifically, our model boosts performance by 3.3% over the best single-task model, BiLSTM, outstripping the other multi-task models by at least 0.9%. Additionally, we compare our model with PA-CNN, a variant that keeps the structure of GMT-CNN but shares features by proportional addition. For PA-CNN, performance on several datasets falls below the single-task results due to interference. In contrast, our model shows steady improvement on all the datasets and surpasses PA-CNN by 1.3%, which indicates the effectiveness of the gate mechanism.

Visualization
To intuitively show the selection process, we design an experiment to display the values of the gates and how they block useless features. For the first convolutional layer and GSU, we visualize the activations F^1_j of the filters with normalized values and show their corresponding weights g^1_{jk} in the gate units. In this way, we can easily see what kinds of features are discarded as interference. Figure 3 illustrates the behavior of the GSU on a randomly selected sentence from the Baby task. We visualize the results of the first feature map for the DVDs subnet and the gate unit that filters the features from DVDs to the Baby task. For the positive sentence "Five stars, my baby can fall asleep soon in the stroller", we can see that the subnet for the DVDs task focuses on two critical positions, "Five stars" and "asleep". The word "asleep" is negative for the DVDs task but actually neutral for the Baby task. Our gate unit successfully lowers the intensity of the interfering "asleep", making a correct prediction, whereas PA-CNN, lacking resistance to interference, wrongly makes a negative prediction. This indicates the effectiveness of our gate mechanism for feature selection in MTL.

Conclusion
In this paper, we introduce a gate mechanism into multi-task CNNs to reduce interference. The proposed model can select the potentially useful features, reducing the interference among tasks. The effectiveness of our method is fully validated on 9 datasets for text classification and further illustrated by a visualization experiment.
In future work, we would like to investigate the effect of the memory mechanism for multi-task learning, which is similar to the gate mechanism but more complex. It originates from recurrent neural networks and has been proven effective for feature selection.