Bag-of-Words Transfer: Non-Contextual Techniques for Multi-Task Learning

Many architectures for multi-task learning (MTL) have been proposed to take advantage of transfer among tasks, often involving complex models and training procedures. In this paper, we ask if the sentence-level representations learned in previous approaches provide significant benefit beyond that provided by simply improving word-based representations. To investigate this question, we consider three techniques that ignore sequence information: a syntactically-oblivious pooling encoder, pre-trained non-contextual word embeddings, and unigram generative regularization. Compared to a state-of-the-art MTL approach to textual inference, the simple techniques we use yield similar performance on a universe of task combinations while reducing training time and model size.


Introduction
Multi-task learning (MTL) is usually framed as a discriminative learning problem in which predictors are learned jointly for multiple related tasks, under the premise that jointly optimizing related tasks will yield more robust parameter estimates.
In this work, we consider a collection of two-sequence classification tasks covering sentiment analysis and textual entailment. Previous work has shown that for these kinds of tasks, models incorporating only bag-of-words (BOW) features are competitive with models based on sequence encoders such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that build compositional sequence representations (Iyyer et al., 2015; Wieting et al., 2016; Arora et al., 2017). Arora et al. (2017) suggest that BOW models better exploit the semantics of a sequence than RNNs do. They also show that improving context-independent word-level representations may be sufficient for good performance on particular kinds of tasks. Here we ask if those findings extend to the MTL setting, and in particular how well the BOW techniques capture transfer among tasks.
* Work done while at Johns Hopkins University.
[1] Our code is available at https://github.com/felicitywang/tfmtl.
We additionally observe that the standard MTL framing does not make full use of the available labeled data, as it ignores an important type of related task: generative reconstruction of the observations (§2.3). The MTL framework naturally accommodates reconstruction simply as additional tasks.
In this paper, we: (1) consider bag-of-words techniques including pooling encoders, pre-trained word embeddings, and unigram generative regularization, and (2) demonstrate that bag-of-words techniques are competitive with sequence-level techniques in MTL for sentiment analysis and textual inference (§3).

Bag-of-Words Techniques
We employ three approaches that use only bag-of-words representations: pooling (aggregation) encoders, pre-trained word embeddings, and unigram generative regularization. These approaches do not model sequence-level interactions. We do not use contextualized encoders such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) because they incorporate sequence-level and positional representations.

Pooling Encoders
We first consider a variant of the deep averaging network (DAN) encoder (Iyyer et al., 2015). The DAN encoder is a syntactically-oblivious encoder that consists of three steps: average (mean-pool) a sequence's non-contextual word embeddings, pass the average through feed-forward layers, and then perform linear classification on the final layer's representation. We concatenate a max-pooling operation to the mean-pooling used in the first step of the original DAN encoder [2] and use a non-linear transformation in the final layer [3].
Pooling encoders such as DAN and PARAGRAM-PHRASE (which has no parameters) are much faster to train than LSTMs and CNNs, and have been shown to have competitive performance on textual similarity, textual entailment, and sentiment classification tasks (Iyyer et al., 2015; Wieting et al., 2016; Arora et al., 2017).
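To make the pooling step concrete, the following is a minimal sketch of a DAN-style encoder with concatenated mean- and max-pooling followed by a ReLU layer and a linear classifier. It is an illustration under stated assumptions, not the released implementation: the function names, dimensions, random initialization range, and the use of a single hidden layer over a single input sequence are all hypothetical choices made for brevity.

```python
import random

def mean_max_pool(embeddings):
    """Concatenate the element-wise mean and max of a list of word vectors."""
    dim = len(embeddings[0])
    mean = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]
    largest = [max(vec[d] for vec in embeddings) for d in range(dim)]
    return mean + largest

def relu_layer(x, weights, bias):
    """One feed-forward layer: ReLU(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def linear(x, weights, bias):
    """Linear classification layer producing one logit per label."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def dan_logits(token_ids, embedding_table, hidden, out):
    """Pool non-contextual embeddings, then apply the hidden and output layers."""
    pooled = mean_max_pool([embedding_table[i] for i in token_ids])
    return linear(relu_layer(pooled, *hidden), *out)

# Hypothetical toy sizes, chosen only so the example runs quickly.
rng = random.Random(0)
EMB_DIM, HIDDEN_DIM, NUM_LABELS, VOCAB = 4, 8, 3, 10
embedding_table = random_matrix(VOCAB, EMB_DIM, rng)
hidden = (random_matrix(HIDDEN_DIM, 2 * EMB_DIM, rng), [0.0] * HIDDEN_DIM)
out = (random_matrix(NUM_LABELS, HIDDEN_DIM, rng), [0.0] * NUM_LABELS)

logits_a = dan_logits([1, 5, 7], embedding_table, hidden, out)
logits_b = dan_logits([7, 1, 5], embedding_table, hidden, out)  # same bag, new order
```

Because both pooling operations are symmetric in their inputs, the two calls above yield the same logits up to floating-point rounding, which is the sense in which the encoder ignores word order.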

Pre-Trained Word Embeddings
A popular way to improve performance over the use of randomly initialized word embeddings is to use pre-trained word embeddings that have been learned from large corpora. The use of pre-trained embeddings is an example of transfer learning, which unlike MTL typically involves a pipeline of tasks rather than a joint training objective. Word embeddings are usually learned by fitting a language model (or other word prediction objective) on an out-of-domain text corpus (Mikolov et al., 2013; Pennington et al., 2014).
Although pre-trained word embeddings are learned in context and can thereby capture distributional syntactic information, good performance using pre-trained word embeddings would be evidence that sequence-aware models may not be necessary for MTL for the tasks we consider here.
Because we restrict our models to use only bag-of-words features, we seek to avoid any syntactic or sequential information that could be derived from our inputs. Any syntactic information present in pre-trained word embeddings comes from the sequences used in pre-training, not from the data in our tasks. By using pre-trained word embeddings, we seek only to determine what benefit is provided by initializing the corresponding parameters with the pre-trained embeddings rather than with random embeddings.
Additionally, contextualized encoders would capture sequential or positional information in our data inputs, so we do not use them. By not using contextualized encoders, each word has only one embedding, which is used regardless of its context.
[2] We tried combinations of mean-pooling, max-pooling, and min-pooling, and found mean-pooling + max-pooling performed the best based on held-out dev-set performance.
[3] We tried ELU, ReLU, sigmoid, and tanh, and chose ReLU based on held-out dev-set performance.

Unigram Generative Regularization
We examine the incorporation of unigram generative regularization (UGR) for all tasks, in which we reconstruct the input sequence using a conditional unigram language model $p_\theta(x \mid h)$. Intuitively, generative regularization provides signal that addresses the question, "What do inputs with a particular label tend to look like?" For example, we wish to capture information about inputs that express positive sentiment separately from information about inputs that express negative sentiment. We explore multi-task UGR in this work because we found that single-task UGR can improve performance (see Table 3). Additionally, multi-task UGR uses no additional data, so we get it "for free." UGR is inherently related to a dataset $t$'s corresponding discriminative task that learns $q_{\phi_t}(y \mid x)$, and it can be viewed as simply another task in the set of auxiliary tasks because it is realized as an auxiliary loss term.
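For arbitrary networks $q_{\phi_t}(y \mid x)$ and $p_\theta(x \mid h)$, our loss function, $\mathcal{L}_{\mathrm{GMTL}}$, on a single example is:

$$\mathcal{L}_{\mathrm{GMTL}} = -\alpha_t \log q_{\phi_t}\big(y_i \mid x_i^{(t)}\big) - \beta_t \log p_\theta\big(x_i^{(t)} \mid h\big)$$

for input $x_i^{(t)}$ and its label $y_i$. The discriminative and reconstruction task weights are $\alpha_t$ and $\beta_t$, respectively.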
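As a rough illustration of how a discriminative term and a unigram reconstruction term can be combined with task weights, the sketch below assumes both terms are negative log-likelihoods and that the unigram model scores a sequence as a product of independent per-word probabilities. The function name `gmtl_loss`, the toy probabilities, and the default weight values are hypothetical and chosen only for the example.

```python
import math

def gmtl_loss(label_probs, gold_label, unigram_probs, token_ids,
              alpha=1.0, beta=0.1):
    """Weighted sum of a classification loss and a unigram reconstruction loss.

    label_probs:   predicted distribution over labels for one example
    gold_label:    index of the correct label
    unigram_probs: conditional distribution over the vocabulary given h
    token_ids:     vocabulary indices of the words to reconstruct
    alpha, beta:   task weights (the default values here are assumptions)
    """
    discriminative = -math.log(label_probs[gold_label])
    # A unigram model treats words as independent given h, so the sequence
    # log-likelihood is a sum of per-word log-probabilities.
    reconstruction = -sum(math.log(unigram_probs[i]) for i in token_ids)
    return alpha * discriminative + beta * reconstruction

# Toy values: three labels, a three-word vocabulary, a two-word input.
loss = gmtl_loss([0.7, 0.2, 0.1], 0, [0.5, 0.25, 0.25], [0, 1])
loss_without_ugr = gmtl_loss([0.7, 0.2, 0.1], 0, [0.5, 0.25, 0.25], [0, 1], beta=0.0)
```

Setting `beta=0.0` recovers the purely discriminative loss, which makes it easy to compare settings with and without the reconstruction term.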

Experiments
As an external baseline, we compare our approach to methods proposed by Augenstein et al. (2018), herein referred to as ARS. ARS achieve state-of-the-art performance on topic-based sentiment analysis. We reimplement their baseline model as an additional comparison in our results (Table 3).
The main contributions of ARS are additional architectural components called the label embedding layer (LEL) and the label transfer network (LTN). In the baseline model, an example's two input sequences, $x_1$ and $x_2$, are encoded using a two-stage bi-directional RNN and then passed into a task-specific classification layer. In the LEL model, the task-specific classification layers are replaced by a label embedding matrix shared by all tasks. By embedding all the tasks' labels into a shared space, the LEL learns correlations among the tasks' labels.
The LTN sits on top of the LEL and induces "pseudo-labels" for main task examples based on predicted distributions over labels made by each of the auxiliary tasks. The LTN is added to the main model after a pre-training step.
We note that ARS deliberately avoid pre-trained word embeddings in order to highlight their modeling contributions. We would expect their results to improve if pre-trained embeddings were used.

Datasets
We use the same two-sequence text classification datasets covering textual entailment and sentiment analysis used by ARS: MultiNLI (Williams et al., 2018), ABSA-L/ABSA-R (Pontiki et al., 2016), Target (Dong et al., 2014), Stance, Topic-2/Topic-5 (Nakov et al., 2016), and FNC-1. All of the inputs have two sequences $(x_1, x_2)$, the second of which (usually a longer text, such as a Tweet or a news document) is read in the context of the first sequence (which is usually shorter, such as the topic/target/aspect of a Tweet, or a news headline). Detailed information about each dataset is shown in Table 1.
For each of our main tasks, we use the best-performing set of auxiliary tasks found by ARS (Table 2). To maintain comparability, we follow the same steps as ARS for preprocessing the data. In particular, MultiNLI was downsampled to the same 10K training examples (2.5%) as ARS, and so we refer to it as MultiNLI 2.5%.

Training Procedure
In all experiments, we seek to optimize performance on the main task, rather than optimize an aggregate metric across main and auxiliary tasks. We set the discriminative task weights $\alpha_t = \alpha = 1$ for all discriminative tasks, and we fix the reconstruction task weights $\beta_t = \beta$ across all reconstruction tasks for a given set of main and auxiliary tasks. We found performance improves when $\beta \ll \alpha$, which is consistent with the treatment of reconstruction as a regularizing task. In general, $\alpha_t$ and $\beta_t$ may be tuned separately for each task.
We use 100-dimensional GloVe 6B word embeddings and initialize the embeddings of words that appear in the GloVe vocabulary with their pre-trained embeddings (Pennington et al., 2014). Other words' embeddings are initialized randomly. All embeddings are fine-tuned during training.
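The initialization scheme described above can be sketched as follows. This is a minimal illustration, assuming the pre-trained vectors have already been loaded into a dictionary; the function name `init_embeddings`, the toy three-dimensional vectors, and the uniform initialization range are hypothetical and stand in for the real 100-dimensional GloVe table.

```python
import random

def init_embeddings(vocab, pretrained, dim, rng):
    """Build an embedding table: copy pre-trained vectors when available,
    otherwise draw a random vector of the same dimensionality."""
    table = {}
    for word in vocab:
        if word in pretrained:
            # Copy so that later fine-tuning does not mutate the source table.
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return table

# Toy stand-in for a pre-trained table (the values are made up).
toy_pretrained = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}
vocab = ["good", "bad", "meh"]
table = init_embeddings(vocab, toy_pretrained, dim=3, rng=random.Random(0))
```

In this sketch every entry of `table`, whether copied or random, would remain trainable, matching the statement that all embeddings are fine-tuned.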
Because we want to see if good performance can be attained without sequence-level information, we reconstruct $x_2$ using a unigram decoder, which projects the conditioning information $h$ into a distribution over the vocabulary.
The conditioning vector decomposes as $h := [\mathbf{t}, y', \pi_1]$, which consists of: (1) a one-hot encoding $\mathbf{t}$ of the task index $t$; this allows the language model to adapt to different tasks (Daumé III, 2007); (2) a task-specific projection $y' = L_t y$ of the one-hot label vector $y$, where $L_t \in \mathbb{R}^{l \times |Y_t|}$ are trainable task-specific parameters; this projection transforms labels from potentially disparate label spaces $Y_t$ of different sizes to the same space; and (3) the input encoding $\pi_1$, which conveys information about $x_1$, on which we condition the reading of $x_2$. Together, the elements of the conditioning vector $h$ provide for controllable text generation, in which the task, label, and context $x_1$ together influence the distribution over words of $x_2$ parametrized by $p_\theta$ (Hu et al., 2017).
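The construction of the conditioning vector and its projection to a word distribution can be sketched as below. This is an illustration under assumptions rather than the released code: the decoder is assumed to be a single linear layer followed by a softmax, and all names, dimensions, and random parameter values are hypothetical.

```python
import math
import random

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conditioning_vector(task_index, num_tasks, label_index, label_proj, x1_encoding):
    """Concatenate the one-hot task vector, the projected label, and the x1 encoding."""
    t = one_hot(task_index, num_tasks)
    y = one_hot(label_index, len(label_proj[0]))  # |Y_t| columns in L_t
    y_proj = matvec(label_proj, y)                # L_t y, an l-dimensional vector
    return t + y_proj + list(x1_encoding)

def unigram_distribution(h, decoder_weights, decoder_bias):
    """Assumed decoder: softmax(W h + b) over the vocabulary."""
    scores = [s + b for s, b in zip(matvec(decoder_weights, h), decoder_bias)]
    return softmax(scores)

# Hypothetical toy sizes: 2 tasks, l = 2, |Y_t| = 3, a 4-dim x1 encoding, 5 words.
rng = random.Random(0)
NUM_TASKS, LABEL_DIM, NUM_LABELS, ENC_DIM, VOCAB = 2, 2, 3, 4, 5
label_proj = [[rng.uniform(-1, 1) for _ in range(NUM_LABELS)] for _ in range(LABEL_DIM)]
x1_encoding = [rng.uniform(-1, 1) for _ in range(ENC_DIM)]
h = conditioning_vector(1, NUM_TASKS, 2, label_proj, x1_encoding)
decoder_weights = [[rng.uniform(-1, 1) for _ in range(len(h))] for _ in range(VOCAB)]
decoder_bias = [0.0] * VOCAB
word_probs = unigram_distribution(h, decoder_weights, decoder_bias)
```

Since the same `word_probs` would be used for every position of $x_2$, the decoder carries no information about word order, in keeping with the bag-of-words restriction.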

Discussion
Our experimental results are presented in Table 3. For the sake of comparison, we keep with the set of auxiliary tasks used by ARS, which are listed in Table 2. Other combinations of tasks may give better performance for the techniques we examine.
Using just bag-of-words features, our best models outperform the reimplementation of ARS's baseline bi-directional RNN model in 4 of 7 cases and achieve competitive results in the other 3 cases. Our results are also competitive with ARS's best-performing models, which may use the label embedding layer and label transfer network.
The DAN encoder in the single-task learning (STL) setting is competitive with ARS's STL results and with our STL and MTL reimplementations, confirming the findings of previous work discussed in §2.1.

Table 1: Dataset details (only the rows below are recoverable).
Dataset    Labels  Examples  x1        x2        Task
(truncated row)                                  Aspect-based sentiment analysis, restaurant domain
Target     3       5,623     Target    Text      Target-dependent sentiment analysis
Stance     3       3,209     Target    Tweet     Stance detection
Topic-2    2       5,177     Topic     Tweet     Topic-based sentiment analysis, binary
Topic-5    5       7,236     Topic     Tweet     Topic-based sentiment analysis, fine-grained
FNC-1      4       39,741    Headline  Document  Fake news detection
The inclusion of unigram generative regularization (UGR) improves STL DAN performance in 5 of 7 cases (GSTL), motivating its use in the MTL setting. If GSTL reaches the desired performance, one avoids a search over auxiliary tasks, such as those in Liu et al. (2016) and Augenstein et al. (2018). However, UGR hurts MTL performance in 6 of 7 cases (GMTL). Furthermore, GMTL performance is worse than GSTL performance in all cases, while MTL outperforms GSTL in 5 of 7 cases. These trends suggest that UGR is not needed once the regularization from incorporating auxiliary discriminative tasks takes effect. In other words, the parameter updates resulting from UGR are not as informative as the parameter updates resulting from having additional training examples from similar datasets. However, UGR may still be helpful when auxiliary training sets are not available.
Comparing STL to MTL results, we see that the DAN encoder often facilitates transfer across tasks. The best-performing MTL DAN model outperforms or equals the best-performing STL DAN model in 6 of 7 cases (all but Stance). The use of GloVe embeddings in MTL and GMTL improves performance over the use of randomly initialized embeddings because the task-independent information captured by the pre-trained word embeddings serves as good initialization.
Comparisons in training time, model size, and performance between the reimplemented ARS baseline model and the DAN model are given in Table 4 for MultiNLI 2.5% and Topic-5, the largest dataset and the dataset with the most auxiliary tasks, respectively. The DAN model is 33.4% smaller and 7.7x faster than the ARS model for MultiNLI 2.5% but achieves lower accuracy. DAN (run on a CPU) is 1.2x faster and 14.4% smaller than the ARS model (run on a GPU) for Topic-5 and achieves better performance. As expected based on prior work, the DAN encoder trains substantially faster than the bi-RNN encoder, especially for MultiNLI 2.5%.
Although the competitive results of the bag-of-words models are somewhat expected given prior work, we find the magnitude of the gains over the MTL bi-RNN reimplementation surprising, especially on Stance and Topic-2. Overall, our results extend the findings of prior work on simple sentence encoders for sentiment analysis and textual inference to the MTL setting.

Related Work
Prior work has shown that bag-of-words pooling encoders compete with sequence encoders on sentiment analysis, textual entailment, and textual similarity for single-task learning (Iyyer et al., 2015; Wieting et al., 2016; Arora et al., 2017). In this work, we explore these tasks in the MTL setting and ask if transfer among the tasks can be captured by bag-of-words features.
Recent work in MTL has explored different parameter sharing schemes in shared neural architectures. Some models incorporate inductive bias by imposing hierarchies over tasks (Søgaard and Goldberg, 2016; Hashimoto et al., 2017; Sanh et al., 2019). Ruder et al. (2017) and Liu and Huang (2018) incorporate orthogonality constraints to learn which parameters tasks should share. Previous work in MTL has also led to non-trivial training procedures. For example, Liu et al. (2017) and Chen and Cardie (2018) use adversarial training, and Ruder and Plank (2018) explore tri-training. The focus of this paper is a collection of BOW tools that form strong baselines upon which architectural or training improvements can be shown.
Ando and Zhang (2005) motivate the inclusion of auxiliary tasks for MTL. They automatically annotate unlabeled data to create a new labeled dataset that is related to the main task. In this work, our auxiliary tasks are pre-existing labeled datasets for which we include discriminative and reconstruction objectives. Criteria and heuristics for the selection of auxiliary tasks are discussed by Alonso and Plank (2017), among others.
For a given task, it is well-established that the addition of auxiliary word prediction objective terms may help regularize the representations used for prediction (Dai and Le, 2015; Kiros et al., 2015; Rei, 2017). Rei (2017) proposes a semi-supervised MTL framework for sequence tagging that incorporates a secondary language modeling objective. Like that approach, our unigram generative regularization (§2.3) requires no additional data. Our approach differs from Rei (2017) in three ways: we employ a conditional language model instead of an unconditional language model, allowing our model to learn in a supervised way from signal derived from the labels; we do not use semi-supervised learning; and we train in a multi-task setting involving both multiple datasets and a compound objective, whereas Rei (2017) optimizes a compound objective on a single dataset for each task (similar to GSTL in Table 3 of this work). To the best of our knowledge, our use of (unigram) generative regularization in the multi-task setting is novel.

Table 3: Test results. Acc: accuracy; $F_1^M$: macro-averaged $F_1$; $F_1^{FA}$: macro-averaged $F_1$ of "favour" and "against" classes; $\rho^{PN}$: macro-averaged recall, averaged across topics; $MAE^M$: macro-averaged mean absolute error, averaged across topics. ↑/↓ next to each task name indicates that higher/lower score is better. "STL": single-task setting; "MTL": multi-task setting; "(r)": reimplementation of baseline bi-directional RNN model from ARS (no Label Embedding Layer or Label Transfer Network).

Conclusion
We showed that bag-of-words techniques such as pooling encoders and non-contextual pre-trained word embeddings can capture transfer among sentiment analysis and textual entailment tasks in multi-task learning. We additionally showed that unigram generative regularization often improved single-task learning performance but not multi-task learning performance, suggesting that generative regularization is not needed once the regularization from incorporating auxiliary discriminative tasks takes effect. The bag-of-words techniques are competitive with a state-of-the-art model, thereby extending the findings of prior work on bag-of-words approaches to sentiment analysis and textual entailment to the multi-task setting.