Semi-supervised Stochastic Multi-Domain Learning using Variational Inference

Supervised NLP models rely on large collections of text which closely resemble the intended testing setting. Unfortunately, matching text is often not available in sufficient quantity, and moreover, within any domain of text, data is often highly heterogeneous. In this paper we propose a method to distill the important domain signal as part of a multi-domain learning system, using a latent variable model in which parts of a neural model are stochastically gated based on the inferred domain. We compare the use of discrete versus continuous latent variables, operating in a domain-supervised or a domain semi-supervised setting, where the domain is known only for a subset of training inputs. We show that our model leads to substantial performance improvements over competitive benchmark domain adaptation methods, including methods using adversarial learning.


Introduction
Text corpora are often collated from several different sources, such as news, literature, microblogs, and web crawls, raising the problem of learning NLP systems from heterogeneous data, and of how well such models transfer to testing settings. Learning from these corpora requires models which can generalise to different domains, a problem known as transfer learning or domain adaptation (Blitzer et al., 2007; Daumé III, 2007; Joshi et al., 2012; Kim et al., 2016). In most state-of-the-art frameworks, the model has full knowledge of the domain of instances in the training data, and the domain is treated as a discrete indicator variable. However, in reality, data is often messy, with domain labels not always available, or providing limited information about the style and genre of text. For example, web-crawled corpora are comprised of all manner of text, such as news, marketing, blogs, novels, and recipes, however the type of each document is typically not explicitly specified. Moreover, even corpora that are labelled with a specific domain might themselves be instances of a much more specific area, e.g., "news" articles will cover politics, sports, travel, opinion, etc. Modelling these types of data accurately requires knowledge of the specific domain of each training instance, as well as the domain of each test instance, which is particularly problematic for test data from previously unseen domains.
A simple strategy for domain learning is to jointly learn over all the data with a single model, where the model is not conditioned on domain, and directly maximises p(y|x), where x is the text input, and y the output (e.g., a classification label). Improvements reported in multi-domain learning (Daumé III, 2007; Kim et al., 2016) have often focused on learning twin representations (shared and private) for each instance. The private representation is modelled by introducing a domain-specific channel conditioned on the domain, and the shared one is learned through domain-general channels. To learn more robust domain-general and domain-specific channels, adversarial supervision can be applied in the form of either domain-conditional or domain-generative methods (Liu et al., 2016; Li et al., 2018a).
Inspired by these works, we develop a method for the setting where the domain is unobserved or partially observed, which we refer to as unsupervised and semi-supervised, respectively, with respect to domain. This has the added benefit of affording robustness where the test data is drawn from an unseen domain, through modelling each test instance as a mixture of domains. In this paper, we propose methods which use latent variables to characterise the domain, by modelling the discriminative learning problem p(y|x) = Σ_z p(z|x) p(y|x, z), where z encodes the domain, and must be marginalised out when the domain is unobserved. We propose a sequence of models of increasing complexity in the treatment of z, ranging from a discrete mixture model, to a continuous vector-valued latent variable (analogous to a topic model; Blei et al. (2003)), modelled using Beta or Dirichlet distributions. We show how these models can be trained efficiently, using either direct gradient-based methods or variational inference, for the respective model types. The variational method can be applied to domain and/or label semi-supervised settings, where not all components of the training data are fully observed.
We evaluate our approach using sentiment analysis over multi-domain product review data and 7 language identification benchmarks from different domains, showing that in out-of-domain evaluation, our methods substantially improve over benchmark methods, including adversarially-trained domain adaptation (Li et al., 2018a). We show that including additional domain-unlabelled data gives a substantial boost to performance, resulting in transfer models that often outperform domain-trained models, setting, to the best of our knowledge, a new state of the art for the dataset.

Stochastic Domain Adaptation
In this section, we describe our proposed approaches to Stochastic Domain Adaptation (SDA), which use latent variables to represent an implicit 'domain'. This is formulated as a joint model of the output classification label, y, and latent domain, z, both conditional on x:

    p(y, z|x) = p_φ(z|x) p_θ(y|x, z)

The two components are the prior, p_φ(z|x), and the classifier likelihood, p_θ(y|x, z), which are parameterised by φ and θ, respectively. We propose several different choices of prior, based on the nature of z, that is, whether it is: (i) a discrete value ("DSDA", see Section 2.2); or (ii) a continuous vector, in which case we experiment with different distributions to model p(z|x) ("CSDA", see Section 2.3).

Stochastic Channel Gating
For all of our models the likelihood, p_θ(y|x, z), is formulated as a multi-channel neural model, where z is used as a gate to select which channels should be used in representing the input. The model comprises k channels, with each channel computing an independent hidden representation using a convolutional neural network. 1 The value of z is then used to select the channel, by computing h = Σ_{i=1}^k z_i h_i, where we assume z ∈ R^k is a continuous vector. For the discrete setting, we represent the integer z by its 1-hot encoding z, in which case h = h_z. The final step of the likelihood passes h through an MLP with a single hidden layer, followed by a softmax, which is used to predict the class label y.
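As an illustrative sketch (hypothetical helper names; plain Python lists standing in for the CNN channel outputs), the gating computation might look like:

```python
# Sketch of stochastic channel gating. Each of the k channels produces a
# hidden vector h_i; the latent z in R^k gates their combination into a
# single representation h = sum_i z_i * h_i. The channel encoders
# themselves (CNNs in the paper) are out of scope here.

def gate_channels(hidden, z):
    """Weighted combination of k channel representations by gate vector z."""
    dim = len(hidden[0])
    return [sum(z[i] * hidden[i][j] for i in range(len(hidden)))
            for j in range(dim)]

def gate_discrete(hidden, z_int):
    """Discrete case: the integer z selects one channel (one-hot gating)."""
    return hidden[z_int]
```

With a one-hot z, gate_channels reduces exactly to gate_discrete, mirroring the relationship between the continuous and discrete models.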

Discrete Domain Identifiers
We now turn to the central part of our method, the prior component. The simplest approach, DSDA (see Figure 1a), uses a discrete latent variable, i.e., z ∈ [1, k] is an integer-valued random variable, and consequently the model can be considered a form of mixture model. The prior predicts z given input x, which is modelled using a neural network with a softmax output. Given z, the process of generating y is as described above in Section 2.1. The discrete model can be trained for the maximum likelihood estimate using the objective

    log p(y|x) = log Σ_{z=1}^k p_φ(z|x) p_θ(y|x, z),    (1)

which can be computed tractably, 2 and scales linearly in k.
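The marginalisation in (1) can be computed stably in log space with a log-sum-exp; a minimal sketch (the log-probabilities here are placeholders for the outputs of the neural prior and likelihood):

```python
import math

def log_marginal(log_prior, log_lik):
    """log p(y|x) = log sum_z p_phi(z|x) p_theta(y|x,z), as in Eq. (1),
    computed with the log-sum-exp trick. Both arguments are length-k
    lists of log-probabilities, one entry per discrete domain value z."""
    terms = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

The sum runs over the k discrete domain values, which is why the objective scales linearly in k.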
DSDA can be applied with supervised or semi-supervised domains, by additionally maximising the likelihood p(z = d|x) when the ground truth domain d is observed. We refer to these settings as "DSDA +sup." and "DSDA +semisup.", respectively, noting that in this setting we assume the number of channels, k, is equal to the size of the known inventory of domains, D.

Continuous Domain Identifiers
Figure 1: Model architectures for the latent variable models, DSDA and CSDA, which differ in their treatment of the latent variable: discrete (d ∈ [1, k]) or a continuous vector (ẑ ∈ R^k). The lower green model components show k independent convolutional network components, and the blue and yellow components show the prior, p, and the variational approximation, q, respectively. The latent variable is used to gate the k hidden representations, which are then used in a linear function to predict a classification label, y. During training, CSDA draws samples (∼) from q, while during inference, samples are drawn from p.

For the DSDA model to work well requires a sufficiently large k, such that all the different types of data can be clearly separated into individual mixture components. When there is not a clear delineation between domains, the inferred domain posterior is likely to be uncertain, and the approach
reduces to an ensemble technique. Thus we introduce our second modelling approach, continuous domain identifiers (CSDA), inspired by the way in which LDA models documents as mixtures of several topics (Blei et al., 2003). A more statistically efficient method would be to use binary vectors as domain specifiers, i.e., z ∈ {0, 1}^k, effectively allowing for exponentially many domain combinations (2^k). Each element of the domain z_i acts as a gate, or equivalently, attention, governing whether hidden state h_i is incorporated into the predictive model. In this way, individual components of the model can specialise to a very specific topic such as politics or sport, and yet domains are still able to combine both to produce specialised representations, such as the politics of sport. The use of a latent bit-vector renders inference intractable, due to the marginalisation over exponentially many states. For this reason, we instead make a continuous relaxation, such that z ∈ R^k with each scalar z_i drawn from a probability distribution parameterised as a function of the input x. These functions can learn to relate aspects of x to certain domain indexes, e.g., the use of specific words like baseball and innings relates to a domain corresponding to "sport", thereby allowing the text domains to be learned automatically.
Several possible distributions can be used to model z ∈ R^k. Here we consider the following: the Beta distribution, which bounds all elements to the range [0, 1], such that z lies in a hyper-cube; and the Dirichlet distribution, which also bounds all elements, as for the Beta, but additionally constrains z to lie in the probability simplex.
In both cases, 3 each dimension of z is controlled by different distribution parameters, themselves formulated as different non-linear functions of x.
We expect the Dirichlet model to perform best, based on its widespread use in topic models, and its desirable property of generating a normalised vector, resembling common attention mechanisms (Bahdanau et al., 2015). Depending on the choice of distribution, the prior is modelled as

    p_φ(z|x) = ∏_{i=1}^k Beta(z_i; α_i, β_i)    (2a)
    p_φ(z|x) = Dir(z; α_0 · α_D)                (2b)

where the prior parameters are parameterised as neural networks of the input. For the Beta prior,

    [α; β] = elu(f_ω(x)) + 1                    (3)

where elu(·) + 1 is an element-wise activation function which returns a positive value (Clevert et al., 2016), and f_ω(·) is a nonlinear function with parameters ω; here we use a CNN. The Dirichlet prior uses a different parameterisation,

    α_0 = elu(f_{ω_0}(x)) + 1                   (4a)
    α_D = softmax(f_{ω_D}(x))                   (4b)

where α_0 is a positive-valued overall concentration parameter, used to scale all components in (2b), thus capturing overall sparsity, while α_D models the affinity to each channel.
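The two priors can be sketched with Python's standard library, using random.betavariate for the Beta case and normalised Gamma draws for the Dirichlet; the fixed input vector in the test stands in for the CNN outputs f_ω(x):

```python
import math
import random

def elu_plus_one(v):
    """elu(x) + 1: element-wise map onto (0, inf), ensuring valid
    (positive) Beta/Dirichlet parameters."""
    return [x + 1.0 if x >= 0.0 else math.exp(x) for x in v]

def sample_beta_prior(alpha, beta, rng):
    """z_i ~ Beta(alpha_i, beta_i): z lies in the hyper-cube [0, 1]^k."""
    return [rng.betavariate(a, b) for a, b in zip(alpha, beta)]

def sample_dirichlet_prior(concentration, rng):
    """z ~ Dir(concentration), sampled as normalised Gamma draws:
    z lies on the probability simplex."""
    g = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(g)
    return [x / total for x in g]
```

In the full model, the concentration vector passed to the Dirichlet sampler would be α_0 · α_D, with both parameters produced by the neural parameterisation above.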

Variational Inference
Using continuous latent variables, as described in Section 2.3, gives rise to intractable inference; for this reason we develop a variational inference method based on the variational auto-encoder (Kingma and Welling, 2014). Fitting the model involves maximising the evidence lower bound (ELBO),

    ELBO = E_{q_σ(z|x,y,d)}[ log p_θ(y|x, z) ] − λ KL( q_σ(z|x,y,d) || p_φ(z|x) )    (5)

where q_σ is the variational distribution, parameterised by σ, chosen to match the family of the prior (Beta or Dirichlet), and λ is a hyper-parameter controlling the weight of the KL term. The ELBO in (5) is maximised with respect to σ, φ and θ using stochastic gradient ascent, where the expectation term is approximated using a single sample, ẑ ∼ q_σ, which is used to compute the likelihood directly. Although it is not normally possible to backpropagate gradients through a sample, which is required to learn the variational parameters σ, this problem is usually sidestepped using a reparameterisation trick. However, this method only works for a limited range of distributions, most notably the Gaussian, and for this reason we use the implicit reparameterisation gradient method (Figurnov et al., 2018), which allows for inference with a variety of continuous distributions, including the Beta and Dirichlet. We give more details of the implicit reparameterisation method in Appendix A.2.

The variational distribution, q, is defined in an analogous way to the prior, p (see (2-4b)), i.e., using a neural network parameterisation for the distribution parameters. The key difference is that q conditions not only on x but also on the target label y and domain d. This is done by embedding both y and d, which are concatenated with a CNN encoding of x, and then transformed into the distribution parameters. Semi-supervised learning with respect to the domain is easily facilitated by setting d to the domain identifier when it is observed, and otherwise using a sentinel value d = UNK for domain-unsupervised instances.
The same trick is used for y, to allow for vanilla semi-supervised learning (with respect to target label). The use of y and d allows the inference network to learn to encode these two key variables into z, to encourage the latent variable, and thus model channels, to be informative of both the target label and the domain. This, in concert with the KL term in (5), ensures that the prior, p, must also learn to discriminate for domain and label, based solely on the input text, x.
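A single-sample estimate of the objective in (5) can be sketched as follows, for a toy fully-factorised Beta case; the closed-form KL is replaced by the single-sample estimate log q(ẑ) − log p(ẑ), and the neural likelihood by a callable, so this is a sketch of the training signal rather than the paper's implementation:

```python
import math
import random

def beta_logpdf(z, a, b):
    """Log-density of Beta(a, b) at z in (0, 1)."""
    return ((a - 1.0) * math.log(z) + (b - 1.0) * math.log(1.0 - z)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def elbo_single_sample(q_params, p_params, log_lik, lam, rng):
    """Single-sample ELBO estimate for Eq. (5): draw z ~ q, score the
    likelihood, and subtract lambda times a one-sample KL estimate."""
    z = [rng.betavariate(a, b) for a, b in q_params]
    log_q = sum(beta_logpdf(zi, a, b) for zi, (a, b) in zip(z, q_params))
    log_p = sum(beta_logpdf(zi, a, b) for zi, (a, b) in zip(z, p_params))
    return log_lik(z) - lam * (log_q - log_p)
```

When q equals p the KL term vanishes and the estimate reduces to the sampled log-likelihood, which is the sanity check used below.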
For inference at test time, we assume that only x is available as input, and accordingly the inference network cannot be used. Instead we generate a sample from the prior, ẑ ∼ p(z|x), which is then used to compute the maximum likelihood label, y = arg max_y p(y|x, ẑ). We also experimented with Monte Carlo methods for test inference, in order to reduce sampling variance, using: (a) the prior mean, z̄ = µ; (b) Monte Carlo averaging, ȳ = (1/m) Σ_i p(y|x, ẑ_i), using m = 100 samples from the prior; and (c) importance sampling (Glynn and Iglehart, 1989) to estimate p(y|x) based on sampling from the inference network, q. 4 None of the Monte Carlo methods showed a significant difference in predictive performance versus the single-sample technique, although they did show a very small reduction in variance over 10 runs. This is despite their being orders of magnitude slower, and we therefore use a single sample for test inference hereafter.
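The single-sample and Monte Carlo averaged procedures differ only in the number of samples m; a minimal sketch, with the prior sampler and likelihood passed in as callables standing in for the trained networks:

```python
def predict(sample_prior, likelihood, num_labels, m=1):
    """Test-time inference: average the class distribution p(y|x, z) over
    m samples z ~ p(z|x), then return the argmax label. With m = 1 this
    is the single-sample procedure used in the experiments; larger m
    gives the Monte Carlo averaging variant."""
    avg = [0.0] * num_labels
    for _ in range(m):
        z = sample_prior()
        probs = likelihood(z)
        avg = [a + p / m for a, p in zip(avg, probs)]
    return max(range(num_labels), key=lambda y: avg[y])
```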

Multi-domain Sentiment Analysis
To evaluate the proposed models, we first experiment with a multi-domain sentiment analysis dataset, focusing on out-of-domain evaluation where the test domain is unknown.
We derive our dataset from the Multi-Domain Sentiment Dataset v2.0 (Blitzer et al., 2007). 5 The task is to predict a binary sentiment label, i.e., positive vs. negative. The unprocessed dataset has more than 20 domains. For our purposes, we filter out domains with fewer than 1k labelled instances or fewer than 2k unlabelled instances, resulting in 13 domains in total.

4 Importance sampling estimates p(y|x) = E_q[p(y, z|x)/q(z|x, y, d)] for each setting of y using m = 100 samples from q, and then selects the maximising y. This is tractable in our setting as y is a discrete variable, e.g., a binary sentiment or multiclass language label.

5 From https://www.cs.jhu.edu/~mdredze/datasets/sentiment/.
To simulate the semi-supervised domain situation, we remove the domain attributions for one half of the labelled data, denoting them as domain-unlabelled data Y(x, y, ?). The other half are sentiment- and domain-labelled data F(x, y, d).
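The construction of F and Y can be sketched as follows (hypothetical function name; instances are (x, y, d) triples, and a stripped domain attribution becomes None):

```python
import random

def split_domain_labels(data, rng):
    """Simulate the domain semi-supervised setting: strip the domain
    attribution from half of the labelled instances (Y) and keep it for
    the other half (F)."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    Y = [(x, y, None) for (x, y, d) in shuffled[:half]]
    F = shuffled[half:]
    return F, Y
```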
We present a breakdown of the dataset in Table 1. 6 For evaluation, we hold out four domains, namely books ("B"), dvds ("D"), electronics ("E"), and kitchen & housewares ("K"), for comparability with previous work (Blitzer et al., 2007). Each domain has 1k test instances, and we split this data into dev and test with a ratio of 4:6. The dev dataset is used for hyper-parameter tuning and early stopping, 7 and we report accuracy results on test.

Baselines and Comparisons
For comparison, we use three baselines. The first is a single-channel CNN ("S-CNN"), trained jointly over all data instances in a single model, without domain-specific parameters. The second baseline is a multi-channel CNN ("M-CNN"), which expands the capacity of the S-CNN model (606k parameters) to match CSDA and DSDA (roughly 7.5m-8.3m parameters). Our third baseline is a multi-domain learning approach using adversarial learning for domain generation ("GEN"), the best-performing model of Li et al. (2018a) and state-of-the-art for unsupervised multi-domain adaptation over a comparable dataset. 8 We report results for their best-performing GEN +d+g model.

6 The dataset, along with the source code, can be found at https://github.com/lrank/Code_VariationalInference-Multidomain

7 This confers light supervision in the target domain. However, we would expect similar results were we to use held-out domains for development disjoint from those used for testing.

8 The dataset used in Li et al. (2018a) differs slightly in that it is also based off the Multi-Domain Sentiment Dataset v2.0, but uses slightly more training domains and a slightly different composition of training data. We retrain the model of the authors over our dataset, using their implementation.

Training Strategy
We provide the hyper-parameter details in Appendix A.1. In terms of training, we simulate two scenarios using two experimental configurations, as discussed above: (1) domain supervision; and (2) domain semi-supervision. For domain-supervised training, only F is used, which covers only 9 of the domains, and the test domain data is entirely unseen. For domain semi-supervised training, we use combinations of F and Y, noting that neither sub-corpus includes data from the target domains, and that Y is not labelled with domain, d. These simulate the setting where we have heterogeneous data which includes a lot of relevant data, but its metadata is inconsistent and thus cannot easily be modelled.
For λ in (5), the derivation of the ELBO implies that λ = 1, however other settings are often justified in practice (Alemi et al., 2018). Accordingly, we tried both annealing and fixed schedules, but found no consistent differences in end performance. We performed a grid search over fixed values λ = 10^a, a ∈ {−3, −2, −1, 0, 1}, and selected λ = 10^-1 based on development performance. We provide further analysis in the form of a sensitivity plot in Section 3.2. The latent domain size k for DSDA is set to the true number of training domains, k = D = 9. Note that, even for DSDA, we could use k ≠ D, which we explore in the F + Y supervision setting in Section 3.1.3. For CSDA we present the main results with k = 13, set to match the total number of domains across training and testing.

Table 2 reports the performance of different models under two training configurations: (1) with F + Y (domain semi-supervised learning); and (2) with F only (domain-supervised learning). In each case, we report the standard deviation based on 10 runs with different random seeds.
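The fixed-λ selection described above amounts to a simple grid search; a sketch, where train_and_eval is a hypothetical stand-in for training the model at a given λ and returning development accuracy:

```python
def grid_search_lambda(train_and_eval, exponents=(-3, -2, -1, 0, 1)):
    """Select the KL weight lambda = 10**a over a fixed grid of exponents,
    keeping the value with the best development accuracy."""
    best_lam, best_acc = None, float("-inf")
    for a in exponents:
        lam = 10.0 ** a
        acc = train_and_eval(lam)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc
```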

Results
Overall, domains B and D are more difficult than E and K, consistent with previous work. Comparing the two configurations, we see that when we use domain semi-supervised training (with the addition of Y), all models perform better, demonstrating the utility of domain semi-supervised learning when annotated data is limited. Comparing our discrete and continuous approaches (DSDA and CSDA, resp.), we see that CSDA consistently performs the best, outperforming the baselines by a substantial margin. In contrast, DSDA is disappointing, underperforming the baselines, and moreover showing no change in performance between the domain-supervised and the semi-supervised or unsupervised settings. Among the CSDA-based methods, all the distributions perform well, but the Dirichlet distribution performs the best overall, which we attribute to better modelling of the sparsity of domains, thus reducing the influence of uncertain and mixed domains. The best results are for domain semi-supervised learning (F + Y), which brings an increase in accuracy of about 2% over domain-supervised learning (F), consistently across the different types of model.

Analysis and Discussion
To better understand what the model learns, we focus on the CSDA model, using the Dirichlet distribution.
First, we consider the model capacity, in terms of the latent domain size, k. Figure 2 shows the impact of varying k. Note that the true number of domains is D = 13, comprising 9 training and 4 test domains. Setting k to roughly this value appears to be justified, in that the mean accuracy increases with k, and plateaus around k = 16. Interestingly, when k ≥ 32, the performance of CSDA with the Beta distribution drops, while performance for the Dirichlet remains high; indeed the Dirichlet is consistently superior, even at the extreme value of k = 2, although it does show improvement as k increases. Also observe that DSDA requires a large latent state inventory, supporting our argument for the statistical efficiency of continuous versus discrete latent variables.
Next, we consider the impact of using different combinations of F and Y. Using Y on its own is only a little worse than using only F, showing that target labels y are more important for learning than the domain d. That the Y configuration (fully domain-unsupervised training) still results in decent performance bodes well for application to very messy and heterogeneous datasets with no domain metadata.
Finally, we consider what is being learned by the model, in terms of how it uses the k-dimensional latent variable for different types of data. We visualise the learned representations, showing points for each domain plotted in a 2D t-SNE plot (Maaten and Hinton, 2008) in Figure 3. Notice that each domain is split into two clusters, representing positive (×) and negative (•) instances within that domain. Among the test domains, B (books) and D (dvds) are clustered close together but are still clearly separated, which is encouraging given the close relation between these two media. The other two, E (electronics) and K (kitchen & housewares), are mixed together and intermingled with other domains. Overall, across all domains, the APPAREL cluster is quite distinct, while VIDEO and MUSIC are highly associated with D, and part of the cluster for MAGAZINES is close to B; all of these make sense intuitively, given similarities between the respective products. E is related to CAMERA and GAMES, while K is most closely connected to HEALTH and SPORTS.
To obtain a better understanding of what is being encoded in the latent variable, and how this is affected by the setting of λ, we learn simple diagnostic classifiers to predict the sentiment label y and domain label d given only z as input. To do so, we first train our model over the training set, and record samples of z from the inference network. We then partition the training set, using 70% to learn linear logistic regression classifiers to predict y and d, and the remaining 30% for evaluation. Figure 4 shows the prediction accuracy, averaged over three runs, each with different z samples. Clearly, very small λ ≤ 10^-2 leads to almost perfect sentiment label accuracy, which is evidence of overfitting by using the latent variable to encode the response variable. For λ ≥ 10^-1 the sentiment accuracy is still above chance, as expected, but is more stable. For the domain label d, the predictive accuracy is also above chance, albeit to a lesser extent, and shows a similar downward trend. At the setting λ = 0.1, used in the earlier experiments, this shows that the latent variable captures substantial sentiment knowledge and some domain knowledge, as observed in Figure 3.

In terms of the time required for training, a single epoch took about 25 minutes for the CSDA method using the default settings, and a similar time for DSDA and M-CNN. The runtime increases sub-linearly with increasing latent size k.
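The diagnostic probe can be sketched with a tiny logistic-regression implementation (a stand-in for the paper's linear classifiers; the data in the test below is synthetic, not samples from the model):

```python
import math

def _sigmoid(t):
    # Clamp to avoid overflow in math.exp for extreme logits.
    t = max(-30.0, min(30.0, t))
    return 1.0 / (1.0 + math.exp(-t))

def train_probe(xs, ys, epochs=200, lr=0.5):
    """Fit a linear logistic-regression probe mapping latent samples z (xs)
    to a binary label (ys) by plain gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, xs, ys):
    """Fraction of instances whose thresholded probe prediction matches y."""
    preds = [_sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5
             for x in xs]
    return sum(int(p == bool(y)) for p, y in zip(preds, ys)) / len(xs)
```

A high probe accuracy from z alone indicates that the latent variable encodes that label, which is how Figure 4 diagnoses overfitting at small λ.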

Language Identification
To further demonstrate our approaches, we evaluate our models on a second task, language identification (LangID; Jauhiainen et al., 2018).
For data processing, we use 5 training sets from 5 different domains covering 97 languages, following the setup of Lui and Baldwin (2011). We evaluate accuracy over 7 holdout benchmarks: EUROGOV, TCL, and WIKIPEDIA from Baldwin and Lui (2010), EMEA (Tiedemann, 2009), EUROPARL (Koehn, 2005), TBE (Tromp and Pechenizkiy, 2011), and TSC (Carter et al., 2013). Differently from the sentiment task, here we evaluate our methods using the full dataset, but with two configurations: (1) domain-unsupervised, where all instances have labels but no domain (denoted Y); and (2) domain-supervised learning, where all instances have both labels and domain (F).

Table 4 shows the performance of different models over the 7 holdout benchmarks, along with averaged scores. We also report the results of GEN, the best model from Li et al. (2018a), and a state-of-the-art off-the-shelf LangID tool: LANGID.PY (Lui and Baldwin, 2012). Note that both S-CNN and M-CNN are domain-unsupervised methods. In terms of results, overall, both of our CSDA models consistently outperform all baseline models. Comparing the different CSDA variants, Beta vs. Dirichlet, both perform similarly across the LangID tasks. Furthermore, CSDA outperforms the state of the art in terms of average scores. Interestingly, the two training configurations show that domain knowledge F provides a small performance boost for CSDA, but does not help DSDA. Above all, the LangID results confirm the effectiveness of our proposed approaches.

Related Work

Traditional domain adaptation methods typically assume knowledge of the target domain (Blitzer et al., 2007; Glorot et al., 2011). Adversarial learning methods have been proposed for learning robust domain-independent representations, which can capture domain knowledge through semi-supervised learning (Ganin et al., 2016).
Multi-domain adaptation uses training data from more than one training domain. Approaches include feature augmentation methods (Daumé III, 2007) and analogous neural models (Joshi et al., 2012; Kim et al., 2016), as well as attention-based and hierarchical methods (Li et al., 2018b). These works assume the 'oracle' source domain is known when transferring; in contrast, we do not require such an oracle. Adversarial training methods have been employed to learn robust domain-generalised representations (Liu et al., 2016). Li et al. (2018a) considered the case of the model having no access to the target domain, using adversarial learning to generate domain-general representations by cross-comparison between source domains.
The other important component of this work is variational inference ("VI"), a method from machine learning that approximates probability densities through optimisation. The idea of the variational auto-encoder has been applied to language generation (Bowman et al., 2016; Kim et al., 2018; Miao et al., 2017; Zhou and Neubig, 2017; Zhang et al., 2016) and machine translation (Shah and Barber, 2018; Eikema and Aziz, 2018), but not in the context of semi-supervised domain adaptation.

Conclusion
In this paper, we have proposed two models, DSDA and CSDA, for multi-domain learning, which use a graphical model with a latent variable to represent the domain. We propose models with a discrete latent variable, and with a continuous vector-valued latent variable, which we model with Beta or Dirichlet priors. For training, we adopt a variational inference technique based on the variational auto-encoder. In empirical evaluation over a multi-domain sentiment dataset and seven language identification benchmarks, our models outperform strong baselines across varying data conditions, including a setting where no target domain data is provided. Our proposed models have broad utility across NLP applications over heterogeneous corpora.