SHAPED: Shared-Private Encoder-Decoder for Text Style Adaptation

Supervised training of abstractive language generation models results in learning conditional probabilities over language sequences based on the supervised training signal. When the training signal contains a variety of writing styles, such models may end up learning an ‘average’ style that is directly influenced by the training data make-up and cannot be controlled by the needs of an application. We describe a family of model architectures capable of capturing both generic language characteristics via shared model parameters, as well as particular style characteristics via private model parameters. Such models are able to generate language according to a specific learned style, while still taking advantage of their power to model generic language phenomena. Furthermore, we describe an extension that uses a mixture of output distributions from all learned styles to perform on-the-fly style adaptation based on the textual input alone. Experimentally, we find that the proposed models consistently outperform models that encapsulate single-style or average-style language generation capabilities.


Introduction
Encoder-decoder models have recently pushed forward the state-of-the-art performance on a variety of language generation tasks, including machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), text summarization (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017), dialog systems (Li et al., 2016; Asghar et al., 2017), and image captioning (Xu et al., 2015; Ranzato et al., 2015). This framework consists of an encoder that reads the input data and encodes it as a sequence of vectors, which is in turn used by a decoder to generate another sequence of vectors used to produce output symbols step by step. The prevalent approach to training such a model is to update all the model parameters using all the examples in the training data (over multiple epochs). This is a reasonable approach, under the assumption that we are modeling a single underlying distribution in the data. However, in many applications and for many natural language datasets, there exist multiple underlying distributions, characterizing a variety of language styles. For instance, the widely-used Gigaword dataset (Graff and Cieri, 2003) consists of a collection of articles written by various publishers (The New York Times, Agence France Presse, Xinhua News, etc.), each with its own style characteristics. Training a model's parameters on all the training examples results in an averaging effect across style characteristics, which may lower the quality of the outputs; additionally, this averaging effect may be completely undesirable for applications that require a level of control over the output style. At the opposite end of the spectrum, one can choose to train one independent model per underlying distribution (assuming we have the appropriate signals for identifying them at training time). This approach misses the opportunity to exploit common properties shared by these distributions (e.g., generic characteristics of a language, such as noun-adjective position), and leads to models that are under-trained due to limited data availability per distribution.
* Work done as an intern at Google AI.
In order to address these issues, we propose a novel neural architecture called SHAPED (shared-private encoder-decoder). This architecture has both shared encoder/decoder parameters that are updated based on all the training examples, as well as private encoder/decoder parameters that are updated using only examples from their corresponding underlying training distributions. In addition to learning different parametrizations for the shared model and the private models, we jointly learn a classifier to estimate the probability of each example belonging to each of the underlying training distributions. In such a setting, the shared parameters ('shared model') are expected to learn characteristics shared by the entire set of training examples (i.e., language generic), whereas each private parameter set ('private model') learns particular characteristics (i.e., style specific) of its corresponding training distribution. At the same time, the classifier is expected to learn a probability distribution over the labels used to identify the underlying distributions present in the input data. At test time, there are two possible scenarios. In the first one, the input signal explicitly contains information about the underlying distribution (e.g., the publisher's identity). In this case, we feed the data into the shared model and also the corresponding private model, and perform sequence generation based on a concatenation of their vector outputs; we refer to this model as the SHAPED model. In a second scenario, the information about the underlying distribution is either not available, or it refers to a distribution that was not seen during training. 
In this case, we feed the data into the shared model and all the private models; the output distribution of the symbols of the decoding sequence is estimated using a mixture of distributions from all the decoders, weighted according to the classifier's estimates for that particular example; we refer to this model as the Mix-SHAPED model.
We test our models on the headline-generation task based on the aforementioned Gigaword dataset. When the publisher's identity is presented as part of the input, we show that the SHAPED model significantly surpasses the performance of the shared encoder-decoder baseline, as well as the performance of private models (where one individual, per-publisher model is trained for each in-domain style). When the publisher's identity is not presented as part of the input (i.e., not presented at run-time but revealed at evaluation-time for measurement purposes), we show that the Mix-SHAPED model exhibits a high level of classification accuracy based on textual inputs alone (accuracy percentage in the 80s overall, varying by individual publisher), while its generation accuracy still surpasses the performance of the baseline models. Finally, when the publisher's identity is unknown to the model (i.e., a publisher that was not part of the training dataset), we show that the Mix-SHAPED model performance far surpasses the shared model performance, due to the ability of the Mix-SHAPED model to perform on-the-fly adaptation of output style. This feat comes from our model's ability to perform two distinct tasks: match the incoming, previously-unseen input style to existing styles learned at training time, and use the correlations learned at training time between input and output style characteristics to generate style-appropriate token sequences.

Encoder-Decoder Models for Structured Output Prediction
Encoder-decoder architectures have been successfully applied to a variety of structure prediction tasks recently. Tasks for which such architectures have achieved state-of-the-art results include machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), automatic text summarization (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; Paulus et al., 2017; Nema et al., 2017), sentence simplification (Filippova et al., 2015; Zhang and Lapata, 2017), dialog systems (Li et al., 2016; Asghar et al., 2017), and image captioning (Xu et al., 2015; Ranzato et al., 2015). By far the most used implementation of such architectures is based on the original sequence-to-sequence model (Sutskever et al., 2014), augmented with its attention-based extension (Bahdanau et al., 2015). Although our SHAPED and Mix-SHAPED model formulations do not depend on a particular architecture implementation, we do make use of the Bahdanau et al. (2015) model to instantiate our models.

Domain Adaptation for Neural Network Models
One general approach to domain adaptation for natural language tasks is to perform data/feature augmentation that represents inputs as both general and domain-dependent data, as originally proposed by Daumé III (2009), and ported to neural models by Kim et al. (2016). For computer vision tasks, a line of work related to our approach has been proposed by Bousmalis et al. (2016) using what they call domain separation networks. As a tool for studying unsupervised domain adaptation for image recognition tasks, their proposal uses CNNs for encoding an image into a feature representation, and also for reconstructing the input sample. It also makes use of a private encoder for each domain, and a shared encoder for both the source and the target domain. The approach we take in this paper shares this idea of model parametrization according to the domain/style, but goes further with the Mix-SHAPED model, performing on-the-fly adaptation of the model outputs. Other CNN-based domain adaptation methods for object recognition tasks are presented in (Long et al., 2016; Chopra et al., 2013; Tzeng et al., 2015; Sener et al., 2016).
For NLP tasks, Peng and Dredze (2017) take a multi-task approach to domain adaptation and sequence tagging. They use a shared encoder to represent instances from all of the domains, and use a domain projection layer to project the shared layer into a domain-specific space. They only consider the supervised domain-adaptation case, in which labeled training data exists for the target domain. Glorot et al. (2011) use auto-encoders for learning a high-level feature extraction across domains for sentiment analysis, while other work employs auto-encoders to directly transfer examples across different domains, also for sentiment analysis. Hua and Wang (2017) perform an experimental analysis on domain adaptation for neural abstractive summarization.
A common requirement of all the methods in the related work described above is access to the (unlabeled) target domain data, in order to learn a domain-invariant representation across source and target domains. In contrast, our Mix-SHAPED model does not need access to a target domain or style at training time, and instead performs the adaptation on-the-fly, according to the specifics of the input data and the correlations learned at training time between available input and output style characteristics. As such, it is a more general approach, which allows adaptation for a much larger set of target styles, under the weaker assumption that there exist one or more styles present in the training data that can act as representative underlying distributions.

Model Architecture
Generally speaking, a standard encoder-decoder model has two components: an encoder that takes as input a sequence of symbols $x = (x_1, x_2, \ldots, x_{T_x})$ and encodes them into a set of vectors $H = (h_1, h_2, \ldots, h_{T_x})$,

\[ H = f_{enc}(x), \tag{1} \]

where $f_{enc}$ is the computation unit in the encoder; and a decoder that generates output symbols at each time stamp $t$, conditioned on $H$ as well as the decoder inputs $y_{1:t-1}$,

\[ s_t = f_{dec}(y_{1:t-1}, H), \tag{2} \]

where $f_{dec}$ is the computation unit in the decoder.
Instantiations of this framework include the widely-used attention-based sequence-to-sequence model (Bahdanau et al., 2015), in which $f_{enc}$ and $f_{dec}$ are implemented by an RNN architecture using LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Chung et al., 2014) units. A more recent instantiation of this architecture is the Transformer model (Vaswani et al., 2017), built using self-attention layers.

SHAPED: Shared-private encoder-decoder
The abstract encoder-decoder model described above is usually trained over all examples in the training data. We call such a model a shared encoder-decoder model, because the model parameters are shared across all training and test instances. Formally, the shared encoder-decoder consists of the computation units $f^s_{enc}$ and $f^s_{dec}$. Given an instance $x$, it generates a sequence of vectors $S^s = (s^s_1, \ldots, s^s_T)$ by:

\[ H^s = f^s_{enc}(x), \qquad s^s_t = f^s_{dec}(y_{1:t-1}, H^s). \tag{3} \]

The drawback of the shared encoder-decoder is that it fails to account for particular properties of each style that may be present in the data. In order to capture such particular style characteristics, a straightforward solution is to train a private model for each style. Assuming a style set $D = \{D_1, D_2, \ldots, D_{|D|}\}$, such a solution implies that each style has its own private encoder computation unit and decoder computation unit. At both training and testing time, each private encoder and decoder only process instances that belong to their own style. Given an instance along with its style $(x, z)$, where $z \in \{1, \ldots, |D|\}$, the private encoder-decoder generates a sequence of vectors $S^z = (s^z_1, \ldots, s^z_T)$ by:

\[ H^z = f^z_{enc}(x), \qquad s^z_t = f^z_{dec}(y_{1:t-1}, H^z). \tag{4} \]

Figure 1: Illustration of the SHAPED model using two styles $D_1$ and $D_2$. $D_1$ articles pass through the private encoder $f^1_{enc}$ and decoder $f^1_{dec}$. $D_2$ articles pass through the private encoder $f^2_{enc}$ and decoder $f^2_{dec}$. Both of them also go through the shared encoder $f^s_{enc}$ and decoder $f^s_{dec}$.
Although the private encoder/decoder models do preserve style characteristics, they fail to take into account the common language features shared across styles. Furthermore, since each style is represented by a subset of the entire training set, such private models may end up under-trained, due to the limited number of available data examples. In order to efficiently capture both common and unique features of data with different styles, we propose the SHAPED model. In the SHAPED model, each data-point goes through both the shared encoder-decoder and its corresponding private encoder-decoder. At each step of the decoder, the outputs from the private and shared decoders are concatenated to form a new vector:

\[ s^{rz}_t = [s^s_t, s^z_t], \tag{5} \]

which contains both private features for style $z$ and shared features induced from other styles, as illustrated in Fig. 1. The output symbol distribution over tokens $o_t \in V$ (where $V$ is the output vocabulary) at step $t$ is given by:

\[ p(o_t \mid x, y_{1:t-1}, z) = \mathrm{softmax}(g(s^{rz}_t)), \tag{6} \]

where $g$ is a multi-layer feed-forward network that maps $s^{rz}_t$ to a vector of size $|V|$.
Given $N$ training examples $(x^{(1)}, y^{(1)}, z^{(1)}), \ldots, (x^{(N)}, y^{(N)}, z^{(N)})$, the conditional probability of the output $y^{(i)}$ given article $x^{(i)}$ and its style $z^{(i)} \in \{1, \ldots, |D|\}$ is:

\[ p(y^{(i)} \mid x^{(i)}, z^{(i)}) = \prod_{t} p(o_t = y^{(i)}_t \mid x^{(i)}, y^{(i)}_{1:t-1}, z^{(i)}). \tag{7} \]

At inference time, given an article $x$ with style $z$, we feed $x$ into $f^s_{enc}$, $f^s_{dec}$, $f^z_{enc}$, $f^z_{dec}$ (Eq. 3-4) and obtain symbol distributions at each step $t$ using Eq. 6. We sample from the distribution and obtain a symbol $o_t$, which is used as the estimated $y_t$ and fed to the next steps.
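To make the SHAPED decoding step concrete, the following is a minimal pure-Python sketch of how shared and private decoder states could be concatenated and mapped to a token distribution (as referenced by Eq. 6). It is an illustration under stated assumptions, not the implementation used in our experiments: the names `shaped_step`, `linear`, and `softmax` are hypothetical, the decoder states are toy values, and a single affine layer stands in for the multi-layer feed-forward network g.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, vec):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def shaped_step(shared_state, private_state, weights, bias):
    """Concatenate shared and private decoder states, then map the result
    to a distribution over the vocabulary.

    Assumption: one affine layer replaces the multi-layer network g.
    """
    combined = shared_state + private_state  # list concatenation
    return softmax(linear(weights, bias, combined))

# Toy sizes: 2-dim shared state, 2-dim private state, 3-token vocabulary.
shared_state = [0.5, -1.0]
private_state = [1.5, 0.25]
weights = [[0.1, 0.2, 0.3, 0.4],
           [0.0, -0.5, 0.5, 0.0],
           [1.0, 0.0, 0.0, -1.0]]
bias = [0.0, 0.1, -0.1]
token_probs = shaped_step(shared_state, private_state, weights, bias)
```

The output layer sees a vector twice the size of a single decoder state, so style-specific and generic features both contribute to every token decision.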

The Mix-SHAPED Model
One limitation of the above model is that it can only handle test data containing an explicit style label from $D = \{D_1, D_2, \ldots, D_{|D|}\}$. However, it is frequently the case that, at test time, the style label is not present as part of the input, or that the input style is not part of the modeled set $D$.
We treat both of these cases similarly, as a case of modeling an unknown style. We first describe our treatment of such a case at run-time. We use a latent random variable $z \in \{1, \ldots, |D|\}$ to denote the underlying style of a given input. When generating a token at step $t$, the output token distribution takes the form of a mixture of SHAPED (Mix-SHAPED) model outputs:

\[ p(o_t \mid x, y_{1:t-1}) = \sum_{d=1}^{|D|} p(o_t \mid x, y_{1:t-1}, z = d)\, p(z = d \mid x), \tag{8} \]

where $p(o_t \mid x, y_{1:t-1}, z = d)$ is the output symbol distribution of SHAPED decoder $d$, evaluated as in Eq. 6. Fig. 2 contains an illustration of such a model. In this formulation, $p(z \mid x)$ denotes the style conditional probability distribution from a trainable style classifier.
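The joint data likelihood of target sequence $y$ and target domain label $z$ for input sequence $x$ is:

\[ p(y, z \mid x) = p(y \mid x, z)\, p(z \mid x). \tag{9} \]

Training the Mix-SHAPED model involves minimizing a loss function that combines the negative log-likelihood of the style labels and the negative log-likelihood of the symbol sequences (see the model in Fig. 3):

\[ \mathrm{Loss} = -\sum_{i=1}^{N} \log p(z^{(i)} \mid x^{(i)}) - \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}, z^{(i)}). \tag{10} \]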
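The combined training objective can be sketched as follows. This is a hedged illustration only: the function name `mix_shaped_loss` and the dictionary keys are hypothetical, the probabilities are toy values, and the two negative log-likelihood terms are assumed to be summed with equal weight.

```python
import math

def mix_shaped_loss(examples):
    """Sum of the style-label NLL and the symbol-sequence NLL over a batch.

    Each example is a dict with hypothetical keys:
      'style_probs': classifier output p(z | x), one entry per style
      'style':       index of the gold style label
      'token_probs': probability assigned to each gold target token,
                     computed with the gold style's SHAPED decoder
    Assumption: the two terms are added with equal weight.
    """
    style_nll = 0.0
    symbol_nll = 0.0
    for ex in examples:
        style_nll -= math.log(ex["style_probs"][ex["style"]])
        symbol_nll -= sum(math.log(p) for p in ex["token_probs"])
    return style_nll + symbol_nll

batch = [
    {"style_probs": [0.7, 0.1, 0.1, 0.1], "style": 0, "token_probs": [0.5, 0.25]},
    {"style_probs": [0.2, 0.6, 0.1, 0.1], "style": 1, "token_probs": [0.8]},
]
loss = mix_shaped_loss(batch)
```

Because the classifier term is trained with the gold style label, the style classifier is learned as an auxiliary task alongside generation.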
Figure 2: Decoding data with unknown style using a Mix-SHAPED model. The data is run through all encoders and decoders. The output of private encoders is fed into a classifier that estimates style distribution. The output symbol distribution is a mixture over all decoder outputs.
At run-time, if the style d of the input is available and d ∈ D, we decode the sequence using Eq. 6. This also corresponds to the case p(z = d|x) = 1 and 0 for all other styles, and reduces Eq. 8 to Eq. 6. If the style of the input is unknown (or known, but with d ∉ D), we decode the sequence using Eq. 8, in which case the mixture over SHAPED models given by p(z|x) approximates the desired output style.
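The run-time behavior described above can be illustrated with a small sketch of the mixture over per-style token distributions. The function name `mix_output` and the toy distributions are assumptions made for illustration; in the actual model the per-style distributions come from the SHAPED decoders and the weights from the style classifier.

```python
def mix_output(decoder_probs, style_probs):
    """Mixture of per-style token distributions.

    decoder_probs: one token distribution per style decoder
    style_probs:   mixture weights p(z = d | x), one per style
    """
    vocab_size = len(decoder_probs[0])
    return [
        sum(w * probs[v] for w, probs in zip(style_probs, decoder_probs))
        for v in range(vocab_size)
    ]

# Two toy styles over a 3-token vocabulary.
decoder_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]

# Unknown style: weights come from the (here hard-coded) classifier estimate.
mixed = mix_output(decoder_probs, [0.25, 0.75])

# Known style: a one-hot weight vector recovers that style's own distribution.
known_style = mix_output(decoder_probs, [1.0, 0.0])
```

With one-hot weights the mixture collapses to a single decoder's output, which is the reduction from the mixture model to the labeled-style case.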

Model Instantiation
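As an implementation of the encoder-decoder model, we use the attention-based sequence-to-sequence model from (Bahdanau et al., 2015), with an RNN architecture using GRU units (Chung et al., 2014). The input token sequences are first projected into an embedding space via an embedding matrix $E$, resulting in a sequence of vectors as input representations.
The encoders map these representations to a sequence of hidden state vectors; we write $h^{rz}_j$ for the combined shared and private encoder state at input position $j$. We apply the attention mechanism on $s^{rz}_t$, using attention scores

\[ q^{rz}_{tj} = v_a^{\top} \tanh(W_a h^{rz}_j + U_a s^{rz}_t), \tag{11} \]

which are normalized to a probability distribution:

\[ \alpha^{rz}_{tj} = \frac{\exp(q^{rz}_{tj})}{\sum_{k=1}^{T_x} \exp(q^{rz}_{tk})}. \tag{12} \]

Context vectors are computed using normalized attention weights:

\[ c^{rz}_t = \sum_{j=1}^{T_x} \alpha^{rz}_{tj}\, h^{rz}_j. \tag{13} \]

Given the context vector and the hidden state vectors, the symbol distribution at step $t$ is:

\[ p(o_t \mid x, y_{1:t-1}, z) = \mathrm{softmax}(g([c^{rz}_t, s^{rz}_t])). \tag{14} \]

The attention parameters $W_a$, $U_a$, and $v_a$, as well as the embedding matrix $E$ and vocabulary $V$, are shared by all encoders and decoders. We use Eq. 14 to calculate the symbol loss in Eq. 10.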
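The attention computation can be sketched in plain Python as additive attention in the style of Bahdanau et al. (2015). This is an illustrative sketch under assumptions: the helper names are hypothetical, the assignment of `W_a` to encoder states and `U_a` to the decoder state is assumed, and the matrices and states are toy values.

```python
import math

def matvec(matrix, vec):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def additive_attention(enc_states, dec_state, W_a, U_a, v_a):
    """Return (weights, context) for one decoder step.

    Scores use v_a . tanh(W_a h_j + U_a s_t); which matrix multiplies which
    state is an assumption of this sketch.
    """
    dec_proj = matvec(U_a, dec_state)
    scores = []
    for h in enc_states:
        enc_proj = matvec(W_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(enc_proj, dec_proj)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    # Normalize the scores to a probability distribution over input positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of encoder states.
    dim = len(enc_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, enc_states))
               for i in range(dim)]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [0.5, -0.5]
weights, context = additive_attention(enc_states, dec_state,
                                      identity, identity, [1.0, 1.0])
```

Sharing the attention parameters across all encoders and decoders keeps the number of style-specific parameters limited to the recurrent cells themselves.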

Quantitative Experiments
We perform a battery of quantitative experiments, designed to answer several main questions: 1) Does the proposed model improve generation performance over alternative approaches? 2) Can a style classifier built using an auxiliary loss provide a reliable estimate of text style? 3) In the case of unknown style, does the Mix-SHAPED model improve generation performance over alternative approaches? 4) To what extent do our models capture style characteristics as opposed to, say, content characteristics?
We perform our experiments using text summarization as the main task. More precisely, we train and evaluate headline generation models using the publicly-available Gigaword dataset (Graff and Cieri, 2003; Napoles et al., 2012).

Headline-generation Setup
The Gigaword dataset contains news articles from seven publishers: Agence France-Presse (AFP), Associated Press Worldstream (APW), Central News Agency of Taiwan (CNA), Los Angeles Times/Washington Post Newswire Service (LTW), New York Times (NYT), Xinhua News Agency (XIN), and Washington Post/Bloomberg Newswire Service (WPB). We pre-process this dataset in the same way as in (Rush et al., 2015), which results in articles with average length 31.4 words, and headlines with average length 8.5 words.
We consider the publisher identity as a proxy for style, and choose to model as in-domain styles the set D = {AFP, APW, NYT, XIN}, while holding out CNA and LTW for out-of-domain style testing. This results in a training set containing the following number of (article, headline) instances: 993,584 AFP, 1,493,758 APW, 578,259 NYT, and 946,322 XIN. For the test set, we sample a total number of 10,000 in-domain examples from the original Gigaword test dataset, which include 2,886 AFP, 2,832 APW, 1,610 NYT, and 2,012 XIN. For out-of-domain testing, we randomly sample 10,000 LTW and 10,000 CNA test data examples. We remove the WPB articles due to their small number of instances.

Experimental Setup
We compare the following models:
• A Shared encoder-decoder model (S) trained on all styles in D;
• A suite of Private encoder-decoder models (P), each one trained on a particular style from D = {AFP, APW, NYT, XIN};
• A SHAPED model (SP) trained on all styles in D; at test time, the style of test data is provided to the model; the article is only run through its style-specific private network and shared network (style classifier is not needed);
• A Mix-SHAPED model (M-SP) trained on all styles in D; at test time, the style of the article is not provided to the model; the output is computed using the mixture model, with the estimated style probabilities from the style classifier used as weights.
When testing on the out-of-domain styles CNA/LTW, we only compare the Shared (S) model with the Mix-SHAPED (M-SP) model, as the others cannot properly handle this scenario.
As hyper-parameters for the model instantiation, we used 500-dimension word embeddings, and a three-layer, 500-dimension GRU-cell RNN architecture; the encoder was instantiated as a bidirectional RNN. The lengths of the input and output sequences were truncated to 40 and 20 tokens, respectively. All the models were optimized using Adagrad (Duchi et al., 2011), with an initial learning rate of 0.01. The training procedure was done over mini-batches of size 128, and the updates were done asynchronously across 40 workers for 5M steps. The encoder/decoder word embedding and the output projection matrices were tied to minimize the number of parameters. To avoid the slowness from the softmax operator over large vocabulary sizes, and also mitigate the impact of out-of-vocabulary tokens, we applied a subtokenization method, which invertibly transforms a native token into a sequence of subtokens from a limited vocabulary (here set to 32K).
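For reference, the hyper-parameters listed above can be gathered into a single configuration object. The sketch below is only a convenience summary: the class name `TrainingConfig`, its field names, and the `truncate` helper are hypothetical, and the exact subtoken vocabulary size behind "32K" is assumed to be 32,000.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hyper-parameters as stated in the text; field names are illustrative."""
    embedding_dim: int = 500
    rnn_layers: int = 3
    rnn_hidden_dim: int = 500
    rnn_cell: str = "gru"
    bidirectional_encoder: bool = True
    max_input_tokens: int = 40
    max_output_tokens: int = 20
    optimizer: str = "adagrad"
    initial_learning_rate: float = 0.01
    batch_size: int = 128
    async_workers: int = 40
    train_steps: int = 5_000_000
    tie_embeddings_and_output_projection: bool = True
    subtoken_vocab_size: int = 32_000  # text says "32K"; exact value assumed

def truncate(tokens, limit):
    """Keep at most `limit` tokens, as done for inputs and outputs."""
    return tokens[:limit]

config = TrainingConfig()
article_tokens = truncate(list(range(100)), config.max_input_tokens)
headline_tokens = truncate(list(range(100)), config.max_output_tokens)
```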

Comparison with Previous Work
In the next section, we report our main results using the in-domain and out-of-domain (w.r.t. the selected publisher styles) test sets described above, since these test sets have a balanced publisher style frequency that allows us to measure the impact of our style-adaptation models. However, we also report here the performance of our Shared (S) baseline model (with the above hyper-parameters) on the original 2K test set used in (Rush et al., 2015). On that test set, our S model obtains 30.13 F1 ROUGE-L score, compared to 28.34 ROUGE-L obtained by the ABS+ model (Rush et al., 2015), and 30.64 ROUGE-L obtained by the words-lvt2k-1sent model (Nallapati et al., 2016). This comparison indicates that our S model is a competitive baseline, making the comparisons against the SP and M-SP models meaningful when using our in-domain and out-of-domain test sets.
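As a reminder of what the F1 ROUGE-L metric measures, the following is a simplified sketch based on the longest common subsequence (LCS) between a candidate and a reference headline. It is not the official ROUGE toolkit: it uses whitespace tokenization, no stemming, and a single reference, and it returns a value in [0, 1] rather than a percentage. The function names and the candidate headline are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified sentence-level ROUGE-L F1 (balanced precision/recall)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1(
    "china urges flexibility on deadline",
    "china urges flexibility as nkorea deadline approaches",
)
```

In this example the LCS has 4 tokens ("china urges flexibility deadline"), giving precision 4/5, recall 4/7, and F1 of 2/3.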

Main Results
The ROUGE scores for the in-domain testing data are reported in Table 1 (over the combined AFP/APW/XIN/NYT test set) and Fig. 4a (over individual-style test sets). The numbers indicate that the SP and M-SP models consistently outperform the S and P models, supporting the conclusion that the S model loses important characteristics due to averaging effects, while the P models miss the opportunity to efficiently exploit the training data. Additionally, the performance of SP is consistently better than M-SP in this setting, which indicates that the style label is helpful. As shown in Fig. 4b, the style classifier achieves around 80% accuracy overall in predicting the style under the M-SP model, with some styles (e.g., XIN) being easier to predict than others. The performance of the classifier is directly reflected in the quantitative difference between the SP and M-SP models on individual-style test sets (see Fig. 4a, where the XIN style has the smallest difference between the two models).
The evaluation results for the out-of-domain scenario are reported in Table 2. The numbers indicate that the M-SP model significantly outperforms the S model, supporting the conclusion that the M-SP model is capable of performing on-the-fly adaptation of output style. This conclusion is further strengthened by the style probability distributions shown in Fig. 5: they indicate that, for the out-of-domain CNA style, the output mixture is heavily weighted towards the XIN style (0.6 of the probability mass), while for the LTW style, the output mixture weights heavily the NYT style (0.72 of the probability mass). This result is likely to reflect true style characteristics shared by these publishers, since both CNA and XIN are produced by Chinese news agencies (from Taiwan and mainland China, respectively), while both LTW and NYT are U.S. news agencies owned by the same media corporation.

Experiment Variants
Model capacity. In order to rule out the possibility that the improved performance of the SP model is due simply to an increased model size compared to the S model, we perform an experiment in which we triple the size of the GRU cell dimensions for the S model. However, we find no significant performance difference compared to the original S model.
Style embedding. A competitive approach to modeling different styles is to directly encode the style information into the embedding space. In one such approach, the style label is converted into a one-hot vector and is concatenated with the word embedding at each time step in the S model. The outputs of this model reach 36.68 ROUGE-L, slightly higher than the baseline S model, but significantly lower than the SP model performance (37.52 ROUGE-L).
Another style embedding approach is to augment the S model with continuous trainable style embeddings for each predefined style label, similar to (Ammar et al., 2016). The resulting outputs achieve 37.2 ROUGE-L, which is better than the S model with one-hot style embedding, but still worse than the SP method (statistically significant at p-value=0.025 using a paired t-test). However, neither of these approaches applies to the cases when the style is out-of-domain or unknown during testing. In contrast, such cases are handled naturally by the proposed M-SP model.
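The one-hot style embedding baseline can be illustrated with a short sketch in which the style indicator is appended to every word embedding of an input sequence. The helper names and the 2-dimensional toy embeddings are assumptions made for illustration only.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def add_style_features(word_embeddings, style_index, num_styles):
    """Append the same one-hot style vector to each word embedding."""
    style_vec = one_hot(style_index, num_styles)
    return [emb + style_vec for emb in word_embeddings]

STYLES = ["AFP", "APW", "NYT", "XIN"]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy 2-dim embeddings
augmented = add_style_features(embeddings, STYLES.index("NYT"), len(STYLES))
```

The sketch also makes the limitation visible: the indicator requires a known label from the predefined style set, so it offers no obvious input for an unknown or out-of-domain style.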
Ensemble model. Another question is whether the SP model simply benefits from ensembling multiple models rather than from style adaptation. To answer this question, we apply a uniform mixture over the private model outputs along with the shared model output, rather than using the learnt probability distribution from the style classifier. The ROUGE-1/2/L scores are 39.9/19.7/37.0. They are higher than those of the S model but significantly lower than those of the SP model and the M-SP model (p-value 0.016). This result confirms that the information that the style classifier encodes is beneficial, and leads to improved performance.

Style vs. Content
Previous experiments indicate that the SP and M-SP models have superior generation accuracy, but it is unclear to what extent the difference comes from improved modeling of style versus modeling of content. To clarify this issue, we performed an experiment in which we replace the named entities appearing in both article and headline with corresponding entity tags, in effect suppressing almost completely any content signal. For instance, given an input such as "China called Thursday on the parties involved in talks on North Korea's nuclear program to show flexibility as a deadline for implementing the first steps of a breakthrough deal approached.", paired with ground-truth output "China urges flexibility as NKorea deadline approaches", we replaced the named entities with their types, and obtained: "LOC 0 called Thursday on the ORG 0 involved in NON 2 on LOC 1 's NON 3 to show NON 0 as a NON 1 for implementing the first NON 4 of a NON 5 approached .", paired with "LOC 0 urges NON 0 as LOC 1 NON 1 approaches." Under these experimental conditions, both the SP and M-SP models still achieve significantly better performance compared to the S baseline. On the combined AFP/APW/XIN/NYT in-domain test set, the SP model achieves 61.70 ROUGE-L and M-SP achieves 61.52 ROUGE-L, compared to 60.20 ROUGE-L obtained by the S model. On the CNA/LTW out-of-domain test set, M-SP achieves 60.75 ROUGE-L, compared to 59.47 ROUGE-L by the S model.
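The entity-replacement step can be sketched as follows. This is a hypothetical illustration, not the preprocessing pipeline used in the experiment: it assumes entity spans and types are already available from some tagger, writes tags with an underscore (e.g. `LOC_0`), and shares one mapping between an article and its headline so that the same surface string receives the same tag in both.

```python
def replace_entities(tokens, spans, mapping=None):
    """Replace annotated spans with typed, indexed tags such as 'LOC_0'.

    tokens:  list of tokens
    spans:   non-overlapping (start, end, type) triples, end exclusive
    mapping: optional dict from (type, surface) to tag, reused across texts
    Returns the rewritten token list and the updated mapping.
    """
    mapping = {} if mapping is None else mapping
    # Recover the next free index per entity type from an existing mapping.
    counters = {}
    for tag in mapping.values():
        etype, idx = tag.rsplit("_", 1)
        counters[etype] = max(counters.get(etype, 0), int(idx) + 1)
    out, pos = [], 0
    for start, end, etype in sorted(spans):
        out.extend(tokens[pos:start])
        key = (etype, " ".join(tokens[start:end]))
        if key not in mapping:
            mapping[key] = f"{etype}_{counters.get(etype, 0)}"
            counters[etype] = counters.get(etype, 0) + 1
        out.append(mapping[key])
        pos = end
    out.extend(tokens[pos:])
    return out, mapping

article = "China called Thursday on the parties".split()
headline = "China urges flexibility".split()
article_out, shared = replace_entities(article, [(0, 1, "LOC"), (5, 6, "ORG")])
headline_out, shared = replace_entities(headline, [(0, 1, "LOC")], shared)
```

Reusing the mapping keeps article and headline aligned, so the model can still learn which tagged slots to copy while the lexical content itself is hidden.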
Table 3: Examples of input article (and ground-truth title) and output generated by S and M-SP. Named entities in the training instances (both article and title) are replaced by the entity type.
Example 1
article: the org 2 is to forge non 1 with the org 3 located in loc 2 , loc 1 , the per 0 of the loc 0 org 4 said tuesday .
title: loc 0 org 0 to forge non 0 with loc 1 org 1
output by S: org 0 to org 1 in non 0
output by M-SP: loc 0 org 0 to forge non 0 with loc 1 org 1
Example 2
article: loc 0 -born per 0 per 0 will pay non 1 here next month to per 1 , the org 2 ( org 1 ) per 1 who per 1 perished in an non 2 in february , the org 3 said thursday .
title: per 0 to pay non 0 to late org 1 org 0
output by S: per 0 to visit org 0 in non 0
output by M-SP: per 0 to pay non 0 to org 1 org 0

In Table 3, we show examples that indicate how style adaptation benefits summarization. For instance, we find that both CNA and XIN make more frequent use of the style pattern "xxx will/to [verb] yyy . . ., zzz said ???day" (about 15% of CNA articles contain this pattern, while only 2% of LTW articles have it). From Table 3, we can see that the S model sometimes misses or misuses the verb in its output, while the M-SP model does a much better job at capturing both the verb/action as well as other relations (via prepositions, etc.). Fig. 6 shows the estimated style probabilities over the four styles AFP/APW/XIN/NYT for CNA and LTW, under this experimental condition. We observe that, in this version as well, CNA closely matches the style of XIN, while LTW matches that of NYT. The distribution is similar to the one in Fig. 5, albeit a bit flatter as a result of content removal. As such, it supports the conclusion that the classifier indeed learns style (in addition to content) characteristics.
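A rough way to estimate how often a collection of articles uses the "will/to [verb] ..., ... said ???day" pattern is a regular expression over the article text. The sketch below is an approximation under assumptions: the regular expression, the function name `pattern_rate`, and the third sample sentence are hypothetical, and no part-of-speech check is made that the word after "will"/"to" is actually a verb.

```python
import re

# "will" or "to" followed by a word, later a comma, later "said <...>day".
STYLE_PATTERN = re.compile(r"\b(?:will|to)\s+\w+\b.*,\s+.*\bsaid\s+\w*day\b")

def pattern_rate(articles):
    """Fraction of articles containing the (approximate) style pattern."""
    if not articles:
        return 0.0
    hits = sum(1 for text in articles if STYLE_PATTERN.search(text))
    return hits / len(articles)

samples = [
    "the org 2 is to forge non 1 with the org 3 located in loc 2 , loc 1 , "
    "the per 0 of the loc 0 org 4 said tuesday .",
    "loc 0 -born per 0 per 0 will pay non 1 here next month to per 1 , "
    "the org 2 ( org 1 ) per 1 who per 1 perished in an non 2 in february , "
    "the org 3 said thursday .",
    "stocks fell sharply on monday after weak earnings .",
]
rate = pattern_rate(samples)
```

Running such a count separately per publisher would give per-style frequencies comparable in spirit to the percentages quoted above, although the exact figures depend on how the pattern is defined.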

Conclusion
In this paper, we describe two new style-adaptation model architectures for text sequence generation tasks, SHAPED and Mix-SHAPED. Both versions are shown to significantly outperform models that are either trained in a manner that ignores style characteristics (and hence exhibit a style-averaging effect in their outputs), or models that are trained on a single style.
The latter is a particularly interesting result, as a model that is trained (with enough data) on a single style and evaluated on the same style would be expected to exhibit the highest performance. Our results show that, even for single-style models trained on over 1M examples, their performance is inferior to the performance of SHAPED models on that particular style.
Our conclusion is that the proposed architectures are both efficient and effective at modeling generic language phenomena as well as particular style characteristics, and are capable of producing higher-quality abstractive outputs that take style characteristics into account.