Unsupervised Neural Text Simplification

The paper presents a first attempt towards unsupervised neural text simplification that relies only on unlabeled text corpora. The core framework is composed of a shared encoder and a pair of attentional decoders, crucially assisted by discrimination-based losses and denoising. The framework is trained using unlabeled text collected from an en-Wikipedia dump. Our analysis (both quantitative and qualitative, involving human evaluators) on public test data shows that the proposed model can perform text simplification at both lexical and syntactic levels, competitive with existing supervised methods. It also outperforms viable unsupervised baselines. Adding a few labeled pairs improves the performance further.


Introduction
Text Simplification (TS) deals with transforming the original text into simplified variants to increase its readability and understandability. TS is an important task in computational linguistics and has numerous use cases in education technology, targeted content creation, and language learning, where producing variants of a text with varying degrees of simplicity is desired. TS systems are typically designed to simplify along two linguistic aspects: (a) the lexical aspect, by replacing complex words in the input with simpler synonyms (Devlin, 1998; Candido Jr et al., 2009; Yatskar et al., 2010; Biran et al., 2011; Glavaš and Štajner, 2015), and (b) the syntactic aspect, by altering the inherent hierarchical structure of the sentences (Chandrasekar and Srinivas, 1997; Canning and Tait, 1999; Siddharthan, 2006; Filippova and Strube, 2008; Brouwers et al., 2014). From the perspective of sentence construction, sentence simplification can be thought of as a form of text transformation that involves three major types of operations: (a) splitting (Siddharthan, 2006; Petersen and Ostendorf, 2007; Narayan and Gardent, 2014), (b) deletion/compression (Knight and Marcu, 2002; Clarke and Lapata, 2006; Filippova and Strube, 2008; Rush et al., 2015; Filippova et al., 2015), and (c) paraphrasing (Specia, 2010; Coster and Kauchak, 2011; Wubben et al., 2012; Wang et al., 2016; Nisioi et al., 2017).
Most current TS systems require large-scale parallel corpora for training (except for systems like that of Glavaš and Štajner (2015), which performs only lexical simplification). This is a major impediment in scaling to newer languages, use cases, domains and output styles for which such large-scale parallel data do not exist. In fact, one of the popular corpora for TS in English, the Wikipedia-SimpleWikipedia aligned dataset, has been prone to noise (misaligned instances) and inadequacy (i.e., instances having non-simplified targets) (Xu et al., 2015; Štajner et al., 2015), leading to noisy supervised models (Wubben et al., 2012). While the creation of better datasets (such as Newsela by Xu et al. (2015)) can always help, we explore the unsupervised learning paradigm, which can potentially work with unlabeled datasets that are cheaper and easier to obtain.
At the heart of the TS problem is the need for preservation of language semantics with the goal of improving readability. From a neural-learning perspective, this entails a specially designed autoencoder, which not only is capable of reconstructing the original input but also can additionally introduce variations so that the auto-encoded output is a simplified version of the input. Intuitively, both of these can be learned by looking at the structure and language patterns of a large amount of non-aligned complex and simple sentences (which are much cheaper to obtain compared to aligned parallel data). These motivations form the basis of our work.
Our approach relies only on two unlabeled text corpora, one containing relatively simpler sentences than the other (which we call complex).
The crux of the (unsupervised) auto-encoding framework is a shared encoder and a pair of attention-based decoders (one for each type of corpus). The encoder attempts to produce semantics-preserving representations which can be acted upon by the respective decoders (simple or complex) to generate the appropriate text output they are designed for. The framework is crucially supported by two kinds of losses: (1) an adversarial loss, to distinguish between the real or fake attention context vectors for the simple decoder, and (2) a diversification loss, to distinguish between attention context vectors of the simple decoder and the complex decoder. The first loss ensures that only the aspects of semantics that are necessary for simplification are passed to the simple decoder in the form of the attention context vectors. The second loss, on the other hand, facilitates passing different semantic aspects to the different decoders through their respective context vectors. We also employ denoising in the auto-encoding setup to enable syntactic transformations.
The framework is trained using unlabeled text collected from Wikipedia (complex) and Simple Wikipedia (simple). It attempts to perform simplification both lexically and syntactically, unlike prevalent systems which mostly target them separately. We demonstrate the competitiveness of our unsupervised framework alongside supervised skylines through both automatic evaluation metrics and human evaluation studies. We also outperform another unsupervised baseline (Artetxe et al., 2018b), first proposed for neural machine translation. Further, we demonstrate that by leveraging a small amount of labeled parallel data, performance can be improved further. Our code and a new dataset containing partitioned unlabeled sets of simple and complex sentences are publicly available at https://github.com/subramanyamdvss/UnsupNTS.

Related Work
Text simplification has often been discussed from psychological and linguistic standpoints (L'Allier, 1980; McNamara et al., 1996; Linderholm et al., 2000). A heuristic-based system was first introduced by Chandrasekar and Srinivas (1997), which induces simplification rules automatically extracted from annotated corpora. Canning and Tait (1999) proposed a modular system that uses NLP tools such as a morphological analyzer and a POS tagger, plus heuristics, to simplify the text both lexically and syntactically. Most of these systems (Siddharthan, 2014) are separately targeted towards lexical and syntactic simplification and are limited to splitting and/or truncating sentences. For paraphrasing-based simplification, data-driven approaches were proposed, such as phrase-based SMT (Specia, 2010; Štajner et al., 2015) or its variants (Coster and Kauchak, 2011; Xu et al., 2016), which combine heuristic and optimization strategies for better TS. Recently proposed TS systems are based on the neural seq2seq architecture (Bahdanau et al., 2014), modified for TS-specific operations (Wang et al., 2016; Nisioi et al., 2017). While these systems produce state-of-the-art results on the popular Wikipedia dataset (Coster and Kauchak, 2011), they may not generalize because of the noise and bias in the dataset (Xu et al., 2015) and overfitting. Towards this, Štajner and Nisioi (2018) showed that improved datasets and minor model changes (such as using a reduced vocabulary and enabling a copy mechanism) help obtain reasonable performance for both in-domain and cross-domain TS.
In the unsupervised paradigm, Paetzold and Specia (2016) proposed an unsupervised lexical simplification technique that replaces complex words in the input with simpler synonyms, which are extracted and disambiguated using word embeddings. However, this work, unlike ours, addresses only lexical simplification and cannot be trivially extended to other forms of simplification such as splitting and rephrasing. Other works related to style transfer (Shen et al., 2017; Xu et al., 2018) typically look into the problem of sentiment transformation and are not motivated by the linguistic aspects of TS, and hence are not comparable to our work. As far as we know, ours is the first end-to-end solution for unsupervised TS. At this point, though supervised solutions perform better than unsupervised ones, we believe unsupervised techniques should be explored further since they hold greater potential with regard to scalability to various tasks.

Model Description
Our system is built on the encode-attend-decode style architecture (Bahdanau et al., 2014), with both algorithmic and architectural changes applied to the standard model. An input sequence of word embeddings X = {x_1, x_2, ..., x_n} (obtained after a standard lookup operation on the embedding matrix) is passed through a shared encoder (E), the output representation from which is fed to two decoders (G_s, G_d) with attention mechanisms. G_s is meant to generate a simple sentence from the encoded representation, whereas G_d generates a complex sentence. A discriminator (D) and a classifier (C) are also employed adversarially to distinguish between the attention context vectors computed with respect to the two decoders. Figure 1 illustrates our system. We describe the components below.

Encode-Attend-Decode Model
Encoder E uses two layers of bi-directional GRUs (Cho et al., 2014b), and decoders G_s and G_d have two layers of GRUs each. E extracts the hidden representations from an input sentence. The decoders output sentences sequentially, one word at a time. Each decoder step uses global attention to create a context vector (the hidden representations weighted by attention weights) as an input for the next decoder step. The attention mechanism enables the decoders to focus on different parts of the input sentence. For the input sentence X with n words, the encoder produces n hidden representations, H = {h_1, h_2, ..., h_n}. The context vector extracted from X by a decoder G for time-step t is the attention-weighted sum of these representations,
A_t(X) = Σ_{i=1}^{n} a_{it} h_i    (1)
where a_{it} denotes the attention weight for the hidden representation at the i-th input position with respect to decoder step t. As there are two decoders, A_{st}(X) and A_{dt}(X) denote the context vectors computed from decoders G_s and G_d, respectively, for time-steps t ∈ {1, ..., m}, with m denoting the total number of decoding steps performed. The matrices A_s(X) and A_d(X) represent the sequences of respective context vectors from all time-steps.
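To make the context-vector computation concrete, the following is a minimal sketch in plain Python. It assumes dot-product scoring between the decoder state and each hidden representation; the source only says that global attention is used, so the scoring function and the names `softmax` and `context_vector` are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, decoder_state):
    """Return (context, weights) for one decoder step.

    hidden_states: n encoder vectors h_1..h_n (lists of floats).
    decoder_state: current decoder state with the same dimensionality.
    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(h_k * s_k for h_k, s_k in zip(h, decoder_state))
              for h in hidden_states]
    weights = softmax(scores)  # a_{1t}..a_{nt}: non-negative, summing to 1
    dim = len(hidden_states[0])
    # Weighted sum of hidden representations, one coordinate at a time.
    context = [sum(w * h[k] for w, h in zip(weights, hidden_states))
               for k in range(dim)]
    return context, weights


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx, a = context_vector(H, [0.5, 0.5])
```

Stacking the context vectors from all m decoder steps gives the matrix that the text calls A_s(X) or A_d(X), depending on which decoder produced the attention weights.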

Discriminator and Classifier
A discriminator D is employed to influence the way the decoder G_s attends to the hidden representations, which has to be different for different types of inputs to the shared encoder E (simple vs. complex). The input to D is the context vector sequence matrix A_s pertaining to G_s, and it produces a binary output in {1, 0}, with 1 indicating that the context vector sequence is close to a typical context vector sequence extracted from simple sentences seen in the dataset. G_s and D are engaged in an adversarial interplay through an adversarial loss function (see Section 4.2), analogous to GANs (Goodfellow et al., 2014), where the generator and discriminator converge to a point at which the distribution of the generations resembles the distribution of the genuine samples. In our case, the adversarial loss tunes the context vector sequence produced by G_s from a complex sentence to ultimately resemble the context vector sequences of simple sentences in the corpora. This ensures that the resultant context vector for G_s captures only the language signals necessary to decode a simple sentence.
A classifier (C) is introduced for diversification, to ensure that the way decoder G_s attends to the hidden representations remains different from that of G_d. It helps distinguish between simple and complex context vector sequences with respect to G_s and G_d, respectively. The classifier diversifies the context vectors given as input to the different decoders. Intuitively, different linguistic signals are needed to decode a complex sentence vis-à-vis a simple one. See Section 4.3 for more details.
Both D and C use a CNN-based classifier analogous to Kim (2014). All layers are shared between D and C except the fully-connected layer preceding the softmax function.

Special Purpose Word-Embeddings
Pre-trained word embeddings are often seen to have a positive impact on sequence-to-sequence frameworks (Cho et al., 2014a). However, traditional embeddings are not good at capturing relations like synonymy (Tissier et al., 2017), which are essential for simplification. For this reason, our word embeddings are trained using the Dict2Vec framework (https://github.com/tca19/dict2vec). Dict2Vec fine-tunes the embeddings with the help of an external lexicon containing weak and strong synonymy relations. The system is trained on our whole unlabeled datasets and with the seed synonymy dictionaries provided by Tissier et al. (2017). Our encoder and decoders share the same word embeddings. Moreover, the embeddings at the input side are kept static, but the decoder embeddings are updated as training progresses. Details about hyperparameters are given in Section 5.2.

Training Procedure
Let S and D be sets of simple and complex sentences, respectively, drawn from large-scale unlabeled repositories (the set D is distinct from the discriminator D; the intended meaning is clear from context). Let X_s denote a sentence sampled from S and X_d a sentence sampled from D. Let θ_E denote the parameters of E, and θ_{G_s}, θ_{G_d} the parameters of G_s and G_d, respectively. Also, θ_D and θ_C are the parameters of the discriminator and the classifier modules, respectively. Training the model involves optimizing these parameters with respect to the losses and the denoising objective explained below.

Reconstruction Loss
Reconstruction loss is imposed on both the E-G_s and E-G_d paths. E-G_s is trained to reconstruct sentences from S, and E-G_d is trained to reconstruct sentences from D. Let P_{E-G_s}(X) and P_{E-G_d}(X) denote the reconstruction probabilities of an input sentence X estimated by the E-G_s and E-G_d models, respectively. The reconstruction loss for E-G_s and E-G_d is denoted by L_rec (Equation 2).
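A form of L_rec consistent with the definitions above is the usual negative log-likelihood of each corpus under its own path. The expression below is a sketch under that assumption, not necessarily the exact published Equation (2); calligraphic S and D denote the sentence sets.

```latex
\mathcal{L}_{rec}(\theta_E, \theta_{G_s}, \theta_{G_d})
  = -\,\mathbb{E}_{X_s \sim \mathcal{S}}\left[\log P_{E-G_s}(X_s)\right]
    -\,\mathbb{E}_{X_d \sim \mathcal{D}}\left[\log P_{E-G_d}(X_d)\right]
```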

Adversarial Loss
Adversarial loss is imposed upon the context vectors for G_s. The idea is that the context vectors extracted by G_s, even for a complex input sentence, should resemble the context vectors from a simple input sentence. The discriminator D is trained to distinguish the fake (complex) context vectors from the real (simple) context vectors. E-G_s is trained to fool the discriminator D and, at convergence, learns to produce real-like (simple) context vectors from complex input sentences. In practice, we observe that the adversarial loss indeed assists E-G_s in simplification by encouraging sentence shortening. Let A_s(.) be a sequence of context vectors as defined in Section 3.1. The adversarial loss for E-G_s is denoted by L_{adv,G_s}, and that for the discriminator D by L_{adv,D}.
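One way to write these two losses, assuming the standard GAN-style binary cross-entropy formulation (Goodfellow et al., 2014) with simple-input context vectors as "real" and complex-input context vectors as "fake", is sketched below. The exact published form may differ; D(.) is the discriminator's probability that a context vector sequence is simple, and calligraphic S and D denote the sentence sets.

```latex
\mathcal{L}_{adv,D}(\theta_D)
  = -\,\mathbb{E}_{X_s \sim \mathcal{S}}\left[\log D\big(A_s(X_s)\big)\right]
    -\,\mathbb{E}_{X_d \sim \mathcal{D}}\left[\log\big(1 - D(A_s(X_d))\big)\right]

\mathcal{L}_{adv,G_s}(\theta_E, \theta_{G_s})
  = -\,\mathbb{E}_{X_d \sim \mathcal{D}}\left[\log D\big(A_s(X_d)\big)\right]
```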

Diversification Loss
Diversification loss is imposed by the classifier C on context vectors extracted by G_d from complex input sentences, in contrast with context vectors extracted by G_s from simple input sentences. This helps E-G_s learn to generate simple context vectors that are distinguishable from complex context vectors. Let A_s(.) and A_d(.) be sequences of context vectors as defined in Section 3.1. The loss for the classifier C is denoted by L_{div,C}, and that for the model E-G_s by L_{div,G_s}.
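Under the same binary cross-entropy assumption as for the adversarial loss, the diversification losses can be sketched as follows, with C(.) the classifier's probability that a context vector sequence is a simple one. This is an illustrative reconstruction from the description above, not a verbatim copy of the published equations.

```latex
\mathcal{L}_{div,C}(\theta_C)
  = -\,\mathbb{E}_{X_s \sim \mathcal{S}}\left[\log C\big(A_s(X_s)\big)\right]
    -\,\mathbb{E}_{X_d \sim \mathcal{D}}\left[\log\big(1 - C(A_d(X_d))\big)\right]

\mathcal{L}_{div,G_s}(\theta_E, \theta_{G_s})
  = -\,\mathbb{E}_{X_d \sim \mathcal{D}}\left[\log C\big(A_s(X_d)\big)\right]
```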
Algorithm 1
Initialization phase:
  repeat
    Update θ_E, θ_{G_s}, θ_{G_d} using L_denoi
    Update θ_E, θ_{G_s}, θ_{G_d} using L_rec
    Update θ_D, θ_C using L_{adv,D}, L_{div,C}
  until the specified number of steps is completed
Adversarial phase:
  repeat
    Update θ_E, θ_{G_s}, θ_{G_d} using L_denoi
    Update θ_E, θ_{G_s}, θ_{G_d} using L_{adv,G_s}, L_{div,G_s}, L_rec
    Update θ_D, θ_C using L_{adv,D}, L_{div,C}
  until the specified number of steps is completed
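The ordering of updates in the two phases can be sketched as a plain Python loop. The `update` callback and the string names for parameter groups and losses are hypothetical placeholders; in a real system each call would compute the named losses on a batch and take an optimizer step on the named parameters.

```python
def run_schedule(init_steps, adv_steps, update):
    """Call update(param_group, losses) in the order given by the two phases."""
    enc_dec = ("theta_E", "theta_Gs", "theta_Gd")
    critics = ("theta_D", "theta_C")

    for _ in range(init_steps):  # initialization phase
        update(enc_dec, ("L_denoi",))
        update(enc_dec, ("L_rec",))
        update(critics, ("L_adv_D", "L_div_C"))

    for _ in range(adv_steps):  # adversarial phase
        update(enc_dec, ("L_denoi",))
        update(enc_dec, ("L_adv_Gs", "L_div_Gs", "L_rec"))
        update(critics, ("L_adv_D", "L_div_C"))


# Record the calls instead of training, to inspect the schedule.
log = []
run_schedule(2, 1, lambda params, losses: log.append((params, losses)))
```

Note that during initialization the discriminator and classifier are trained, but their losses are never applied to the encoder or decoders; that only happens in the adversarial phase.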

Denoising
Denoising has proven helpful for learning syntactic/structural transformations from the source side to the target side (Artetxe et al., 2018b). Syntactic transformation often requires reordering the input, which the denoising procedure aims to capture. Denoising involves arbitrarily reordering the inputs and reconstructing the original (unperturbed) input from such reordered inputs. In our implementation, the source sentence is reordered by swapping bigrams in the input sentence. Let P_{E-G_s}(X|noise(X)) and P_{E-G_d}(X|noise(X)) denote the probabilities that the original input X is reconstructed from its perturbed version noise(X) by E-G_s and E-G_d, respectively. The denoising loss for the models E-G_s and E-G_d, denoted by L_denoi(θ_E, θ_{G_s}, θ_{G_d}), is defined over these probabilities.
Figure 1 depicts the overall architecture and the losses described above; the training procedure is described in Algorithm 1. The initialization phase involves training E-G_s and E-G_d using the reconstruction and denoising losses only. Next, D and C are trained using the respective adversarial and diversification losses. These losses are not used to update the decoders at this point. This gives the discriminator, classifier and decoders time to learn independently of each other.
In the adversarial phase, adversarial and diversification losses are introduced alongside the denoising and reconstruction losses for fine-tuning the encoder and decoders. Algorithm 1 is intended to produce the following results: (i) E-G_s should simplify its input (irrespective of whether it is simple or complex), and (ii) E-G_d should act as an auto-encoder in the complex sentence domain. The discriminator and classifier enable preserving the appropriate aspects of semantics necessary for each of these pathways through proper modulation of the attention context vectors.
A key requirement for a model like ours is that the dataset used has to be partitioned into two sets, containing relatively simple and relatively complex sentences. The rationale behind having two decoders is that while G_s will try to introduce simplified constructs (possibly at the expense of some loss of semantics), G_d will help preserve the semantics. The idea behind using the discriminator and classifier is to retain signals related to language simplicity, from which G_s will construct simplified sentences. Finally, denoising will help tackle nuances related to syntactic transfer in the complex-to-simple direction. We remind the reader that TS, unlike machine translation, needs complex syntactic operations such as sentence splitting, rephrasing and paraphrasing, which cannot be tackled by the losses and denoising alone. Employing additional explicit mechanisms to handle these in the pipeline is out of the scope of this paper, since we seek a prima facie judgement of our architecture based on how much simplification knowledge can be gained just from the data.
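The bigram-swapping noise used for denoising can be illustrated with a short sketch. The source only states that bigrams are swapped; the swap probability, the left-to-right scan over non-overlapping pairs, and the function name `swap_noise` are assumptions made for this example.

```python
import random


def swap_noise(tokens, swap_prob=0.5, rng=None):
    """Return a copy of tokens in which some adjacent pairs are swapped.

    swap_prob and the non-overlapping left-to-right pairing are assumptions
    for this sketch; they are not taken from the paper.
    """
    rng = rng or random.Random()
    noisy = list(tokens)
    i = 0
    while i + 1 < len(noisy):
        if rng.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2  # skip the swapped pair so each token moves at most once
        else:
            i += 1
    return noisy


sentence = "the cat sat on the mat".split()
noisy = swap_noise(sentence, swap_prob=0.5, rng=random.Random(0))
```

The denoising objective then asks each path to recover `sentence` when given `noisy` as input, which encourages the model to learn local reordering.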

Training with Minimal Supervision
Our system, by design, is highly data-driven and, like any other sequence-to-sequence learning system, can also leverage labeled data. We propose a semi-supervised variant of our system that can gain additional knowledge of simplification with the help of a small amount of labeled data (on the order of a few thousand pairs). The system undergoes training following steps similar to Algorithm 1, except that it adds another step of optimizing the cross-entropy loss for both the E-G_s and E-G_d pathways using the reference texts available in the labeled dataset. This step is carried out in the adversarial phase along with the other steps (see Algorithm 2).
The cross-entropy loss is imposed on both the E-G_s and E-G_d paths using a parallel dataset (details in Section 5.1) denoted by Δ = (S_p, D_p). For a given parallel simplification sentence pair (X_s, X_d), let P_{E-G_s}(X_s|X_d) and P_{E-G_d}(X_d|X_s) denote the probabilities that X_s is produced from X_d by E-G_s and that X_d is produced from X_s by E-G_d, respectively. The cross-entropy loss for E-G_s and E-G_d, denoted by L_cross(θ_E, θ_{G_s}, θ_{G_d}), is defined over these probabilities.

Algorithm 2: Semi-supervised simplification algorithm using denoising, reconstruction, adversarial and diversification losses followed by cross-entropy loss using parallel data.
Initialization phase:
  repeat
    Update θ_E, θ_{G_s}, θ_{G_d} using L_denoi
    Update θ_E, θ_{G_s}, θ_{G_d} using L_rec
    Update θ_D, θ_C using L_{adv,D}, L_{div,C}
  until the specified number of steps is completed
Adversarial phase:
  repeat
    Update θ_E, θ_{G_s}, θ_{G_d} using L_denoi
    Update θ_E, θ_{G_s}, θ_{G_d} using L_{adv,G_s}, L_{div,G_s}, L_rec
    Update θ_D, θ_C using L_{adv,D}, L_{div,C}
    Update θ_E, θ_{G_s} using L_cross
    Update θ_E, θ_{G_d} using L_cross
  until the specified number of steps is completed
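For the cross-entropy loss on parallel pairs, a form consistent with the probabilities defined in this section is the negative log-likelihood of each side of the pair given the other. It is written here as an assumption rather than a verbatim copy of the published equation.

```latex
\mathcal{L}_{cross}(\theta_E, \theta_{G_s}, \theta_{G_d})
  = -\,\mathbb{E}_{(X_s, X_d) \sim \Delta}\left[\log P_{E-G_s}(X_s \mid X_d)\right]
    -\,\mathbb{E}_{(X_s, X_d) \sim \Delta}\left[\log P_{E-G_d}(X_d \mid X_s)\right]
```

The first term only involves θ_E and θ_{G_s} and the second only θ_E and θ_{G_d}, which matches the two separate cross-entropy update steps in the adversarial phase.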

Experiment Setup
In this section we describe the dataset, architectural choices, and model hyperparameters. The implementation of the experimental setup is publicly available at https://github.com/subramanyamdvss/UnsupNTS.

Dataset
For training our system, we created an unlabeled dataset of simple and complex sentences by partitioning the standard en-Wikipedia dump. Since partitioning requires a metric for measuring text simpleness, we categorize sentences based on their Flesch Readability Ease (FE) scores (Flesch, 1948). Sentences with lower FE values (up to 10) are categorized as complex, and sentences with FE values greater than 70 are categorized as simple. The FE bounds were decided by trial and error through manual inspection of the categorized sentences. Table 1 shows dataset statistics. Even though the dataset was created with some level of human mediation, the manual effort is insignificant compared to that needed to create a parallel corpus. To train the system with minimal supervision (Section 4.5), we extract 10,000 pairs of sentences from various datasets such as the Wikipedia-SimpleWikipedia dataset introduced in Hwang et al. (2015) and the Split-Rephrase dataset by Narayan et al. (2017). The Wikipedia-SimpleWikipedia dataset was filtered following Nisioi et al. (2017), and 4000 examples were randomly picked from the filtered set. From the Split-Rephrase dataset, examples containing one compound/complex sentence on the source side and two simple sentences on the target side were selected, and 6000 examples were randomly picked from the selected set. The Split-Rephrase dataset is used to promote sentence splitting in the proposed system.
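The FE-based partitioning can be sketched as follows. The formula is the standard Flesch (1948) reading ease score; the vowel-group syllable counter, the regular-expression tokenization, and the function names are simplifying assumptions for illustration and are not the tooling used by the authors. The thresholds (at most 10 for complex, above 70 for simple) are the ones quoted above.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one (heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def partition_label(text, complex_max=10.0, simple_min=70.0):
    """Return 'complex', 'simple', or None for sentences between the bounds."""
    fe = flesch_reading_ease(text)
    if fe <= complex_max:
        return "complex"
    if fe > simple_min:
        return "simple"
    return None  # discarded: neither clearly simple nor clearly complex


easy = "The cat sat on the mat."
hard = ("Internationalization considerations necessitate "
        "comprehensive organizational restructuring.")
```

Sentences whose score falls between the two bounds are assigned to neither set, which leaves a gap between the simple and complex partitions.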
To select and evaluate our models, we use the test and development sets released by Xu et al. (2016). The test set (359 sentences) and development set (2000 sentences) have 8 simplified reference sentences for each source sentence.

Hyperparameter Settings
For all the variants, we use a hidden state of size 600 and a word-embedding size of 300. Classifier C and discriminator D use convolutional layers with filter sizes from 1 to 5; 128 filters of each size are used in the CNN layers. Other training-related hyperparameters include a learning rate of 0.00012 for θ_E, θ_{G_s}, θ_{G_d} and 0.0005 for θ_D, θ_C, and a batch size of 36. For learning the word embeddings with Dict2Vec, the window size is set to 5. Our experiments used at most 13 GB of GPU memory. The initialization phase and the adversarial phase took 6000 and 8000 batch steps, respectively, for both the UNTS and UNTS+10K systems.
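For reference, the settings listed above can be collected in a single configuration object. The class and field names below are hypothetical and chosen only for readability; the values are the ones stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UNTSConfig:
    """Hyperparameters quoted in the text; field names are illustrative."""
    hidden_size: int = 600
    embedding_size: int = 300
    cnn_filter_sizes: tuple = (1, 2, 3, 4, 5)
    cnn_filters_per_size: int = 128
    lr_encoder_decoders: float = 0.00012          # theta_E, theta_Gs, theta_Gd
    lr_discriminator_classifier: float = 0.0005   # theta_D, theta_C
    batch_size: int = 36
    dict2vec_window: int = 5
    init_phase_steps: int = 6000
    adversarial_phase_steps: int = 8000


config = UNTSConfig()
total_steps = config.init_phase_steps + config.adversarial_phase_steps
```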

Evaluation Metrics
For automatic evaluation of our system on the test data, we used four metrics: (a) SARI, (b) BLEU, (c) FE Difference, and (d) Word Difference, which are briefly explained below.
SARI (Xu et al., 2016) is an automatic evaluation metric designed to measure the simpleness of the generated sentences. SARI requires access to the source, predictions and references for evaluation. Computing SARI involves penalizing n-gram additions to the source that are inconsistent with the references. Deletion and keep operations are penalized similarly. The overall score is a balanced sum of all the penalties. BLEU (Papineni et al., 2002), a popular metric for evaluating generations and translations, is used to measure the correctness of the generations by measuring overlaps between the generated sentences and (multiple) references.
We also compute the average FE score difference between predictions and source in our evaluations. FE-difference measures whether the changes made by the model increase the readability ease of the generated sentence. Word Difference is the average difference between the number of words in the source sentence and in the generation. It is a simple and approximate metric proposed to detect whether sentence shortening is occurring. Generations with few changes can still have high SARI and BLEU. Models with such generations can be ruled out by imposing a threshold on the word-diff metric.
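The word-difference metric is simple enough to state in a few lines. The sketch below assumes whitespace tokenization and the sign convention source length minus generation length, so that positive values indicate shortening; both are assumptions, as the text does not specify them.

```python
def word_difference(sources, generations):
    """Average of (source length - generation length), in whitespace tokens."""
    if len(sources) != len(generations) or not sources:
        raise ValueError("need equally sized, non-empty lists")
    diffs = [len(s.split()) - len(g.split())
             for s, g in zip(sources, generations)]
    return sum(diffs) / len(diffs)


srcs = ["a b c d", "e f"]
gens = ["a b", "e f"]
avg_diff = word_difference(srcs, gens)
```

A model that mostly copies its input scores close to zero on this metric, which is why it is used alongside SARI and BLEU during model selection.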
Models with high word-diff, SARI and BLEU are picked during model-selection (with validation data). Model selection also involved manually examining the quality and relevance of generations.
We carry out a qualitative analysis of our system through human evaluation. For this, the first 50 samples were selected from the test data. The outputs of the seven systems reported in Table 2, along with the sources, are presented to two native English speakers who provide three ratings for each output: (a) Simpleness, a binary score (0 or 1) indicating whether the output is a simplified version of the input or not; (b) Grammaticality of the output in the range [1-5], in increasing order of fluency; and (c) Relatedness in the range [1-5], showing whether the overall semantics of the input is preserved in the output or not.

Model Variants
Using our design, we propose two different variants for evaluation: (i) Unsupervised Neural TS (UNTS), with SARI as the criterion for model selection, and (ii) UNTS with minimal supervision using 10,000 labeled examples (UNTS+10K). Models selected using other selection criteria such as BLEU resulted in similar and/or reduced performance (details skipped for brevity).
We carried out the following basic post-processing steps on the generated outputs. The OOV (out-of-vocabulary) words in the generations are replaced by the source words with high attention weights. Words repeated consecutively in the generated sentences are merged.
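The two post-processing steps can be sketched as follows. The sketch assumes that an OOV output is marked by a placeholder token (here the hypothetical `<unk>`) and that it is replaced by the single source word with the highest attention weight at that decoding step; the text says only "high attention weights", so the argmax choice is an assumption.

```python
def postprocess(generated, source, attention, oov_token="<unk>"):
    """Replace OOV tokens using attention, then merge consecutive repeats.

    generated: list of output tokens.
    source: list of input tokens.
    attention[t][i]: weight of source position i at output step t.
    """
    out = []
    for t, tok in enumerate(generated):
        if tok == oov_token:
            weights = attention[t]
            best = max(range(len(source)), key=weights.__getitem__)
            tok = source[best]  # copy the most-attended source word
        if not out or out[-1] != tok:  # drop immediate repetitions
            out.append(tok)
    return out


src = ["the", "feline", "sat"]
gen = ["the", "<unk>", "sat", "sat"]
att = [[0.8, 0.1, 0.1],
       [0.1, 0.7, 0.2],
       [0.1, 0.1, 0.8],
       [0.1, 0.1, 0.8]]
cleaned = postprocess(gen, src, att)
```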

Systems for Comparison
In the absence of any other direct baseline for end-to-end TS, we consider the following unsupervised baselines. First, we consider the unsupervised NMT framework proposed by Artetxe et al. (2018b). It uses techniques such as back-translation and denoising to synthesize more training examples. To use this framework, we treated the sets of simple and complex sentences as two different languages. The same model configuration as reported by Artetxe et al. (2018b) is used. We use the term UNMT for this system. Similar to the UNMT system, we also consider unsupervised statistical machine translation (termed USMT) proposed by Artetxe et al. (2018a), with the default parameter settings. Another system, based on the cross-alignment technique proposed by Shen et al. (2017), is also used for comparison. This system was originally proposed for the task of sentiment translation. We term this system ST.
We also compare our approach with existing unsupervised lexical simplification systems such as LIGHTLS (Glavaš and Štajner, 2015) and with supervised systems such as Neural Text Simplification or NTS (Nisioi et al., 2017), Syntax-based Machine Translation or SBMT (Xu et al., 2016), and PBSMT (Wubben et al., 2012). All the systems are trained using the Wikipedia-SimpleWikipedia dataset (Hwang et al., 2015). The test set is the same for all of these and for our models.

Results
Table 2 shows evaluation results of our proposed approaches along with existing supervised and unsupervised alternatives. We observe that unsupervised baselines such as UNMT and USMT often, after attaining convergence, recreate sentences similar to the inputs. This explains why they achieve higher BLEU and reduced word-difference scores. The ST system did not converge on our dataset after a significant number of epochs, which affected the performance metrics. The system often produces short sentences which are simple but do not retain important phrases.
Other supervised systems such as SBMT and NTS achieve better content reduction as shown through SARI, BLEU and FE-diff scores; this is expected. However, it is still a good sign that the scores for the unsupervised system UNTS are not far from the supervised skylines. The higher word-diff scores for the unsupervised system also indicate that it is able to perform content reduction (a form of syntactic simplification), which is crucial to TS. This is unlike the existing unsupervised LIGHTLS system which often replaces nouns with related non-synonymous nouns; sometimes increasing the complexity and affecting the meaning. Finally, it is worth noting that aiding the system with a very small amount of labeled data can also benefit our unsupervised pipeline, as suggested by the scores for the UNTS+10K system.
In Table 3, the first column gives the percentage of outputs that are simplified versions of the input. The second and third columns present the average fluency (grammaticality) scores given by human evaluators and semantic relatedness with input scored through automatic means. Almost all systems are able to produce sentences that are somewhat grammatically correct and retain phrases from the input. Supervised systems like PBSMT, as expected, simplify the sentences to the maximum extent. However, our unsupervised variants have scores competitive with the supervised skylines, which is a positive sign. Table 4 shows an anecdotal example containing outputs from the seven systems. As can be seen, the quality of output from our unsupervised variants is far from that of the reference output. However, the attempts towards performing lexical simplification (by replacing the word "Nevertheless" with "However") and simplification of multi-word phrases ("Tagore emulated numerous styles" getting translated to "Tagore replaced many styles") are quite visible and encouraging. Table 5 presents a few examples demonstrating the capabilities of our system in performing simplifications at the lexical and syntactic levels. We do observe that such operations are carried out only for a few instances in our test data. Also, our analysis in Appendix B indicates that the system can improve over time with the addition of more data. Results for ablations on adversarial and diversification loss are included in Appendix A.

Conclusion
In this paper, we made a novel attempt towards unsupervised text simplification. We gathered unlabeled corpora containing simple and complex sentences and used them to train our system, which is based on a shared encoder and two decoders. A novel training scheme is proposed which allows the model to perform content reduction and lexical simplification simultaneously through our proposed losses and denoising. Experiments were conducted for multiple variants of our system as well as known unsupervised baselines and supervised systems. Qualitative and quantitative analysis of the outputs on publicly available test data demonstrates that our models, though unsupervised, can perform better than or competitively with these baselines. In the future, we would like to improve the system further by incorporating better architectural designs and training schemes to tackle complex simplification operations.

Table 5: Examples of simplifications performed by our system at the lexical and syntactic levels.

Type of Simplification: Splitting
Source: Calvin Baker is an American novelist .
Prediction: Calvin Baker is an American . American Baker is a birthplace .

Type of Simplification: Sentence Shortening
Source: During an interview , Edward Gorey mentioned that Bawden was one of his favorite artists , lamenting the fact that not many people remembered or knew about this fine artist .
Prediction: During an interview , Edward Gorey mentioned that Bawden was one of his favorite artists .

Type of Simplification: Lexical Replacement
Source: In architectural decoration Small pieces of colored and iridescent shell have been used to create mosaics and inlays , which have been used to decorate walls , furniture and boxes .
Prediction: In impressive decoration Small pieces of colored and reddish shell have been used to create statues and inlays , which have been used to decorate walls , furniture and boxes .

A Ablation Studies
The following table shows results of the proposed system with ablations on adversarial loss (UNTS-ADV) and diversification loss (UNTS-DIV).

B Effects of Variation in Labeled Data Size
The following table shows the effect of labeled data size on the performance of the system. We supplied the system with 2K, 5K, and 10K pairs of complex and simple sentences. From the trained models, models with similar word-diff are chosen for fair comparison. Our observation is that, with increasing data, both BLEU and SARI increase.