Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models

Pre-trained language models have achieved huge success on a wide range of NLP tasks. However, contextual representations from pre-trained models contain entangled semantic and syntactic information, and therefore cannot be directly used to derive useful semantic sentence embeddings for some tasks. Paraphrase pairs offer an effective way of learning the distinction between semantics and syntax, as they naturally share semantics and often vary in syntax. In this work, we present ParaBART, a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings obtained by pre-trained language models. ParaBART is trained to perform syntax-guided paraphrasing, based on a source sentence that shares semantics with the target paraphrase, and a parse tree that specifies the target syntax. In this way, ParaBART learns disentangled semantic and syntactic representations from their respective inputs with separate encoders. Experiments in English show that ParaBART outperforms state-of-the-art sentence embedding models on unsupervised semantic similarity tasks. Additionally, we show that our approach can effectively remove syntactic information from semantic sentence embeddings, leading to better robustness against syntactic variation on downstream semantic tasks.


Introduction
Semantic sentence embedding models encode sentences into fixed-length vectors based on their semantic relatedness with each other. If two sentences are more semantically related, their corresponding sentence embeddings are closer. As sentence embeddings can be used to measure semantic relatedness without requiring supervised data, they have been used in many applications, such as semantic textual similarity (Agirre et al., 2016a), question answering (Nakov et al., 2017), and natural language inference (Artetxe and Schwenk, 2019a).
Recent years have seen huge success of pre-trained language models across a wide range of NLP tasks (Devlin et al., 2019; Lewis et al., 2020). However, several studies (Reimers and Gurevych, 2019; Li et al., 2020) have found that sentence embeddings from pre-trained language models perform poorly on semantic similarity tasks when the models are not fine-tuned on task-specific data. Meanwhile, Goldberg (2019) shows that BERT without fine-tuning performs surprisingly well on syntactic tasks. Hence, we posit that contextual representations from pre-trained language models without fine-tuning capture entangled semantic and syntactic information, and are therefore not suitable for sentence-level semantic tasks.
Ideally, the semantic embedding of a sentence should not encode its syntax, and two semantically similar sentences should have close semantic embeddings regardless of their syntactic differences. While various models (Conneau et al., 2017;Cer et al., 2018;Reimers and Gurevych, 2019) have been proposed to improve the performance of sentence embeddings on downstream semantic tasks, most of these approaches do not attempt to separate syntactic information from sentence embeddings.
To this end, we propose ParaBART, a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings. Our model is built upon BART (Lewis et al., 2020), a sequence-to-sequence Transformer (Vaswani et al., 2017) model pre-trained with self-denoising objectives. Parallel paraphrase data is a good source for learning the distinction between semantics and syntax, as paraphrase pairs naturally share the same meaning but often differ in syntax. Taking advantage of this fact, ParaBART is trained to perform syntax-guided paraphrasing, where a source sentence containing the desired semantics and a parse tree specifying the desired syntax are given as inputs. In order to generate a paraphrase that follows the given syntax, ParaBART uses separate encoders to learn disentangled semantic and syntactic representations from their respective inputs. In this way, the disentangled representations capture sufficient semantic and syntactic information needed for paraphrase generation. The semantic encoder is also encouraged to ignore the syntax of the source sentence, as the desired syntax is already provided by the syntax input.
ParaBART achieves strong performance across unsupervised semantic textual similarity tasks. Furthermore, semantic embeddings learned by ParaBART contain significantly less syntactic information as suggested by probing results, and yield robust performance on datasets with syntactic variation.

Related Work
Various sentence embedding models have been proposed in recent years. Most of these models utilize supervision from parallel data (Artetxe and Schwenk, 2019b; Wieting et al., 2019, 2020), natural language inference data (Conneau et al., 2017; Cer et al., 2018; Reimers and Gurevych, 2019), or a combination of both (Subramanian et al., 2018).
Many efforts towards controlled text generation have been focused on learning disentangled sentence representations (Hu et al., 2017;Fu et al., 2018;John et al., 2019). In the context of disentangling semantics and syntax, Bao et al. (2019) and Chen et al. (2019) utilize variational autoencoders to learn two latent variables for semantics and syntax. In contrast, we use the outputs of a constituency parser to learn purely syntactic representations, and facilitate the usage of powerful pre-trained language models as semantic encoders.
Our approach is also related to prior work on syntax-controlled paraphrase generation (Iyyer et al., 2018; Kumar et al., 2020; Goyal and Durrett, 2020; Huang and Chang, 2021). While these approaches focus on generating high-quality paraphrases that conform to the desired syntax, we are interested in how semantic and syntactic information can be disentangled and how to obtain good semantic sentence embeddings.

Figure 1: An overview of ParaBART. The model extracts semantic and syntactic representations from a source sentence and a target parse respectively, and uses both the semantic sentence embedding and the target syntactic representations to generate the target paraphrase. ParaBART is trained in an adversarial setting, with the syntax discriminator (red) trying to decode the source syntax from the semantic embedding, and the paraphrasing model (blue) trying to fool the syntax discriminator and generate the target paraphrase at the same time.

Proposed Model -ParaBART
Our goal is to build a semantic sentence embedding model that learns to separate syntax from semantic embeddings. ParaBART is trained to generate syntax-guided paraphrases, where the model attempts to only extract the semantic part from the input sentence, and combine it with a different syntax specified by the additional syntax input in the form of a constituency parse tree. Figure 1 outlines the proposed model, which consists of a semantic encoder that learns the semantics of a source sentence, a syntactic encoder that encodes the desired syntax of a paraphrase, and a decoder that generates a corresponding paraphrase. Additionally, we add a syntax discriminator to adversarially remove syntactic information from the semantic embeddings.
Given a source sentence S1 and a target constituency parse tree P2, ParaBART is trained to generate a paraphrase S2 that shares the semantics of S1 and conforms to the syntax specified by P2. Semantics and syntax are two key aspects that determine how a sentence is generated. Our model learns purely syntactic representations from the output trees generated by a constituency parser, and extracts the semantic embedding directly from the source sentence. The syntax discriminator and the syntactic encoder are designed to remove source syntax and provide target syntax, thus encouraging the semantic encoder to capture only the source semantics.

Semantic Encoder
The semantic encoder E_sem is a Transformer encoder that embeds a sentence S = (s^(1), ..., s^(m)) into contextual semantic representations

U = (u^(1), ..., u^(m)) = E_sem(S).

Then, we take the mean of these contextual representations u^(i) to get a fixed-length semantic sentence embedding

ū = (1/m) Σ_{i=1}^{m} u^(i).

Syntactic Encoder The syntactic encoder E_syn is a Transformer encoder that takes a linearized constituency parse tree P = (p^(1), ..., p^(n)) and converts it into contextual syntactic representations

V = (v^(1), ..., v^(n)) = E_syn(P).

For example, the linearized parse tree of the sentence "This book is good." is "(S (NP (DT) (NN)) (VP (VBZ) (ADJP)) (.))". Such an input sequence preserves the tree structure, allowing the syntactic encoder to capture the exact syntax needed for decoding.
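As an illustration, the mean-pooling step and the parse linearization format can be sketched in a few lines of Python (the helper names are hypothetical; the actual encoders are Transformer networks, which we omit here):

```python
def mean_pool(token_vectors):
    """Average contextual token representations u(1)..u(m) into a
    fixed-length semantic sentence embedding (u-bar in the text)."""
    m = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / m for d in range(dim)]

def linearize(tree):
    """Linearize a constituency parse tree, dropping the leaf words so
    the sequence is purely syntactic, e.g. (S (NP (DT) (NN)) ...)."""
    label, children = tree
    if not children:  # pre-terminal: keep only the POS tag
        return f"({label})"
    return "(" + label + " " + " ".join(linearize(c) for c in children) + ")"

# Parse tree of "This book is good." with the words removed
tree = ("S", [("NP", [("DT", []), ("NN", [])]),
              ("VP", [("VBZ", []), ("ADJP", [])]),
              (".", [])])
```

Linearizing `tree` reproduces the example string above, and `mean_pool` simply averages token vectors componentwise.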
Decoder The decoder D_dec uses the semantic sentence embedding ū and the contextual syntactic representations V to generate a paraphrase that shares semantics with the source sentence while following the syntax of the given parse tree. In other words,

S2 = D_dec(ū, V).

During training, given a source sentence S1, a target parse tree P2, and a target paraphrase S2 = (s2^(1), ..., s2^(l)), we minimize the following paraphrase generation loss:

L_para = − Σ_{t=1}^{l} log p(s2^(t) | s2^(<t), ū1, V2).

Since the syntactic representations do not contain semantics, the semantic encoder needs to accurately capture the semantics of the source sentence for a paraphrase to be generated. Meanwhile, the full syntactic structure of the target is provided by the syntactic encoder, thus encouraging the semantic encoder to ignore the source syntax.
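The paraphrase generation loss is a standard token-level negative log-likelihood. A minimal numerical sketch (the helper is hypothetical and takes the decoder's probabilities of the gold target tokens as input):

```python
import math

def paraphrase_loss(target_token_probs):
    """Negative log-likelihood of the target paraphrase S2 = (s2(1)..s2(l)),
    where target_token_probs[t] is the decoder's probability of the gold
    token s2(t) given the semantic embedding, target syntax, and s2(<t)."""
    return -sum(math.log(p) for p in target_token_probs)
```

For example, if the decoder assigns probability 0.5 to a single gold token, the loss equals log 2; perfect probabilities of 1.0 give a loss of 0.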
Syntax Discriminator To further encourage the disentanglement of semantics and syntax, we employ a syntax discriminator to adversarially remove syntactic information from semantic embeddings. We first train the syntax discriminator to predict the syntax from its semantic embedding, and then train the semantic encoder to "fool" the syntax discriminator such that the source syntax cannot be predicted from the semantic embedding.
More specifically, we adopt a simplified approach similar to John et al. (2019) by encoding the source syntax as a bag-of-words vector h of its constituency parse tree. For any given source parse tree, this vector contains the count of occurrences of each constituent tag, divided by the total number of constituents in the parse tree. Given the semantic sentence embedding ū, our linear syntax discriminator D_dis predicts h by

ĥ = softmax(W ū + b),

with the following adversarial loss:

L_adv = − Σ_{t∈T} h_t log ĥ_t,

where T denotes the set of all constituent tags.
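A small sketch of the bag-of-words syntax vector and the cross-entropy term (helper names are hypothetical; the discriminator's linear projection and softmax are omitted):

```python
import math
from collections import Counter

def syntax_bow(constituent_tags, tag_set):
    """Bag-of-words syntax vector h: the count of each constituent tag
    divided by the total number of constituents in the parse tree."""
    counts = Counter(constituent_tags)
    total = len(constituent_tags)
    return [counts[t] / total for t in tag_set]

def adversarial_loss(h, h_hat):
    """Cross-entropy between the true tag distribution h and the
    discriminator's prediction h_hat (one term per tag in T)."""
    return -sum(ht * math.log(pt) for ht, pt in zip(h, h_hat) if ht > 0)

# Tags of "(S (NP (DT) (NN)) (VP (VBZ) (ADJP)) (.))" -- 8 constituents
tags = ["S", "NP", "DT", "NN", "VP", "VBZ", "ADJP", "."]
```

Here each of the eight tags occurs once, so every entry of h for a tag present in the tree is 1/8.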
Training We adversarially train E_sem, E_syn, D_dec, and D_dis with the following objective:

min_{E_sem, E_syn, D_dec} max_{D_dis} L_para − λ_adv L_adv,

where λ_adv is a hyperparameter that balances the loss terms. In each iteration, we first update D_dis by solving the inner maximization, and then update E_sem, E_syn, and D_dec by solving the outer minimization.
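The alternating schedule can be sketched as follows (a toy illustration with hypothetical names; the real updates are gradient steps on the losses defined above):

```python
LAMBDA_ADV = 0.1  # adversarial-loss weight used in our experiments

def generator_objective(l_para, l_adv):
    """Outer objective minimized by E_sem, E_syn, and D_dec:
    the paraphrase loss minus the weighted adversarial loss."""
    return l_para - LAMBDA_ADV * l_adv

def train_iteration(discriminator_step, generator_step):
    """One adversarial iteration: the inner step updates D_dis to better
    predict the source syntax; the outer step updates the rest of the
    model to paraphrase well while fooling D_dis."""
    discriminator_step()  # inner maximization over D_dis
    generator_step()      # outer minimization over E_sem, E_syn, D_dec
```

The discriminator is always updated before the rest of the model within each iteration, so the outer step minimizes against the current best syntax predictor.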

Experiments
In this section, we demonstrate that ParaBART is capable of learning semantic sentence embeddings that capture semantic similarity, contain less syntactic information, and yield robust performance against syntactic variation on semantic tasks.

Setup
We sample 1 million English paraphrase pairs from ParaNMT-50M, and split this dataset into 5,000 pairs as the validation set and the rest as our training set. The constituency parse trees of all sentences are obtained from Stanford CoreNLP (Manning et al., 2014). We fine-tune a 6-layer BART-base encoder as the semantic encoder and the first BART-base decoder layer as the decoder for our model. We train ParaBART on a GTX 1080Ti GPU using the AdamW (Loshchilov and Hutter, 2019) optimizer with a learning rate of 2 × 10^-5 for the encoder and syntax discriminator, and 1 × 10^-4 for the rest of the model. The batch size is set to 64. All models are trained for 10 epochs, which takes about 2 days to complete. The maximum lengths of input sentences and linearized parse trees are set to 40 and 160, respectively. We set the weight of the adversarial loss to 0.1. Appendix A shows more implementation details.
Baselines We compare our model with other sentence embedding models, including InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), and the sentence embedding model of Wieting et al. (2020). We also include mean-pooled BERT-base and BART-base embeddings. In addition to ParaBART, we consider two model ablations: ParaBART without adversarial loss, and ParaBART without syntactic guidance and adversarial loss.

Semantic Textual Similarity
We evaluate our semantic sentence embeddings on the unsupervised Semantic Textual Similarity (STS) tasks from SemEval 2012 to 2016 (Agirre et al., 2012, 2013, 2016b) and the STS Benchmark test set (Cer et al., 2017), where the goal is to predict a continuous-valued score between 0 and 5 indicating how similar the meanings of a sentence pair are. For all models, we compute the cosine similarity of embedding vectors as the semantic similarity measure. We use the standard SentEval toolkit (Conneau and Kiela, 2018) for evaluation and report the average Pearson correlation over all domains.

As shown in Table 1, both average BERT embeddings and average BART embeddings perform poorly on STS tasks, as the entanglement of semantic and syntactic information leads to low correlation with semantic similarity. Training ParaBART on paraphrase data substantially improves the correlation. With the addition of syntactic guidance and adversarial loss, ParaBART achieves the best overall performance across STS tasks, showing the effectiveness of our approach.
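The similarity measure used throughout this evaluation is plain cosine similarity between embedding vectors, which can be sketched as:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sentence embeddings, used as the
    unsupervised semantic similarity score for a sentence pair."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Parallel embeddings score 1, orthogonal embeddings score 0, so more semantically related pairs should receive higher scores.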

Syntactic Probing
To better understand how well our model learns to disentangle syntactic information from semantic embeddings, we probe our semantic sentence embeddings with downstream syntactic tasks. Following Conneau et al. (2018), we investigate to what degree our semantic sentence embeddings can be used to identify bigram word reordering (BShift), estimate parse tree depth (TreeDepth), and predict parse tree top-level constituents (Top-Const). Top-level constituents are defined as the group of constituency parse tree nodes immediately below the sentence (S) node. We use the datasets provided by SentEval (Conneau and Kiela, 2018) to train a Multi-Layer Perceptron classifier with a single 50-neuron hidden layer on top of semantic sentence embeddings, and report accuracy on all three tasks.

Table 3: Example paraphrase pairs from QQP-Easy and QQP-Hard.
QQP-Easy: "What are the essential skills of the project management?" / "What are the essential skills of a project manager?"
QQP-Hard: "Is there a reason why we should travel alone?" / "What are some reasons to travel alone?"
As shown in Table 2, sentence embeddings pooled from the pre-trained BART model contain rich syntactic information that can be used to accurately predict syntactic properties such as word order and top-level constituents. The disentanglement induced by ParaBART is evident: it lowers the accuracy on downstream syntactic tasks by more than 10 points compared to pre-trained BART embeddings and to ParaBART without adversarial loss and syntactic guidance. These results suggest that the semantic sentence embeddings learned by ParaBART indeed contain less syntactic information.

Robustness Against Syntactic Variation
Intuitively, semantic sentence embedding models that learn to disentangle semantics and syntax are expected to yield more robust performance on datasets with high syntactic variation. We consider the task of paraphrase detection on Quora Question Pairs (Iyer et al., 2017) dev set as a testbed for evaluating model robustness. We categorize paraphrase pairs based on whether they share the same top-level constituents. We randomly sample 1,000 paraphrase pairs from each of the two classes, combined with a common set of 1,000 randomly sampled non-paraphrase pairs, to create two datasets QQP-Easy and QQP-Hard. Paraphrase pairs from QQP-Hard are generally harder to identify as they are much more syntactically different compared to those from QQP-Easy. Table 3 shows some examples from these two datasets. We evaluate semantic sentence embeddings on these datasets in an unsupervised manner by computing the cosine similarity as the semantic similarity measure. We search for the best threshold between -1 and 1 with a step size of 0.01 on each dataset, and report the highest accuracy. The results are shown in Table 4.
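The threshold sweep described above can be sketched as follows (a hypothetical helper; its inputs are per-pair cosine similarities and binary paraphrase labels):

```python
def best_threshold_accuracy(similarities, labels, step=0.01):
    """Sweep thresholds in [-1, 1] with the given step and return the
    highest paraphrase-detection accuracy, predicting 'paraphrase'
    whenever the cosine similarity meets the threshold."""
    best = 0.0
    n = len(labels)
    t = -1.0
    while t <= 1.0:
        correct = sum((s >= t) == bool(y) for s, y in zip(similarities, labels))
        best = max(best, correct / n)
        t = round(t + step, 10)  # rounding avoids float drift in the sweep
    return best
```

Because the threshold is tuned per dataset, this protocol reports the best accuracy each embedding model can achieve without any supervised training.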
While Universal Sentence Encoder scores much higher than other models on QQP-Easy, its performance degrades significantly on QQP-Hard. In comparison, ParaBART demonstrates better robustness against syntactic variation, and surpasses USE to become the best model on the more syntactically diverse QQP-Hard. It is worth mentioning that even pre-trained BART embeddings give decent results on QQP-Easy, suggesting large overlaps between paraphrase pairs in QQP-Easy. On the other hand, the poor performance of pre-trained BART embeddings on a more syntactically diverse dataset like QQP-Hard clearly shows their inadequacy as semantic sentence embeddings.

Conclusion
In this paper, we present ParaBART, a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings from pre-trained language models. Experiments show that our semantic sentence embeddings yield strong performance on unsupervised semantic similarity tasks. Further investigation demonstrates the effectiveness of disentanglement, and robustness of our semantic sentence embeddings against syntactic variation on downstream semantic tasks.

Acknowledgments
We thank the anonymous reviewers for their helpful feedback. We thank the UCLA-NLP group for the valuable discussions and comments. This work is supported in part by an Amazon Research Award.

Ethics Considerations
Our sentence embeddings can potentially capture bias reflective of the training data we use, which is a common problem for models trained on large annotated datasets. While the focus of our work is to disentangle semantics and syntax, our model can potentially generate offensive or biased content learned from training data if it is used for paraphrase generation. We suggest carefully examining the potential bias exhibited in our models before deploying them in any real-world applications.