Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification

While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque. In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not. We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it. This result motivates us to replace the reconstruction-based objective of Zhang et al. (2017) with our sentence content probe objective in a semi-supervised setting. Despite its simplicity, our objective improves over paragraph reconstruction in terms of (1) downstream classification accuracies on benchmark datasets, (2) faster training, and (3) better generalization ability.


Introduction
Methods that embed a paragraph into a single vector have been successfully integrated into many NLP applications, including text classification (Zhang et al., 2017), document retrieval (Le and Mikolov, 2014), and semantic similarity and relatedness (Dai et al., 2015; Chen, 2017). However, downstream performance provides little insight into the kinds of linguistic properties that are encoded by these embeddings. Inspired by the growing body of work on sentence-level linguistic probe tasks (Adi et al., 2017; Conneau et al., 2018), we set out to evaluate a state-of-the-art paragraph embedding method using a probe task to measure how well it encodes the identity of the sentences within a paragraph. We discover that the method falls short of capturing this basic property, and that implementing a simple objective to fix this issue improves classification performance, training speed, and generalization ability.
We specifically investigate the paragraph embedding method of Zhang et al. (2017), which consists of a CNN-based encoder-decoder model (Sutskever et al., 2014) paired with a reconstruction objective to learn powerful paragraph embeddings that are capable of accurately reconstructing long paragraphs. This model significantly improves downstream classification accuracies, outperforming LSTM-based alternatives (Li et al., 2015).
How well do these embeddings encode whether or not a given sentence appears in the paragraph? Conneau et al. (2018) show that such identity information is correlated with performance on downstream sentence-level tasks. We thus design a probe task to measure the extent to which this sentence content property is captured in a paragraph embedding. Surprisingly, our experiments (Section 2) reveal that despite its impressive downstream performance, the model of Zhang et al. (2017) substantially underperforms a simple bag-of-words model on our sentence content probe.
Given this result, it is natural to wonder whether the sentence content property is actually useful for downstream classification. To explore this question, we move to a semi-supervised setting by pre-training the paragraph encoder of Zhang et al.'s (2017) model on either our sentence content objective or its original reconstruction objective, and then optionally fine-tuning it on supervised classification tasks (Section 3). Sentence content significantly improves over reconstruction on standard benchmark datasets both with and without fine-tuning; additionally, this objective is four times faster to train than the reconstruction-based variant. Furthermore, pre-training with sentence content substantially boosts generalization ability: fine-tuning a pre-trained model on just 500 labeled reviews from the Yelp sentiment dataset surpasses the accuracy of a purely supervised model trained on 100,000 labeled reviews.
Our results indicate that incorporating probe objectives into downstream models might help improve both accuracy and efficiency, which we hope will spur more linguistically-informed research into paragraph embedding methods.

Probing paragraph embeddings for sentence content
In this section, we first fully specify our probe task before comparing the model of Zhang et al. (2017), henceforth CNN-R, to a simple bag-of-words model. Somewhat surprisingly, the latter substantially outperforms the former despite its relative simplicity.

CNN-R: This model pairs a CNN encoder with a DCNN decoder as in the original paper. In all experiments, we use their publicly available code.

Bag-of-words (BoW): This model is simply an average of the word vectors learned by a trained CNN-R model. BoW models have been shown to be surprisingly good at sentence-level probe tasks (Adi et al., 2017; Conneau et al., 2018).
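As an illustration, the BoW embedding amounts to averaging word vectors; the toy vocabulary and 3-d embeddings below are placeholders, not the vectors actually learned by CNN-R:

```python
import numpy as np

def bow_embed(tokens, embeddings, dim):
    """Average the word vectors of in-vocabulary tokens.

    embeddings: dict mapping token -> np.ndarray of shape (dim,).
    Out-of-vocabulary tokens are skipped; a paragraph with no
    in-vocabulary tokens maps to the zero vector.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy illustration with a 3-dimensional embedding table.
emb = {"good": np.array([1.0, 0.0, 0.0]),
       "hotel": np.array([0.0, 1.0, 0.0])}
vec = bow_embed(["good", "hotel", "unk"], emb, dim=3)
```

The same averaging is applied whether the input is a candidate sentence or a full paragraph, so both live in the same vector space.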

Probe experimental details
Paragraphs to train our classifiers are extracted from the Hotel Reviews corpus (Li et al., 2015), which has previously been used for evaluating the quality of paragraph embeddings (Li et al., 2015; Zhang et al., 2017). We only consider paragraphs that have at least two sentences. Our dataset has 346,033 training paragraphs, 19,368 validation paragraphs, and 19,350 test paragraphs. The average numbers of sentences per paragraph, tokens per paragraph, and tokens per sentence are 8.0, 123.9, and 15.6, respectively. The vocabulary contains 25,000 tokens. To examine the effect of the embedding dimensionality d on the results, we trained models with d ∈ {100, 300, 500, 700, 900}.
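The paragraph-filtering step described above can be sketched as follows; the period-based sentence splitter is a simplified stand-in, since the original pipeline's tokenizer is not specified in this excerpt:

```python
import re

def split_sentences(paragraph):
    """Naive sentence splitter: breaks on whitespace following
    '.', '!' or '?'. A real pipeline would use a proper sentence
    tokenizer (e.g. NLTK's punkt)."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph.strip())]
    return [s for s in sents if s]

def keep_paragraph(paragraph, min_sentences=2):
    """Keep only paragraphs with at least `min_sentences` sentences,
    mirroring the two-sentence minimum used for the probe dataset."""
    return len(split_sentences(paragraph)) >= min_sentences
```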
Each classifier is a feed-forward neural network with a single 300-d ReLU layer. We use a mini-batch size of 32, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-4, and a dropout rate of 0.5 (Srivastava et al., 2014). We train classifiers for a maximum of 100 epochs with early stopping based on validation performance.
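A minimal sketch of such a probe classifier in PyTorch, assuming the input is the concatenation of a paragraph embedding and a candidate sentence embedding (the exact input featurization is an assumption, not taken from the original code):

```python
import torch
import torch.nn as nn

class ProbeClassifier(nn.Module):
    """Feed-forward probe: one 300-d ReLU hidden layer with dropout,
    followed by a binary (in-paragraph vs. not) output."""
    def __init__(self, input_dim, hidden_dim=300, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, x):
        return self.net(x)

# A 300-d paragraph embedding concatenated with a 300-d sentence embedding.
model = ProbeClassifier(input_dim=600)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
logits = model(torch.randn(32, 600))  # one mini-batch of 32 examples
```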

BoW outperforms CNN-R on sentence content
Our probe task results are displayed in Figure 1. Interestingly, BoW performs significantly better than CNN-R, achieving an accuracy of 87.2% at 900 dimensions, compared to only 66.4% for CNN-R. We hypothesize that much of BoW's success comes from the fact that it is easier for the model to perform approximate string matches between the candidate sentence and text segments within the paragraph than it is with the highly non-linear representations of CNN-R.

Figure 2: A visualization of our semi-supervised approach. We first train the CNN encoder (shown as two copies with shared parameters) on unlabeled data using our sentence content objective. The encoder is then used for downstream classification tasks.
To investigate this further, we repeat the experiment, but exclude the sentence s+ from the paragraph p during both training and testing. As we would expect (see Table 1), BoW's performance degrades significantly (by 20.6% absolute) with s+ excluded from p, whereas CNN-R experiences a more modest drop (3.6%). While BoW still outperforms CNN-R in this new setting, the dramatic drop in accuracy suggests that it relies much more heavily on low-level matching.

Sentence content improves paragraph classification
Motivated by our probe results, we further investigate whether incorporating the sentence content property into a paragraph encoder can help increase downstream classification accuracies. We propose a semi-supervised approach that pre-trains the encoder of CNN-R using our sentence content objective and optionally fine-tunes it on different classification tasks; a visualization of this procedure can be seen in Figure 2. We compare our approach (henceforth CNN-SC), without and with fine-tuning, against CNN-R, which uses a reconstruction-based objective. We report comparisons on three standard paragraph classification datasets: Yelp Review Polarity (Yelp), DBPedia, and Yahoo! Answers (Yahoo) (Zhang et al., 2015), which are instances of common classification tasks, including sentiment analysis and topic classification. CNN-SC achieves an accuracy of 90.0%, outperforming CNN-R by a large margin. Additionally, sentence content is four times as fast to train as the computationally-expensive reconstruction objective.
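As a sketch of what a sentence content training objective might look like, the following assumes a binary logistic loss over a paragraph embedding paired with a positive (in-paragraph) or negative (out-of-paragraph) sentence embedding; `sentence_content_loss` and the toy linear scorer are illustrative names, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def sentence_content_loss(p_emb, s_pos_emb, s_neg_emb, classifier):
    """Binary logistic loss: the classifier should say 'yes' for a
    sentence taken from the paragraph and 'no' for one sampled from
    a different paragraph. Both sentence embeddings are assumed to
    come from the same (shared) encoder as the paragraph embedding."""
    pos_logit = classifier(torch.cat([p_emb, s_pos_emb]))
    neg_logit = classifier(torch.cat([p_emb, s_neg_emb]))
    return (F.binary_cross_entropy_with_logits(pos_logit,
                                               torch.ones_like(pos_logit))
            + F.binary_cross_entropy_with_logits(neg_logit,
                                                 torch.zeros_like(neg_logit)))

# Toy usage: a linear scorer over concatenated 4-d embeddings.
scorer = torch.nn.Linear(8, 1)
loss = sentence_content_loss(torch.randn(4), torch.randn(4),
                             torch.randn(4), scorer)
```

Because the loss only requires scoring two short candidate sentences per paragraph rather than decoding the entire paragraph token by token, it is much cheaper than a reconstruction objective, which is consistent with the reported 4x speedup.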
Are representations obtained using these objectives more useful when learned from in-domain data? To examine the effect of the pre-training dataset, we repeat our experiments using paragraph embeddings pre-trained with these objectives on a subset of Wikipedia (560K paragraphs). The second row of Table 3 shows that both approaches suffer a drop in downstream accuracy when pre-trained on out-of-domain data. Interestingly, CNN-SC still performs best, indicating that sentence content is more suitable for downstream classification.

Another advantage of our sentence content objective over reconstruction is that it correlates better with downstream accuracy (see Appendix A.2). For reconstruction, there is no apparent correlation between BLEU and downstream accuracy: while BLEU increases with the number of epochs, the downstream performance quickly begins to decrease. This result indicates that early stopping based on BLEU is not feasible with reconstruction-based pre-training objectives.

We now turn to our fine-tuning experiments. Specifically, we take the CNN encoder pre-trained using our sentence content objective and fine-tune it on downstream classification tasks with supervised labels. While our previous version of CNN-SC created just a single positive/negative pair of examples from a single paragraph, for our fine-tuning experiments we create a pair of examples from every sentence in the paragraph to maximize the training data. For each task, we compare against the original CNN-R model of Zhang et al. (2017); results are shown in Figure 3.

Finally, when all labeled training data is used, CNN-SC achieves higher classification accuracy than CNN-R on all three datasets (Table 4). While CNN-SC exhibits a clear preference for target-task unlabeled data (see Table 3), we can additionally leverage large amounts of unlabeled general-domain data by incorporating pre-trained word representations from language models into CNN-SC. Our results show that further improvements can be achieved by training the sentence content objective on top of the pre-trained language model representations from ULMFiT (Howard and Ruder, 2018) (see Appendix A.3), indicating that our sentence content objective learns complementary information. On Yelp, it exceeds the performance of training from scratch on the whole labeled data (560K examples) with only 0.1% of the labeled data.

With fine-tuning, CNN-SC substantially boosts accuracy and generalization
CNN-SC implicitly learns to distinguish between class labels. The substantial difference in downstream accuracy between pre-training on in-domain and out-of-domain data (Table 3) implies that the sentence content objective is implicitly learning to distinguish between class labels (e.g., that a candidate sentence with negative sentiment is unlikely to belong to a paragraph with positive sentiment). If true, this result implies that CNN-SC prefers not only in-domain data but also a representative sample of paragraphs from all class labels. To investigate, we conduct an additional experiment that restricts the class label from which negative sentence candidates s− are sampled. We experiment with two sources of s−: (1) paragraphs with the same class label as the probe paragraph (CNN-SC−), and (2) paragraphs with a different class label (CNN-SC+). Figure 4 reveals that the performance of CNN-SC drops dramatically in the first setting and improves in the second, which confirms our hypothesis.
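The two restricted negative samplers in this experiment can be sketched as below; `paragraphs_by_label` and the mode names are hypothetical stand-ins for the actual dataset fields:

```python
import random

def sample_negative(probe_label, paragraphs_by_label, mode="any"):
    """Sample a negative candidate sentence s-.

    mode="same": s- comes from a paragraph with the SAME label as the
                 probe paragraph (the harder CNN-SC- setting);
    mode="diff": s- comes from a paragraph with a DIFFERENT label
                 (the easier CNN-SC+ setting);
    mode="any":  unrestricted sampling, as in the main experiments.
    """
    if mode == "same":
        pool_labels = [probe_label]
    elif mode == "diff":
        pool_labels = [l for l in paragraphs_by_label if l != probe_label]
    else:
        pool_labels = list(paragraphs_by_label)
    label = random.choice(pool_labels)
    paragraph = random.choice(paragraphs_by_label[label])
    return random.choice(paragraph)  # each paragraph is a list of sentences
```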
Related work
To analyze word and sentence embeddings, recent work has studied classification tasks that probe them for various linguistic properties (Shi et al., 2016; Adi et al., 2017; Belinkov et al., 2017a,b; Conneau et al., 2018; Tenney et al., 2019). In this paper, we extend the notion of probe tasks to the paragraph level.

Conclusions and Future work
In this paper, we evaluate a state-of-the-art paragraph embedding model based on how well it captures the sentence identity within a paragraph. Our results indicate that the model is not fully aware of this basic property, and that implementing a simple objective to fix this issue improves classification performance, training speed, and generalization ability. Future work can investigate other embedding methods with a richer set of probe tasks, or explore a wider range of downstream tasks.

A Appendices
A.1 BoW models outperform more complex models on our sentence content probe
In addition to the paragraph embedding models presented in the main paper, we also experiment with other BoW models that use pre-trained word embeddings or contextualized word representations, including Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and ELMo (Peters et al., 2018). Paragraph embeddings are computed as the average of the word vectors; for ELMo, we take the average of the layers. The results of our sentence content probe task are summarized in Table 5.

A.2 Sentence content better correlates to downstream accuracy than reconstruction
See Figure 5.

A.3 Further improvements by training sentence content on top of pre-trained language model representations
Figure 6 shows that further improvements can be achieved by training sentence content on top of the pre-trained language model representations from ULMFiT (Howard and Ruder, 2018) on the Yelp and IMDB (Maas et al., 2011) datasets, indicating that our sentence content objective learns complementary information. On Yelp, it exceeds the performance of training from scratch on the whole labeled data (560K examples) with only 0.1% of the labeled data.

Figure 1: Probe task accuracies across representation dimensions. BoW surprisingly outperforms the more complex CNN-R.

Figure 4: CNN-SC implicitly learns to distinguish between class labels.

Figure 5: Pre-training performance vs. downstream accuracy on Yelp, measured on validation data. There is no apparent correlation between BLEU and downstream accuracy.

Figure 6: Further improvements can be achieved by training sentence content (SC) on top of the pre-trained language model (LM) representations from ULMFiT (Howard and Ruder, 2018).

Table 1: Probe task accuracies without and with s+ excluded from p, measured at d = 300. BoW's accuracy degrades sharply in the latter case, suggesting that it relies much more on low-level matching.

Table 2: Properties of the text classification datasets used for our evaluations.

Table 2 shows the statistics for each dataset. Paragraphs from each training set, without labels, were used to generate training data for unsupervised pre-training.
Table 3: CNN-SC significantly improves over CNN-R.

Table 4: Comparison against other baseline models that do not use external data, including CNN-R. All baseline models are taken from Zhang et al. (2017).