Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations

Prior work on pretrained sentence embeddings and benchmarks focuses on the capabilities of stand-alone sentences. We propose DiscoEval, a test suite of tasks to evaluate whether sentence representations include broader context information. We also propose a variety of training objectives that make use of natural annotations from Wikipedia to build sentence encoders capable of modeling discourse. We benchmark sentence encoders pretrained with our proposed training objectives, as well as other popular pretrained sentence encoders, on DiscoEval and other sentence evaluation tasks. Empirically, we show that these training objectives help to encode different aspects of information in document structures. Moreover, BERT and ELMo demonstrate strong performance on DiscoEval, with individual hidden layers showing different characteristics.


Introduction
Pretrained sentence representations have been found useful in various downstream tasks such as visual question answering (Tapaswi et al., 2016), script inference (Pichotta and Mooney, 2016), and information retrieval (Le and Mikolov, 2014; Palangi et al., 2016). Benchmark datasets (Adi et al., 2017; Conneau and Kiela, 2018; Wang et al., 2018a, 2019) have been proposed to evaluate the encoded knowledge, where the focus has been primarily on natural language understanding capabilities of the representation of a stand-alone sentence, such as its semantic roles, rather than the broader context in which it is situated.
Figure 1: An RST discourse tree from the RST Discourse Treebank. "N" represents "nucleus", containing basic information for the relation. "S" represents "satellite", containing additional information about the nucleus.
In this paper, we seek to incorporate and evaluate discourse knowledge in general-purpose sentence representations. A discourse is a coherent, structured group of sentences that acts as a fundamental type of structure in natural language (Jurafsky and Martin, 2009). A discourse structure is often characterized by the arrangement of semantic elements across multiple sentences, such as entities and pronouns. The simplest such arrangement (i.e., linearly-structured) can be understood as sentence ordering, where the structure is manifested in the timing of introducing entities. Deeper discourse structures use more complex relations among sentences (e.g., tree-structured; see Figure 1).
Theoretically, discourse structures have been approached through Centering Theory (Grosz et al., 1995) for studying distributions of entities across text and Rhetorical Structure Theory (RST; Mann and Thompson, 1988) for modelling the logical structure of natural language via discourse trees. Researchers have found modelling discourse useful in a range of tasks (Guzmán et al., 2014; Narasimhan and Barzilay, 2015; Liu and Lapata, 2018; Pan et al., 2018), including summarization (Gerani et al., 2014), text classification (Ji and Smith, 2017), and text generation (Bosselut et al., 2018).
We also propose a set of novel multi-task learning objectives building upon standard pretrained sentence encoders, which rely on the assumption of distributional semantics of text. These objectives depend only on the natural structure in structured document collections like Wikipedia.
Empirically, we benchmark our models and several popular sentence encoders on DiscoEval and SentEval (Conneau and Kiela, 2018). We find that our proposed training objectives help the models capture different characteristics in the sentence representations. Additionally, we find that ELMo shows strong performance on SentEval, whereas BERT performs the best among the pretrained embeddings on DiscoEval. Both BERT and Skip-thought vectors (Kiros et al., 2015), which have training losses explicitly related to surrounding sentences, perform much better than their respective prior work, demonstrating the effectiveness of incorporating losses that make use of broader context. Through per-layer analysis, we also find that for both BERT and ELMo, deep layers consistently outperform shallower ones on DiscoEval, showing different trends from SentEval, where the shallow layers have the best performance.
There is a great deal of prior work on pretrained representations (Le and Mikolov, 2014; Kiros et al., 2015; Hill et al., 2016; Wieting et al., 2016; McCann et al., 2017; Gan et al., 2017; Peters et al., 2018a; Logeswaran and Lee, 2018; Devlin et al., 2019; Tang and de Sa, 2019; Yang et al., 2019; Liu et al., 2019b, inter alia). Skip-thought vectors form an effective architecture for general-purpose sentence embeddings. The model encodes a sentence to a vector representation, and then predicts the previous and next sentences in the discourse context. Since Skip-thought performs well in downstream evaluation tasks, we use this neighboring-sentence objective as a starting point for our models.
There is also work on incorporating discourse-related objectives into the training of sentence representations. Jernite et al. (2017) propose binary sentence ordering, conjunction prediction (requiring manually-defined conjunction groups), and next sentence prediction. Similarly, Sileo et al. (2019) and Nie et al. (2019) create training datasets automatically based on discourse relations provided in the Penn Discourse Treebank (PDTB; Lin et al., 2009).
Our work differs from prior work in that we propose a general-purpose pretrained sentence embedding evaluation suite that covers multiple aspects of discourse knowledge and we propose novel training signals based on document structure, including sentence position and section titles, without requiring additional human annotation.

Discourse Evaluation
We propose DiscoEval, a test suite of 7 tasks to evaluate whether sentence representations include semantic information relevant to discourse processing. Below we describe the tasks and datasets, as well as the evaluation framework. We closely follow the SentEval sentence embedding evaluation suite, in particular its supervised sentence and sentence pair classification tasks, which use predefined neural architectures with slots for fixed-dimensional sentence embeddings. All DiscoEval tasks are modelled by logistic regression unless otherwise stated in later sections.
We also experimented with adding hidden layers to the DiscoEval classification models. However, we find that simpler linear classifiers provide a clearer comparison among sentence embedding methods. More complex classification models lead to noisier results, as more of the modelling burden is shifted to the optimization of the classifiers. Hence we decide to evaluate the sentence embeddings with simple classification models.
In the rest of this section, we will use $[\cdot, \cdot, \cdots]$ to denote concatenation of vectors, $\odot$ for element-wise multiplication, and $|\cdot|$ for element-wise absolute value.

Discourse Relations
As the most direct way to probe discourse knowledge, we consider the task of predicting annotated discourse relations among sentences. We use two human-annotated datasets: the RST Discourse Treebank (RST-DT; Carlson et al., 2001) and the Penn Discourse Treebank (PDTB; Prasad et al., 2008). They have different labeling schemes. PDTB provides discourse markers for adjacent sentences, whereas RST-DT offers document-level discourse trees, which were recently used to evaluate discourse knowledge encoded in document-level models (Ferracane et al., 2019). The difference allows us to see if the pretrained representations capture local or global information about discourse structure.
More specifically, as shown in Figure 1, in RST-DT, text is segmented into basic units, elementary discourse units (EDUs), upon which a discourse tree is built recursively. Although a relation can take multiple units, we follow prior work (Ji and Eisenstein, 2014) to use right-branching trees for non-binary relations to binarize the tree structure and use the 18 coarse-grained relations defined by Carlson et al. (2001).
When evaluating pretrained sentence encoders on RST-DT, we first encode EDUs into vectors, then use the averaged vectors of the EDUs in a subtree as the representation of that subtree. The target prediction is the label of nodes in discourse trees and the input to the classifier is $[x_{\text{left}}, x_{\text{right}}, x_{\text{left}} \odot x_{\text{right}}, |x_{\text{left}} - x_{\text{right}}|]$, where $x_{\text{left}}$ and $x_{\text{right}}$ are vector representations of the left and right subtrees respectively. For example, the input for target "NN-Attribution" in Figure 1 would be $x_{\text{left}} = \frac{x_1 + x_2}{2}$, $x_{\text{right}} = x_3$, where $x_i$ is the encoded representation for the $i$th EDU in the text. We use the standard data splits, where there are 347 documents for training and 38 documents for testing. We choose 35 documents from the training set to serve as a validation set.

Figure 2 example:
1. In any case, the brokerage firms are clearly moving faster to create new ads than they did in the fall of 1987.
2. [But] it remains to be seen whether their ads will be any more effective.
Label: Comparison.Contrast
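As a minimal sketch of the RST-DT classifier input described above, the following builds the concatenated feature vector from averaged EDU embeddings. The function names and the toy 2-dimensional embeddings are illustrative assumptions, not part of the released DiscoEval code.

```python
def average_vectors(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pair_features(x_left, x_right):
    """Concatenate both vectors with their element-wise product and absolute difference."""
    if len(x_left) != len(x_right):
        raise ValueError("both vectors must have the same dimensionality")
    product = [a * b for a, b in zip(x_left, x_right)]
    abs_diff = [abs(a - b) for a, b in zip(x_left, x_right)]
    return list(x_left) + list(x_right) + product + abs_diff


# Toy EDU embeddings (made-up numbers, for illustration only).
x1, x2, x3 = [1.0, 2.0], [3.0, 4.0], [0.5, -1.0]

x_left = average_vectors([x1, x2])  # left subtree spans EDUs 1 and 2
x_right = x3                        # right subtree is the single EDU 3
features = pair_features(x_left, x_right)
```

The resulting vector has four times the embedding dimensionality, which keeps the classifier linear while still exposing simple interaction terms between the two subtrees.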
For PDTB, we use a pair of sentences to predict discourse relations. Following Lin et al. (2009), we focus on two kinds of relations from PDTB: explicit (PDTB-E) and implicit (PDTB-I). The sentence pairs with explicit relations are two consecutive sentences with a particular connective word in between. Figure 2 is an example of an explicit relation.
In the PDTB, annotators insert an implicit connective between adjacent sentences to reflect their relations, if such an implicit relation exists. Figure 3 shows an example of an implicit relation. The PDTB provides a three-level hierarchy of relation tags. In DiscoEval, we use the second level of types (Lin et al., 2009), as they provide finer semantic distinctions compared to the first level. To ensure there is a reasonable amount of evaluation data, we use sections 2-14 as the training set, 15-18 as the development set, and 19-23 as the test set. In addition, we filter out categories that have fewer than 10 instances. This leaves us with 12 categories for explicit relations and 11 for implicit ones. Category names are listed in the supplementary material.
We use the sentence embeddings to infer sentence relations with supervised training. As input to the classifier, we encode both sentences to vector representations $x_1$ and $x_2$, concatenated with their element-wise product and absolute difference: $[x_1, x_2, x_1 \odot x_2, |x_1 - x_2|]$.

Figure 4 example:
- She was excited thinking she must have lost weight.
- Bonnie hated trying on clothes.
- Then she realized they actually size 14s, and 12s.
- She picked up a pair of size 12 jeans from the display.
- When she tried them on they were too big!

Sentence Position (SP)
We create a task that we call Sentence Position. It can be seen as a way to probe the knowledge of linearly-structured discourse, where the ordering corresponds to the timings of events. When constructing this dataset, we take five consecutive sentences from a corpus, randomly move one of these five sentences to the first position, and ask models to predict the true position of the first sentence in the modified sequence.
We create three versions of this task, one for each of the following three domains: the first five sentences of the introduction section of a Wikipedia article (Wiki), the ROC Stories corpus (ROC; Mostafazadeh et al., 2016), and the first five sentences in the abstracts of arXiv papers (arXiv; Chen et al., 2016). Figure 4 shows an example of this task for the ROC Stories domain. The first sentence should be in the fourth position among these sentences. To make correct predictions, the model needs to be aware of both typical orderings of events as well as how events are described in language. In the example shown, Bonnie's excitement comes from her imagination, so it must happen after she picked up the jeans and tried them on but right before she realized the actual size.
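The construction of a Sentence Position instance can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the label is taken to be the 1-based original position, and the four remaining sentences are assumed to keep their relative order.

```python
import random


def make_sp_instance(sentences, rng):
    """Move one of five consecutive sentences to the front.

    Returns the modified sequence and the 1-based original position
    of the sentence that now comes first (the prediction target).
    """
    sentences = list(sentences)
    if len(sentences) != 5:
        raise ValueError("Sentence Position uses exactly five sentences")
    k = rng.randrange(5)  # 0-based index of the sentence to move
    moved = [sentences[k]] + sentences[:k] + sentences[k + 1:]
    return moved, k + 1


rng = random.Random(0)  # fixed seed so the sketch is reproducible
story = ["s1", "s2", "s3", "s4", "s5"]
moved, label = make_sp_instance(story, rng)
```

Because the moved sentence is chosen uniformly, a classifier that ignores its input would be expected to reach about 20% accuracy on this five-way task.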
To train classifiers for these tasks, we do the following. We first encode the five sentences to vector representations $x_i$. As input to the classifier, we include $x_1$ and the concatenation of

Binary Sentence Ordering (BSO)
Similar to sentence position prediction, Binary Sentence Ordering (BSO) is a binary classification task to determine the order of two sentences. The fact that BSO only has a pair of sentences as input makes it different from Sentence Position, where there is more context, and we hope that BSO can evaluate the ability of capturing local discourse coherence in the given sentence representations. The data comes from the same three domains as Sentence Position, and each instance is a pair of consecutive sentences.

Figure 5 example:
1. These functions include fast and synchronized response to environmental change, or long-term memory about the transcriptional status.
2. Focusing on the collective behaviors on a population level, we explore potential regulatory functions this model can offer.

Figure 5 shows an example from the arXiv domain of the Binary Sentence Ordering task. The order of the sentences in this instance is incorrect, as the "functions" are referenced before they are introduced. To detect the incorrect ordering in this example, the encoded representations need to be able to provide information about new and old information in each sentence.
To form the input when training classifiers, we concatenate the embeddings of both sentences with their element-wise difference: $[x_1, x_2, x_1 - x_2]$.
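A minimal sketch of a Binary Sentence Ordering instance and its classifier input follows. The 50/50 swap and the label convention (1 for the original order, 0 for the swapped order) are assumptions made for illustration, as are the function names.

```python
import random


def make_bso_instance(first, second, rng):
    """Return a sentence pair and a binary label for two consecutive sentences."""
    if rng.random() < 0.5:
        return (first, second), 1  # original order kept
    return (second, first), 0      # order swapped


def bso_features(x1, x2):
    """Concatenate both embeddings with their element-wise difference."""
    if len(x1) != len(x2):
        raise ValueError("both embeddings must have the same dimensionality")
    diff = [a - b for a, b in zip(x1, x2)]
    return list(x1) + list(x2) + diff


pair, label = make_bso_instance("sentence A", "sentence B", random.Random(1))
features = bso_features([1.0, 2.0], [0.5, 4.0])  # toy embeddings
```

Using the signed difference rather than the absolute difference keeps the input asymmetric in the two sentences, which an order-sensitive task needs.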

Discourse Coherence (DC)
Inspired by prior work on chat disentanglement (Elsner and Charniak, 2008, 2010) and sentence clustering (Wang et al., 2018b), we propose a sentence disentanglement task. The task is to determine whether a sequence of six sentences forms a coherent paragraph. We start with a coherent sequence of six sentences, then randomly replace one of the sentences (chosen uniformly among positions 2-5) with a sentence from another discourse. This task, which we call Discourse Coherence (DC), is a binary classification task and the datasets are balanced between positive and negative instances.
We use data from two domains for this task: Wikipedia and the Ubuntu IRC channel. For Wikipedia, we begin by choosing a sequence of six sentences from a Wikipedia article. For purposes of choosing difficult distractor sentences, we use the Wikipedia categories of each document as an indication of its topic. To create a negative instance, we randomly sample a sentence from another document with a similar set of categories (measured by the percentage of overlapping categories). This sampled sentence replaces one of the six consecutive sentences in the original sequence. When splitting the train, development, and test sets, we ensure there are no overlapping documents among them.

Figure 6 example:
1. It is possible he was the youngest of the family as the name "Sextus" translates to sixth in English implying he was the sixth of two living and three stillborn brothers.
2. According to Roman tradition, his rape of Lucretia was the precipitating event in the overthrow of the monarchy and the establishment of the Roman Republic.
3. Tarquinius Superbus was besieging Ardea, a city of the Rutulians.
4. The place could not be taken by force, and the Roman army lay encamped beneath the walls.
5. He was soon elected to the Academy's membership (although he had to wait until 1903 to be elected to the Society of American Artists), and in 1883 he opened a New York studio, dividing his time for several years between Manhattan and Boston.
6. As nothing was happening in the field, they mounted their horses to pay a surprise visit to their homes.
Our proposed dataset differs from the sentence clustering task of Wang et al. (2018b) in that it preserves sentence order and does not anonymize or lemmatize words, because they play an important role in conveying information about discourse coherence.
For the Ubuntu domain, we use the human annotations of conversation thread structure from Kummerfeld et al. (2019) to provide us with a coherent sequence of utterances. We filter out sentences by heuristic rules to avoid overly technical and unsolvable cases. The negative sentence is randomly picked from other conversations. Similarly, when splitting the train, development, and test sets, we ensure there are no overlapping conversations among them.
Figure 6 is an instance of the Wikipedia domain of the Discourse Coherence task. This instance is not coherent and the boldfaced text is from a different document. The incoherence can be found either by comparing characteristics of the entity being discussed or by the topic of the sentence group. Solving this task is non-trivial as it may require the ability to perform inference across multiple sentences.
In this task, we encode all sentences to vector representations and concatenate all of them ($[x_1, x_2, x_3, x_4, x_5, x_6]$) as input to the classification model. Note that in this task, we use a hidden layer of 2000 dimensions with sigmoid activation in the classification model, as this is necessary for the classifier to use features based on multiple inputs simultaneously given the simple concatenation as input. We could have developed richer ways to encode the input so that a linear classifier would be feasible (e.g., use the element-wise products of all pairs of sentence embeddings), but we wish to keep the input dimensionality of the classifier small enough that the classifier will be learnable given fixed sentence embeddings and limited training data.
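The Discourse Coherence construction and its concatenated classifier input can be sketched as below. This is an illustrative sketch: the function names are hypothetical, a coin flip stands in for however the balanced positive and negative sets were actually produced, and the distractor is passed in rather than sampled by category overlap.

```python
import random


def make_dc_instance(paragraph, distractor, rng):
    """Return six sentences and a label (1 = coherent, 0 = one sentence replaced)."""
    paragraph = list(paragraph)
    if len(paragraph) != 6:
        raise ValueError("Discourse Coherence uses exactly six sentences")
    if rng.random() < 0.5:
        return paragraph, 1
    position = rng.randint(2, 5)  # 1-based position, both ends inclusive
    paragraph[position - 1] = distractor
    return paragraph, 0


def concat_embeddings(vectors):
    """Flatten a list of sentence embeddings into one classifier input."""
    return [value for vector in vectors for value in vector]


paragraph = ["a1", "a2", "a3", "a4", "a5", "a6"]
sequence, label = make_dc_instance(paragraph, "other-doc sentence", random.Random(3))
classifier_input = concat_embeddings([[0.1, 0.2]] * 6)  # toy 2-d embeddings
```

Since the first and last sentences are never replaced, every negative instance keeps the original paragraph boundaries and differs from a coherent one in exactly one interior position.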

Sentence Section Prediction (SSP)
The Sentence Section Prediction (SSP) task is defined as determining the section of a given sentence. The motivation behind this task is that sentences within certain sections typically exhibit similar patterns because of the way people write coherent text. The pattern can be found based on connectives or specificity of a sentence. For example, "Empirically" is usually used in the abstract or introduction sections in scientific writing. We construct the dataset from PeerRead (Kang et al., 2018), which consists of scientific papers from a variety of fields. The goal is to predict whether or not a sentence belongs to the Abstract section. After eliminating sentences that are too easy for the task (e.g., equations), we randomly sample sentences from the Abstract or from a section in the middle of a paper. Figure 7 shows two sentences from this task, where the first sentence is more general and from an Abstract whereas the second is more specific and is from another section. In this task, the input to the classifier is simply the sentence embedding.
Table 1 shows the number of instances in each DiscoEval task introduced above.
Having described DiscoEval, we now discuss methods for incorporating discourse information into sentence embedding training. All models in our experiments are composed of a single encoder and multiple decoders. The encoder, parameterized by a bidirectional Gated Recurrent Unit (Bi-GRU; Chung et al., 2014), encodes the sentence, either in training or in evaluation of the downstream tasks, to a fixed-length vector representation (i.e., the average of the hidden states across positions).
The decoders take the aforementioned encoded sentence representation, and predict the targets we define in the sections below. We first introduce Neighboring Sentence Prediction, the loss for our baseline model. We then propose additional training losses to encourage our sentence embeddings to capture other context information.

Neighboring Sentence Prediction (NSP)
Similar to prior work on sentence embeddings (Kiros et al., 2015; Hill et al., 2016), we use an encoded sentence representation to predict its surrounding sentences. In particular, we predict the immediately preceding and succeeding sentences. All of our sentence embedding models use this loss. Formally, the loss is defined as
\[ \mathrm{NSP} = -\log p_\theta(s_{t-1} \mid s_t) - \log p_\phi(s_{t+1} \mid s_t) \]
where we parameterize $p_\theta$ and $p_\phi$ as separate feedforward neural networks and compute the log-probability of a target sentence using its bag-of-words representation.
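One way to read the bag-of-words decoding in this loss is sketched below. It assumes that each decoder emits a single vector of vocabulary logits from the sentence embedding and that the log-probability of a target sentence is the sum of the log-probabilities of its words; both are assumptions made for illustration, and the toy vocabulary and logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def bow_neg_log_prob(vocab_logits, target_word_ids):
    """Negative log-probability of a sentence treated as a bag of words."""
    log_probs = log_softmax(vocab_logits)
    return -sum(log_probs[w] for w in target_word_ids)


def nsp_loss(prev_logits, next_logits, prev_word_ids, next_word_ids):
    """Sum of the losses for the preceding and the succeeding sentence."""
    return (bow_neg_log_prob(prev_logits, prev_word_ids)
            + bow_neg_log_prob(next_logits, next_word_ids))


# Toy vocabulary of 4 words; uniform logits stand in for decoder outputs.
uniform = [0.0, 0.0, 0.0, 0.0]
loss = nsp_loss(uniform, uniform, prev_word_ids=[0, 2], next_word_ids=[1, 3])
```

In a real model the two logit vectors would come from two separate feedforward decoders applied to the same sentence embedding.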

Nesting Level (NL)
A table of contents serves as a high-level description of an article, outlining its organizational structure. Wikipedia articles, for example, contain rich tables of contents with many levels of hierarchical structure. The "nesting level" of a sentence (i.e., how many levels deep it resides) provides information about its role in the overall discourse. To encode this information into our sentence representations, we introduce a discriminative loss to predict a sentence's nesting level in the table of contents:
\[ \mathrm{NL} = -\log p_\theta(l_t \mid s_t) \]
where $l_t$ represents the nesting level of the sentence $s_t$ and $p_\theta$ is parameterized by a feedforward neural network. Note that sentences within the same paragraph share the same nesting level. In Wikipedia, there are up to 7 nesting levels.
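Treating nesting-level prediction as a 7-way classification problem, the loss for one sentence could be computed as in the sketch below. The softmax cross-entropy form and the 0-based level indexing are assumptions; the logits are placeholders for the output of the feedforward network.

```python
import math

NUM_LEVELS = 7  # Wikipedia has up to 7 nesting levels


def nesting_level_loss(level_logits, gold_level):
    """Softmax cross-entropy for a single sentence (gold_level is 0-based)."""
    if len(level_logits) != NUM_LEVELS:
        raise ValueError("expected one logit per nesting level")
    peak = max(level_logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in level_logits))
    return -(level_logits[gold_level] - log_z)


# Placeholder logits: an uninformed predictor and a confident correct one.
uninformed = nesting_level_loss([0.0] * NUM_LEVELS, gold_level=2)
confident = nesting_level_loss([0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0], gold_level=2)
```

An uninformed predictor pays log(7) per sentence, so any value clearly below that indicates the embedding carries some nesting-level signal.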

Sentence and Paragraph Position (SPP)
Similar to nesting level, we add a loss based on using the sentence representation to predict its position in the paragraph and in the article. The position of the sentence can be a strong indication of the relations between the topics of the current sentence and the topics in the entire article. For example, the first several sentences often cover the general topics to be discussed more thoroughly in the following sentences. To encourage our sentence embeddings to capture such information, we define a position prediction loss
\[ \mathrm{SPP} = -\log p_\theta(sp_t \mid s_t) - \log p_\phi(pp_t \mid s_t) \]
where $sp_t$ is the sentence position of $s_t$ within the current paragraph and $pp_t$ is the position of the current paragraph in the whole document.

Section and Document Title (SDT)
Unlike the previous position-based losses, this loss makes use of section and document titles, which gives the model more direct access to the topical information at different positions in the document. The loss is defined as
\[ \mathrm{SDT} = -\log p_\theta(st_t \mid s_t) - \log p_\phi(dt_t \mid s_t) \]
where $st_t$ is the section title of sentence $s_t$, $dt_t$ is the document title of sentence $s_t$, and $p_\theta$ and $p_\phi$ are two different bag-of-words decoders.

Setup
We train our models on Wikipedia as it is a knowledge-rich textual resource and has consistent structures over all documents. Details on hyperparameters are in the supplementary material. When evaluating on DiscoEval, we encode sentences with pretrained sentence encoders. Following SentEval, we freeze the sentence encoders and only learn the parameters of the downstream classifier. The "Baseline" row in Table 2 shows embeddings trained with only the NSP loss. The subsequent rows are trained with extra losses defined in Section 4 in addition to the NSP loss. Additionally, we benchmark several popular pretrained sentence encoders on DiscoEval, including Skip-thought, InferSent (Conneau et al., 2017), DisSent (Nie et al., 2019), ELMo, and BERT. For ELMo, we use the averaged vector of all three layers and time steps as the sentence representations. For BERT, we use the averaged vector at the position of the "[CLS]" token across all layers. We also evaluate per-layer performance for both models in Section 6.
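The two pooling schemes for ELMo and BERT can be sketched as follows. The nested-list layout `layers[layer][time_step]` and the assumption that "[CLS]" sits at index 0 are illustrative choices, and the toy activations are made up.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def elmo_sentence_vector(layers):
    """Average over every layer and every time step."""
    return mean_vector([vec for layer in layers for vec in layer])


def bert_sentence_vector(layers, cls_index=0):
    """Average the vector at the [CLS] position across all layers."""
    return mean_vector([layer[cls_index] for layer in layers])


# Toy activations: 2 layers, 2 time steps, 2 dimensions.
layers = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
elmo_vec = elmo_sentence_vector(layers)
bert_vec = bert_sentence_vector(layers)
```

Both schemes yield a fixed-dimensional vector regardless of sentence length, which is what the frozen-encoder evaluation setup requires.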

Results
Table 2 shows the experiment results over all SentEval and DiscoEval tasks. Different models and training signals have complex effects when performing various downstream tasks. We summarize our findings below:
• On DiscoEval, Skip-thought performs best on RST-DT. DisSent performs strongly for PDTB tasks but it requires discourse markers from PDTB for generating training data. BERT has the highest average by a large margin, but ELMo has competitive performance on multiple tasks.
• The NL or SPP loss alone has complex effects across tasks in DiscoEval, but when they are combined, the model achieves the best performance, outperforming our baseline by 0.6% on average. In particular, it yields 39.3% accuracy on PDTB-I, outperforming Skip-thought by 0.6%. This is presumably caused by the differing, yet complementary, effects of these two losses (NL and SPP).
• The SDT loss generally hurts performance on DiscoEval, especially on the position-related tasks (SP, BSO). This can be explained by the notion that consecutive sentences in the same section are encouraged to have the same sentence representations when using the SDT loss. However, the SP and BSO tasks involve differentiating neighboring sentences in terms of their position and ordering information.

Analysis
Per-Layer analysis. To investigate the performance of individual hidden layers, we evaluate ELMo and BERT on both SentEval and DiscoEval using each hidden layer. For ELMo, we use the averaged vector from the targeted layer. For BERT-Base, we use the vector from the position of the "[CLS]" token. Figure 8 shows the heatmap of performance for individual hidden layers. We note that for better visualization, colors in each column are standardized. On SentEval, BERT-Base performs better with shallow layers on USS, SSS, and Probing (though not on SC), but on DiscoEval, the results using BERT-Base gradually increase with deeper layers. To evaluate this phenomenon quantitatively, we compute the average of the layer number for the best layers for both ELMo and BERT-Base and show it in Table 3.
From the table, we can see that DiscoEval requires deeper layers to achieve better performance.We assume this is because deeper layers can capture higher-level structure, which aligns with the information needed to solve the discourse tasks.
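The "average of the layer number for the best layers" statistic can be computed as in the sketch below. The 1-based layer numbering, the tie-breaking toward the shallowest layer, and the task names and scores are all assumptions made for illustration; none of the numbers come from the paper's tables.

```python
def best_layer(scores_by_layer):
    """Return the 1-based index of the best layer (shallowest one on ties)."""
    best_index = max(range(len(scores_by_layer)), key=lambda i: scores_by_layer[i])
    return best_index + 1


def average_best_layer(task_scores):
    """Average the best-layer number over all tasks in a benchmark."""
    layers = [best_layer(scores) for scores in task_scores.values()]
    return sum(layers) / len(layers)


# Made-up per-layer accuracies for two hypothetical tasks and a 3-layer model.
toy_scores = {
    "task_a": [60.0, 62.5, 61.0],  # best at layer 2
    "task_b": [40.0, 41.0, 45.5],  # best at layer 3
}
avg_layer = average_best_layer(toy_scores)
```

A larger value means the benchmark's tasks tend to be solved best by deeper layers, which is the comparison drawn between SentEval and DiscoEval.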
DiscoEval architectures. In all DiscoEval tasks except DC, we use no hidden layer in the neural architectures, following the example of SentEval. However, some tasks are unsolvable with this simple architecture. In particular, the DC tasks have low accuracies with all models unless a hidden layer is used. As shown in Table 4, when adding a hidden layer of 2000 dimensions to this task, the performance on DC improves dramatically. This shows that DC requires more complex comparison and inference among input sentences. Our human evaluation below on DC also shows that human accuracies exceed those of the classifier based on sentence embeddings by a large margin.
Human Evaluation. We conduct a human evaluation on the Sentence Position, Binary Sentence Ordering, and Discourse Coherence datasets. A native English speaker was provided with 50 examples per domain for these tasks. While the results in Table 5 show that the overall human accuracies exceed those of the classifier based on BERT-Large by a large margin, we observe that within some specific domains, for example Wiki in BSO, BERT-Large demonstrates very strong performance.
Does context matter in Sentence Position? In the SP task, the inputs are the target sentence together with 4 surrounding sentences. We study the effect of removing the surrounding 4 sentences, i.e., only using the target sentence to predict its position from the start of the paragraph. Table 6 shows the comparison of the baseline model performance on Sentence Position with or without the surrounding sentences and a random baseline. Since our baseline model is already trained with NSP, it is expected to see improvements over a random baseline. The further improvement from using surrounding sentences demonstrates that the context information is helpful in determining the sentence position.

Conclusion
We proposed DiscoEval, a test suite of tasks to evaluate discourse-related knowledge encoded in pretrained sentence representations. We also proposed a variety of training objectives to strengthen encoders' ability to incorporate discourse information. We benchmarked several pretrained sentence encoders and demonstrated the effects of the proposed training objectives on different tasks. While our learning criteria showed benefit on certain classes of tasks, our hope is that the DiscoEval evaluation suite can inspire additional research in capturing broad discourse context in fixed-dimensional sentence embeddings.

Figure 2: Example in the PDTB explicit relation task. The words in [] are taken out from input sentence 2.

Figure 4: Example from the ROC Stories domain of the Sentence Position task. The first sentence should be in the fourth position.

Figure 5: Example from the arXiv domain of the Binary Sentence Ordering task (incorrect ordering shown).

Figure 6: An example from the Wikipedia domain of the Discourse Coherence task. This sequence is not coherent; the boldface sentence was substituted in for the true fifth sentence from another article.

Figure 7: Examples from Sentence Section Prediction. The first is from an Abstract while the second is not.

Table 1: Size of datasets in DiscoEval.

Table 2: Results for SentEval and DiscoEval. The "Baseline" row shows embeddings trained with only the NSP loss; the subsequent rows are trained with extra losses defined in Section 4 in addition to the NSP loss. The highest number in each column is boldfaced. The highest number for our models in each column is underlined. "All" uses all four losses. "avg." is the averaged accuracy for all tasks in DiscoEval.

Table 3: Average of the layer number for the best layers in SentEval and DiscoEval.

Table 4: Accuracies with baseline encoder on Discourse Coherence task, with or without a hidden layer in the classifier.

Table 5: Accuracies (%) for a human annotator and BERT-Large on Sentence Position, Binary Sentence Ordering, and Discourse Coherence tasks.

Table 6: Accuracies (%) for baseline encoder on Sentence Position task when using downstream classifier with or without context.