Contextualized and Generalized Sentence Representations by Contrastive Self-Supervised Learning: A Case Study on Discourse Relation Analysis

We propose a method to learn contextualized and generalized sentence representations using contrastive self-supervised learning. In the proposed method, a model is given a text consisting of multiple sentences. One sentence is randomly selected as a target sentence. The model is trained to maximize the similarity between the representation of the target sentence with its context and that of the masked target sentence with the same context. Simultaneously, the model minimizes the similarity between the latter representation and the representation of a random sentence with the same context. We apply our method to discourse relation analysis in English and Japanese and show that it outperforms strong baseline methods based on BERT, XLNet, and RoBERTa.


Introduction
Understanding the meaning of a sentence is one of the main interests of natural language processing. In recent years, distributed representations have been considered a promising way to capture the meaning of a sentence flexibly (Conneau et al., 2017; Arora et al., 2017). One typical way to obtain distributed sentence representations is to learn a task that is somehow related to sentence meaning. For example, sentence representations trained to solve natural language inference (Bowman et al., 2015; Williams et al., 2018) are known to be helpful for many language understanding tasks such as sentiment analysis and semantic textual similarity (Conneau et al., 2017; Wieting and Gimpel, 2018; Cer et al., 2018; Reimers and Gurevych, 2019).
However, there is an arbitrariness in the choice of tasks used for training. Furthermore, there is a size limitation on manually annotated data, which makes it hard to learn a wide range of language expressions.
A solution to these problems is self-supervised learning, which has been used with great success (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019). For example, inspired by skip-grams (Mikolov et al., 2013), Kiros et al. (2015) proposed to train a sequence-to-sequence model to generate the sentences before and after a given sentence and to use the encoder to compute sentence representations. Inspired by masked language modeling in BERT, Zhang et al. (2019) and Huang et al. (2020) presented methods to learn contextualized sentence representations through the task of restoring a masked sentence from its context.
In self-supervised sentence representation learning, sentence generation is typically used as the training objective. Such an objective aims to learn a sentence representation specific enough to restore the sentence, including minor details. On the other hand, when we want to handle the meaning of a larger block such as a paragraph or document (which is often called context analysis) and consider sentences as a basic unit, a more abstract and generalized sentence representation would be helpful.
We propose a method to learn contextualized and generalized sentence representations by contrastive self-supervised learning (van den Oord et al., 2019; Chen et al., 2020). In the proposed method, a model is given a text consisting of multiple sentences and computes their contextualized sentence representations. During training, one sentence is randomly selected as a target sentence. The model is trained to maximize the similarity between the representation of the target sentence with its context, which we refer to as s_pos, and the representation of the masked target sentence with the same context, which we refer to as s_anc. Simultaneously, the model is trained to minimize the similarity between the latter representation s_anc and the representation of a random sentence placed in the same context as the target sentence, which we refer to as s_neg. From the viewpoint of optimizing s_anc, this can be seen as a task to capture a generalized meaning that contextually valid sentences commonly have, utilizing s_pos and s_neg as clues. From the viewpoint of optimizing s_pos, this can be seen as a task to generalize the meaning of a sentence to the level of s_anc.
We show the effectiveness of the proposed method using discourse relation analysis as an example task of context analysis. Our experiments on English and Japanese datasets show that our method outperforms strong baseline methods based on BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019).


Proposed Method

Figure 1 illustrates the overview of our method. The encoder takes an input text consisting of T (> 1) sentences and computes their contextualized sentence representations. The encoder is trained by contrastive self-supervised learning.

Encoder
The encoder is a Transformer (Vaswani et al., 2017) with the same architecture as BERT (Devlin et al., 2019). Following Liu and Lapata (2019), we insert the 〈CLS〉 and 〈SEP〉 tokens at the beginning and the end of each sentence, respectively. The representation of each 〈CLS〉 token is used as the sentence representation of the sentence that follows it.
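As a rough illustration of this setup (not the authors' code; the model name, tokenizer, and helper function are assumptions), one could insert a 〈CLS〉-like token before and a 〈SEP〉-like token after every sentence and read each sentence's representation off its leading token:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical sketch: roberta-base stands in for the BERT-like encoder.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def encode_sentences(sentences):
    # Build one flat sequence: <CLS> sent_1 <SEP> <CLS> sent_2 <SEP> ...
    input_ids, cls_positions = [], []
    for sent in sentences:
        cls_positions.append(len(input_ids))
        ids = tokenizer.encode(sent, add_special_tokens=False)
        input_ids += [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
    hidden = encoder(torch.tensor([input_ids])).last_hidden_state  # (1, seq_len, dim)
    # The hidden state of each sentence's leading <CLS> token is its representation.
    return hidden[0, cls_positions]  # (num_sentences, dim)

reps = encode_sentences(["The weather was fine.", "We went hiking."])
print(reps.shape)
```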

Contrastive Objective
We propose a contrastive objective to learn contextualized sentence representations, aiming to capture the generalized meaning of sentences.
We first randomly select one sentence from the input text as the target sentence. In Figure 1, the k-th sentence (1 ≤ k ≤ T) is selected as the target sentence. We refer to the representation of the target sentence as s_pos. We then create another input text by masking the target sentence with the 〈SENT-MASK〉 token. We refer to the representation of the masked sentence as s_anc. We finally create yet another input text by replacing the target sentence with a random sentence. We refer to the representation of the replaced random sentence as s_neg.
Our contrastive objective is to maximize the similarity between s_pos and s_anc while minimizing the similarity between s_neg and s_anc. We use the dot product as the similarity measure. When using N random sentences per input text, the contrastive loss L is calculated as follows:

L = -\log \frac{\exp(\langle s_{\mathrm{anc}}, s_{\mathrm{pos}} \rangle)}{\sum_{s \in S} \exp(\langle s_{\mathrm{anc}}, s \rangle)}

where ⟨·, ·⟩ denotes the dot product and S = {s_pos, s_neg^1, ..., s_neg^N}. To optimize s_anc, the model needs to capture a generalized meaning that contextually valid sentences commonly have, using s_pos and s_neg as clues. On the other hand, to optimize s_pos, the model needs to generalize the meaning of a sentence to the level of s_anc.
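A minimal sketch of this loss (assuming pre-computed representations; tensor names and shapes are illustrative, and the formulation is the standard cross-entropy over dot-product similarities):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(s_anc, s_pos, s_neg):
    """s_anc, s_pos: (batch, dim); s_neg: (batch, N, dim)."""
    pos_sim = (s_anc * s_pos).sum(dim=-1, keepdim=True)   # <s_anc, s_pos>: (batch, 1)
    neg_sim = torch.einsum("bd,bnd->bn", s_anc, s_neg)    # <s_anc, s_neg^i>: (batch, N)
    logits = torch.cat([pos_sim, neg_sim], dim=-1)        # similarities over S
    # Index 0 (the positive) should win the softmax over S = {s_pos, s_neg^1, ..., s_neg^N}.
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 3, 768))
```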
The encoder is trained by optimizing the contrastive loss and the standard masked language modeling loss (Devlin et al., 2019) jointly.

Generative Objective
For comparison, we also train the encoder through the task of generating a masked sentence from its context. We first mask a sentence in the input text with the 〈SENT-MASK〉 token. Given the text, the encoder computes the representation of the masked sentence. Then, given the representation, a decoder generates the masked sentence in an autoregressive manner. The decoder's architecture is almost the same as the encoder's, but it has an additional layer on top to predict a probability distribution over words. We use teacher forcing and compute the generative loss by summing the cross-entropy at each generation step.
The encoder and decoder are trained by optimizing the generative loss and the standard masked language modeling loss jointly.
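The generative baseline can be sketched as follows. This is only an illustration of the teacher-forced, per-step cross-entropy loss: for brevity it uses a small GRU decoder in place of the Transformer decoder described above, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ToySentenceDecoder(nn.Module):
    """Autoregressive decoder conditioned on the masked-sentence representation."""
    def __init__(self, vocab_size, hidden_size=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size)  # extra layer predicting word distributions

    def forward(self, sent_rep, target_ids):
        # Teacher forcing: feed the gold tokens shifted right and predict the next token.
        inputs = self.embed(target_ids[:, :-1])
        outputs, _ = self.rnn(inputs, sent_rep.unsqueeze(0))  # condition on the masked-sentence representation
        logits = self.lm_head(outputs)
        # Generative loss: sum of the cross-entropy at each generation step.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_ids[:, 1:].reshape(-1), reduction="sum")

decoder = ToySentenceDecoder(vocab_size=32000)
loss = decoder(torch.randn(2, 768), torch.randint(0, 32000, (2, 12)))
```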

English
We use an English Wikipedia dump and BookCorpus to create input texts. We first split texts into sentences using spaCy (Honnibal et al., 2020). We then pack as many consecutive sentences as possible into each input text so that the length does not exceed the maximum input length of 128. When a sentence is so long that no input text containing it can be created within the length constraint, we give up using the sentence. The number of sentences per input text, T, was 4.91 on average. After creating input texts, we assign random sentences to each of them. Random sentences are extracted from the same document. We assign three random sentences per input text, i.e., N = 3.
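A sketch of how such input texts might be assembled (the sentence-length function and the handling of negatives are assumptions; the 128-token limit, the skipping of overly long sentences, and the within-document random sentences follow the description above):

```python
import random

def build_input_texts(doc_sentences, sent_len, max_len=128, num_negatives=3):
    """doc_sentences: sentences of one document; sent_len(s): tokenized length of s."""
    texts, current, current_len = [], [], 0
    for sent in doc_sentences:
        n = sent_len(sent)
        if n > max_len:                 # give up on sentences that cannot fit at all
            if current:
                texts.append(current)
            current, current_len = [], 0
            continue
        if current_len + n > max_len:   # start a new input text
            texts.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        texts.append(current)
    # Assign random sentences from the same document to each input text (N = num_negatives).
    return [(text, random.sample(doc_sentences, k=min(num_negatives, len(doc_sentences))))
            for text in texts]

# e.g. with a whitespace proxy for tokenized length:
# pairs = build_input_texts(sentences, sent_len=lambda s: len(s.split()))
```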
We initialize the encoder's parameters using the weights of RoBERTa BASE. The other parameters are initialized randomly. We train the model for 10,000 steps with a batch size of 512. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-5, β1 = 0.9, β2 = 0.999, linear warmup of the learning rate over the first 1,000 steps, and linear decay of the learning rate thereafter.
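The optimization schedule could be set up roughly as follows (a sketch using the transformers scheduler helper; the tiny stand-in model and dummy loss are only there to keep the example self-contained):

```python
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(768, 768)  # stand-in for the encoder initialized from RoBERTa BASE
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.999))
# Linear warmup over the first 1,000 steps, then linear decay over the remaining steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=10_000)

for step in range(10_000):
    # In the actual setup, each step processes a batch of 512 input texts and jointly
    # optimizes the contrastive loss and the masked language modeling loss.
    loss = model(torch.randn(4, 768)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```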

Japanese
We use a Japanese Wikipedia dump to create input texts. We split the texts into clauses using KNP, a widely used Japanese syntactic parser (Kawahara and Kurohashi, 2006). We create input texts and assign random sentences to them in the same way as in Section 2.4.1. The number of sentences (clauses) per input text, T, was 6.42 on average.
We initialize the encoder's parameters with BERT BASE, pretrained on a Japanese Wikipedia dump. The other details are the same as in Section 2.4.1.

Discourse Relation Analysis
We show the effectiveness of the proposed method using discourse relation analysis as a concrete example of context analysis. Discourse relation analysis is the task of predicting the logical relation between two arguments, where an argument roughly corresponds to a sentence or a clause. We conduct experiments on English and Japanese datasets.

Datasets
Penn Discourse Tree Bank (PDTB) 3.0

PDTB 3.0 is a corpus of English newspaper articles annotated with discourse relation labels (Prasad et al., 2018). We focus on implicit discourse relation analysis, where no explicit discourse marker exists. Following Kim et al. (2020), we use the Level-2 labels with more than 100 examples and adopt 12-fold cross-validation.
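The label filtering can be sketched as follows (the example format is hypothetical; the threshold of more than 100 examples follows Kim et al. (2020) as described above):

```python
from collections import Counter

def keep_frequent_labels(examples, min_count=100):
    """examples: list of (argument_pair, level2_label) tuples."""
    counts = Counter(label for _, label in examples)
    kept = {label for label, count in counts.items() if count > min_count}
    # Keep only examples whose Level-2 label has more than min_count occurrences.
    return [ex for ex in examples if ex[1] in kept]
```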

Kyoto University Web Document Leads Corpus (KWDLC)
KWDLC is a Japanese corpus consisting of the leading three sentences of web documents, annotated with discourse relation labels (Kawahara et al., 2014; Kishimoto et al., 2018). As KWDLC does not distinguish between implicit and explicit discourse relations, we target both. KWDLC has seven types of discourse relations, including NORELATION. The evaluation protocol is 5-fold cross-validation. Following Kim et al. (2020), each fold is split at the document level rather than at the level of individual examples.
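A sketch of the document-level fold assignment (assuming each example carries a document identifier; names are illustrative):

```python
import random
from collections import defaultdict

def document_level_folds(examples, num_folds=5, seed=0):
    """examples: list of dicts with a "doc_id" key; all examples of a document share a fold."""
    by_doc = defaultdict(list)
    for ex in examples:
        by_doc[ex["doc_id"]].append(ex)
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)
    folds = [[] for _ in range(num_folds)]
    for i, doc_id in enumerate(doc_ids):
        folds[i % num_folds].extend(by_doc[doc_id])
    return folds
```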

Model
Table 1: Results of implicit discourse relation analysis on PDTB 3.0 using the Level-2 label set (Kim et al., 2020). Gen and Con indicate that the encoder is further pretrained by optimizing the generative objective and the contrastive objective, respectively. The scores are the mean and standard deviation over folds.

We train two types of models: one uses the context of the arguments, and the other does not. When a model uses context, it is given the paragraph that contains the arguments of interest. In this setting, the paragraph is first split into sentences. The arguments are each treated as a single sentence, and their context is split in the way described in Section 2.4. Then, the encoder computes the representation of each sentence in the same manner as in Section 2.1. Given the concatenation of the arguments' representations, a relation classifier predicts the discourse relation. As the relation classifier, we employ a multi-layer perceptron with one hidden layer and ReLU activation. When a model does not use context, it is given only the arguments of interest. In this setting, we use the sentence pair classification method proposed by Devlin et al. (2019).
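A minimal sketch of the context-using classifier head (dimensions and the label count are illustrative; the argument representations come from the encoder of Section 2.1):

```python
import torch
from torch import nn

class RelationClassifier(nn.Module):
    """MLP with one hidden layer and ReLU over the concatenated argument representations."""
    def __init__(self, hidden_size=768, num_labels=7):  # e.g. the seven relation types in KWDLC
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, arg1_rep, arg2_rep):
        # arg1_rep, arg2_rep: contextualized representations of the two arguments.
        return self.mlp(torch.cat([arg1_rep, arg2_rep], dim=-1))

clf = RelationClassifier()
logits = clf(torch.randn(8, 768), torch.randn(8, 768))  # (8, num_labels)
```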
Our proposed method is introduced into a context-using model by initializing its encoder's parameters with those of our sentence encoder. In the experiments, we report how performance differs depending on the model used for initialization.

Implementation Details
Input texts are truncated to the maximum input length of 512, which is long enough to hold almost all inputs. We train models for up to 20 epochs. At the end of each epoch, we compute the performance on the development data and adopt the model with the best performance. If the performance does not improve for five epochs, we stop training. We use the Adam optimizer with a learning rate of 2e-5, β1 = 0.9, and β2 = 0.999. We update all the parameters in the models, i.e., the pretrained sentence encoders are fine-tuned to solve discourse relation analysis.


Results

Table 1 shows the results for PDTB 3.0. The evaluation metric is accuracy. The highest performance was achieved by the proposed method. To our knowledge, this is the state-of-the-art performance among models with the same parameter size as BERT BASE. The model that optimized the generative objective was inferior not only to the proposed method but also to vanilla RoBERTa with context.
Table 2 shows the results for KWDLC. The evaluation metrics are accuracy and micro-averaged precision, recall, and F1. The highest performance was again achieved by the proposed method. The decrease in performance when optimizing the generative objective is consistent with the experimental results on PDTB 3.0.

Qualitative Analysis
We show an example of discourse relation analysis in KWDLC.
(1) [新潟県にある国営公園・越後丘陵公園へ、１泊で遊びに出掛けようと]Arg1 [思い立ちました。]Arg2
[I want to go to a government-managed park in Niigata Prefecture for an overnight visit,]Arg1 [I came up with that.]Arg2

Label: NORELATION
Arguments are enclosed in [ and ]. The models except ours erroneously predicted the discourse relation of PURPOSE between Arg1 and Arg2. This is probably because the Japanese postpositional particle "と" can be a discourse marker of PURPOSE. For example, if Arg2 were "荷造りを始めた (I started packing)," the prediction would be correct. However, in this case, the postpositional particle "と" is used to construct a sentential complement. That is, Arg1 is the object of Arg2. It is not possible to distinguish between the two usages from the surface form alone. Our model correctly predicted the discourse relation of NORELATION, which implies that our method understood that Arg1 is a sentential complement.
We show another example of implicit discourse relation analysis in KWDLC.
Only the proposed model correctly recognized the discourse relation of CAUSE/REASON. We speculate that the models other than ours failed to understand Arg1 at the level of "a happy event occurred."

Table 3: Results of sentence retrieval based on the cosine similarity between sentence representations computed by our method. "·" indicates a sentence. The query and retrieved sentences are marked in bold, and their contexts are shown together. The numbers indicate the rank of sentence retrieval.

Sentence Retrieval
To investigate what is learned by our contrastive objective, we performed sentence retrieval based on the similarity between sentence representations. For targets, we randomly sampled 500,000 sentences with context from the input texts used for training. For a query, we used a sentence with context from a Wikipedia article. Computing the sentence representations for the targets and the query, we searched for the closest sentences based on their cosine similarity. Table 3 shows an example. In addition to the top-ranked sentences, we also picked up some highly ranked sentences. The top two sentences were very similar to the query sentence in terms of topic, meaning, and context. While the lower-ranked sentences had different topics from the query sentence, they all described a positive aspect of an entity and had a similar context in that an entity is introduced in the preceding sentences. We confirmed that almost the same results were obtained in Japanese. We leave a quantitative evaluation of sentence retrieval for future work.
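The retrieval step itself reduces to a nearest-neighbor search over the pre-computed representations; a minimal sketch (tensor sizes here are dummies, whereas the experiment above uses 500,000 targets):

```python
import torch
import torch.nn.functional as F

def retrieve(query_rep, target_reps, k=5):
    """Return the indices of the k targets closest to the query by cosine similarity."""
    sims = F.cosine_similarity(query_rep.unsqueeze(0), target_reps, dim=-1)  # (num_targets,)
    return torch.topk(sims, k=k).indices

targets = torch.randn(10_000, 768)  # stand-in for the 500,000 target sentence representations
query = torch.randn(768)
print(retrieve(query, targets))
```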

Conclusion
We proposed a method to learn contextualized and generalized sentence representations using contrastive self-supervised learning. Experiments showed that the proposed method improves the performance of discourse relation analysis in both English and Japanese. We leave an in-depth analysis of the level of abstraction learned by the proposed method for future work.