BERT-enhanced Relational Sentence Ordering Network

In this paper, we introduce a novel BERT-enhanced Relational Sentence Ordering Network (referred to as BERSON) by leveraging BERT to capture better dependency relationships among sentences and thus enhance coherence modeling for the entire paragraph. In particular, we develop a new Relational Pointer Decoder (referred to as RPD) by incorporating relative ordering information into the pointer network with a Deep Relational Module (referred to as DRM), which utilizes BERT to exploit the deep semantic connection and relative ordering between sentences. This enables us to strengthen both local and global dependencies among sentences. Extensive evaluations are conducted on six public datasets. The experimental results demonstrate the effectiveness and promise of BERSON, showing a significant improvement over the state-of-the-art by a wide margin.


Introduction
Coherence modeling is one of the essential aspects of natural language processing (Mesgar et al., 2019; Moon et al., 2019; Farag and Yannakoudakis, 2019). A coherent text facilitates understanding and avoids confusion in reading comprehension. The Sentence Ordering task (Barzilay and Lapata, 2008) aims to reconstruct a coherent paragraph from an unordered set of sentences and has been shown to be beneficial for improving coherence in many NLP tasks, including multi-document summarization (Barzilay and Elhadad, 2002; Nallapati et al., 2017), conversational analysis (Zeng et al., 2018), and text generation (Konstas and Lapata, 2013; Holtzman et al., 2018). Table 1 shows an example of this task.
Table 1: Illustration of the sentence ordering task. It aims to reorganize an unordered set of sentences into a coherent paragraph.

In recent years, several approaches based on ranking or sorting frameworks have been developed to deal with this task. RankTxNet (Kumar et al., 2020) computes a score for each sentence and sorts these scores with ranking based loss functions. Pairwise Model adopts a pairwise ranking algorithm to learn the relative order of each sentence pair. B-TSort (Prabhumoye et al., 2020) predicts the constraint between two sentences and uses the topological sort technique to find the ordering.
On the other hand, to better capture the global coherence, the pointer network (Vinyals et al., 2015) has gradually been used as the decoder of ordering models. It is able to capture the paragraph-level contextual information for generating an ordered sequence with the highest coherence probability (Gong et al., 2016; Logeswaran et al., 2018; Cui et al., 2018; Yin et al., 2019). Further, HAN (Wang and Wan, 2019) and TGCM (Oh et al., 2019) introduce the attention mechanism (Vaswani et al., 2017), and FUDecoder (Yin et al., 2020) proposes pairwise ordering prediction modules to enhance the traditional pointer network.
Despite having achieved great successes, pairwise ranking and pointer network-based ordering approaches have a few problems. The former focuses on learning the local relationship between sentence pairs, but may have trouble capturing the global interactions among all the sentences. The latter overlooks the importance of learning the relative order between sentence pairs through the encoder-decoder, and lacks enough local interactions among sentences.

Figure 1: The architecture of the proposed BERSON. Given an unordered set of sentences, our BERT-based Hierarchical Relational Sentence Encoder first builds the high-level representation for each input sentence. Then, a self-attention based paragraph encoder is employed for paragraph encoding. Finally, the proposed Relational Pointer Decoder generates an ordered output sequence. For the sentence generation at the 3rd timestep in the decoder, $s_1$ and $s_2$ are the previously sorted sentences, and $s_3$, $s_4$, and $s_5$ are the unsorted ones. Here, we use the candidate sentence $s_3$ as an example to illustrate how to encode its relative ordering information in the pointer network based on the Deep Relational Module. Please refer to Section 2.4 for more details of the decoder.
To address the above limitations, in this paper, we propose a novel BERT-enhanced Relational Sentence Ordering Network (referred to as BERSON) by integrating BERT (Devlin et al., 2019) with the pointer network to fully exploit the pairwise relationships between sentences for better coherence modeling. Specifically, we first introduce a BERT-based Hierarchical Relational Sentence Encoder, which takes sentence pairs as the input to the model and learns a high-level representation for each sentence. Next, a Self-Attention based Paragraph Encoder is adopted for paragraph encoding.
Building upon the above pairwise sentence and paragraph encoding, a novel Relational Pointer Decoder (referred to as RPD) is developed by incorporating the informative relative ordering information into the pointer network with a Deep Relational Module (referred to as DRM). This module leverages the Next Sentence Prediction objective of BERT to learn the relative ordering between sentences and constructs a pairwise relationship representation for each sentence pair, which helps RPD not only exploit the global orientation information among unordered sentences but also consider the local coherence between the candidate sentence and the previously sorted ones. Thus, RPD is able to generate a more coherent order assignment for the input sentences. In addition, the pairwise ordering prediction loss is also added as the auxiliary objective to guide the coherence modeling in the training procedure. The overall architecture of our model is presented in Figure 1.
Extensive experiments are conducted on six public datasets in different domains to evaluate the performance of BERSON. The results show that BERSON significantly outperforms the existing approaches by a wide margin and achieves state-of-the-art performance on all the datasets and under all the evaluation metrics.

Relational Sentence Ordering Network
In this section, we start by formulating the sentence ordering problem and then present the proposed model BERSON, which is composed of a BERT-based Hierarchical Relational Sentence Encoder, a Self-Attention based Paragraph Encoder, and a Relational Pointer Decoder enhanced by a new Deep Relational Module to model the text coherence in a more effective way.

Problem Definition
Given an out-of-order set of $N$ sentences $s = [s_1, s_2, \cdots, s_N]$, where $s_i = [w_{i1}, w_{i2}, \cdots, w_{il_i}]$ and $l_i$ is the number of words in sentence $s_i$, the model aims to recover the correct order $o = [o_1, o_2, \cdots, o_N]$ for these sentences.

Hierarchical Relational Sentence Encoder
The sentence encoder is designed based on BERT with sentence pairs in the set as input, and further adopts two-level attention layers to encode the hierarchical semantic concepts and contextual information of each sentence. Formally, for the given $N$ sentences in the set, all the ordered pairs of sentences form the set $P = \{P_{ij}\}_{i \ne j}$, where $P_{ij}$ represents the sentence pair $\langle s_i, s_j \rangle$. The total number of sentence pairs is $|P| = A_N^2 = N(N-1)$. These sentence pairs are fed into BERT to not only learn the sentence representations but also capture the pairwise relationship between sentences.
As shown in the left part of Figure 1, given a sentence pair $P_{ij} = \langle s_i, s_j \rangle$, the input sequence of this pair to the BERT model consists of a [CLS] token, the first sentence $s_i$ in the pair, a separator token [SEP], and the second sentence $s_j$. The BERT model encodes the representation for this pair, where $C_{ij}$ and $S_{ij}$ are the final hidden states of the [CLS] and [SEP] tokens, and $h^{P_{ij}}_{i1}, \cdots, h^{P_{ij}}_{il_i}$ and $h^{P_{ij}}_{j1}, \cdots, h^{P_{ij}}_{jl_j}$ are the output word representations of sentences $s_i$ and $s_j$ in this pair, with sequence lengths $l_i$ and $l_j$ respectively.
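As a concrete illustration, the following sketch shows how such a pairwise input could be assembled with the HuggingFace transformers library; the model name, example sentences, and variable names are illustrative assumptions rather than the paper's actual code.

```python
import torch
from transformers import BertTokenizer, BertModel

# Illustrative sketch: encoding one sentence pair (s_i, s_j) with BERT.
# bert-base-uncased is an assumption; the paper only states BERT-base is used.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

s_i = "He poured the batter into the pan."
s_j = "Then he baked it for thirty minutes."

# Builds the sequence [CLS] s_i [SEP] s_j [SEP];
# token_type_ids distinguish the two sentences in the pair.
enc = tokenizer(s_i, s_j, return_tensors="pt")
out = bert(**enc)

hidden = out.last_hidden_state   # (1, seq_len, 768): word-level representations
c_ij = hidden[:, 0]              # final hidden state of [CLS], i.e. C_ij
```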
After the BERT encoder, we compose a fixed-dimensional representation for each sentence.
For sentence $s_i$ in pair $P_{ij}$, the representations of its words are combined with an attention mechanism to obtain its sentence representation $h^{P_{ij}}_i$, where $W_w$ and $v_w$ are learnable parameters. Attention allows the model to concentrate on the words that are informative for coherence and helps build a better semantic representation. Similarly, we also compute the representation $h^{P_{ij}}_j$ for $s_j$ in pair $P_{ij}$. Further, all the sentence pairs related to sentence $s_i$ are considered; the number of such pairs is $2N - 2$. The corresponding representations of $s_i$ obtained from these pairs are collected. Since the embeddings of $s_i$ in different pairs capture different context features, to reward the most salient features that contribute highly to the overall contextual meaning of the sentence, a high-level attention mechanism is adopted to establish the final representation $x_i$ for sentence $s_i$, where $W_s$ and $v_s$ are also trainable weights. Essentially, as all the related sentence pairs are considered equally, this representation is guaranteed to be invariant to the input sentence order and is therefore logically reliable to use in our model.
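Since the displayed equations for the two attention levels are omitted above, the following hedged sketch shows one common additive-attention formulation consistent with the parameters $W_w$, $v_w$, $W_s$, and $v_s$ mentioned in the text; the exact form used in the paper may differ.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Additive attention pooling, score = v^T tanh(W h).
    The same form is assumed for both attention levels (an assumption)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.W = nn.Linear(dim, dim)             # plays the role of W_w / W_s
        self.v = nn.Linear(dim, 1, bias=False)   # plays the role of v_w / v_s

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, length, dim) -> pooled vector (batch, dim)
        scores = self.v(torch.tanh(self.W(h)))   # (batch, length, 1)
        alpha = torch.softmax(scores, dim=1)     # attention weights
        return (alpha * h).sum(dim=1)

word_pool = AttentionPool()   # words of s_i in pair P_ij -> h_i^{P_ij}
pair_pool = AttentionPool()   # the 2N-2 pair-level embeddings of s_i -> x_i
```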

Paragraph Encoder
After the sentence encoder, a self-attention based paragraph encoder is employed to capture the global dependency among all the sentences. Specifically, the sentence representations obtained from the sentence encoder are packed together into a paragraph matrix $X = [x_1, \cdots, x_N]$, taken as $X^{(1)}$, which is then fed through $L$ self-attention layers (Vaswani et al., 2017). For the $l$-th layer, the output matrix $X^{(l)}$ is computed with the multi-head attention function MultiHead(·), the fully-connected feed-forward network FFN(·), and the layer normalization operation LN(·) (Ba et al., 2016). The final paragraph vector $m$ is generated by averaging the rows of the output matrix $X^{(L)}$ from the last self-attention layer, i.e., $m = \frac{1}{N}\sum_{n=1}^{N} X^{(L)}_n$, where $X^{(L)}_n$ is the $n$-th row of $X^{(L)}$. This vector is then used as the initial state of our decoder.
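A minimal PyTorch sketch of such a paragraph encoder is shown below, with standard transformer encoder layers standing in for the MultiHead/FFN/LN composition described above (the paper does not spell out its exact residual arrangement, so this is an assumption); the hyper-parameters follow the experimental setup (2 layers, 8 heads, hidden size 768).

```python
import torch
import torch.nn as nn

class ParagraphEncoder(nn.Module):
    """Sketch of the self-attention paragraph encoder: L transformer layers
    over the sentence matrix X, then mean pooling to obtain the paragraph
    vector m used as the decoder's initial state."""
    def __init__(self, dim: int = 768, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, X: torch.Tensor):
        # X: (batch, N, dim) sentence representations from the sentence encoder
        X_L = self.encoder(X)      # X^(L): contextualised sentence matrix
        m = X_L.mean(dim=1)        # paragraph vector m = mean of the rows
        return X_L, m
```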

Relational Pointer Decoder
In this section, we propose a Relational Pointer Decoder (RPD), which utilizes the useful relative ordering information to enhance the pointer network with a Deep Relational Module (DRM). In the following, we first describe the new module DRM and then incorporate it into the pointer network to strengthen the coherence modeling in the decoder.

Deep Relational Module
Our Deep Relational Module is built on the BERT model and aims to capture better dependency relationships between sentences. The architecture of this module is shown in the middle part of Figure 1.
In particular, as illustrated in Section 2.2, given the sentence pair $P_{ij}$, the embedding of the [CLS] symbol from the top layer of BERT is denoted as $C_{ij}$. Owing to the Next Sentence Prediction pre-training objective of BERT, this vector $C_{ij}$ is able to aggregate the semantic relations for the input sentence pair and is capable of identifying the relative order between two sentences. Therefore, we take full advantage of this vector to exploit the latent dependency between sentences.
Further, a probability distribution $P(r \mid s_i, s_j)$ with $r \in \{\text{before}, \text{after}\}$ is generated from $C_{ij}$, where $W_c$ denotes the learnable weights; it measures the probability of $s_i$ occurring before or after $s_j$.
In order to obtain richer pairwise relation information for the sentence pair, we combine the above semantic feature $C_{ij}$ and the probability distribution together into a new vector $R_{ij}$, which is considered as the relational representation for the sentence pair $(s_i, s_j)$ and is then leveraged to provide relative order information for the pointer network. We compute such pairwise relational representations for all the sentence pairs in the paragraph, and utilize a subset of them at each step of the decoder. Different from the previous method of using learned sentence vectors to calculate the pairwise relationship between sentences (Yin et al., 2020), DRM employs the whole sequence of the sentence pair as the input to BERT. This allows us to directly relate words from different sentences, which is a more straightforward way to exploit the intrinsic relations and coherence between sentences. Further, instead of relying on modules trained from scratch to control the pairwise ordering predictions (Yin et al., 2020), DRM adopts BERT as the main building block to obtain a pairwise relationship representation for the sentence pair. Intuitively, because BERT has been pre-trained on a large corpus, this representation encodes more reliable and accurate relative ordering information, and is thus more effective in helping determine the pairwise ordering predictions in the decoder.
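The computation of $R_{ij}$ can be sketched as follows, under the assumption that the before/after distribution is produced by a single linear layer followed by softmax (the paper only states that $W_c$ is a learnable weight):

```python
import torch
import torch.nn as nn

class DeepRelationalModule(nn.Module):
    """Sketch of DRM: from the pair's [CLS] vector C_ij, predict
    P(r | s_i, s_j) over {before, after} and concatenate it with C_ij to
    form the relational representation R_ij. Layer sizes are assumptions."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.W_c = nn.Linear(dim, 2)   # before / after classifier

    def forward(self, c_ij: torch.Tensor):
        # c_ij: (batch, dim), the [CLS] embedding of pair (s_i, s_j)
        p_rel = torch.softmax(self.W_c(c_ij), dim=-1)   # P(r | s_i, s_j)
        r_ij = torch.cat([c_ij, p_rel], dim=-1)         # relational vector R_ij
        return r_ij, p_rel
```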

Integrating DRM with Pointer Network
As illustrated in the right part of Figure 1, the Relational Pointer Decoder (RPD) incorporates the Deep Relational Module into the pointer network to promote the coherence modeling among sentences.
Formally, the conditional coherence probability of a predicted order $o$ for the given out-of-order sentence set $s$ can be factorized over decoding steps as $P(o \mid s) = \prod_{i=1}^{N} P(o_i \mid o_1, \cdots, o_{i-1}, s)$. A higher probability indicates a more coherent sentence assignment. We employ an LSTM-based pointer network as the basis of our decoder. At the $i$-th step, the softmax function produces an output distribution over all unordered sentences (candidate sentences), and the sentence with the highest probability in this distribution is selected for position $i$.
The matrix $Z_i$ encodes the relationship representation information of each candidate sentence with the other sentences in the set. For one candidate sentence, the other sentences can be divided into two groups: the previously sorted subset and the unsorted subset. The relative ordering information between the candidate sentence and the two groups of sentences is captured by the proposed DRM through its two versions: the Ordered Module and the Unordered Module, respectively. On the one hand, such modeling helps evaluate the local coherence between the previously sorted sentences and the candidate sentence, for investigating the rationality of each candidate choice. On the other hand, the global relative orientation information of the other unsorted sentences with respect to the candidate one provides further clues for the current prediction. Thus, both the local dependency information and the global orientation are fully exploited in RPD.
For the Ordered Module, the pairwise relationship between the sentence $s_{o_{i-1}}$ predicted at step $i-1$ and the candidate sentence $s_c$ can be effectively measured by our deep relational module with a relational representation, which not only encodes the semantic relations between the two sentences, but also includes the probability of whether sentence $s_c$ truly appears after $s_{o_{i-1}}$ or not. In a similar way, relational embeddings are generated with all the previously ordered sentences. Then, we compose a high-level local coherence representation $e_l(s_c)$ for this candidate sentence by integrating these relational embeddings to summarize the overall local dependency for $s_c$. For the Unordered Module, the relative orientation of another unordered sentence $s_g$ with respect to the candidate sentence $s_c$ can also be captured by our relational embedding. Considering all the other unsorted sentences in the set $S_c$ (the unordered sentences excluding $s_c$), a hierarchical global orientation representation $e_g(s_c)$ for $s_c$ is obtained. Subsequently, to leverage the relative ordering information encoded by the Ordered and Unordered Modules simultaneously, the representations $e_l(s_c)$ and $e_g(s_c)$ are integrated, which allows us to build a more informative relational vector $e_r(s_c)$ for sentence $s_c$. Finally, a new representation for the candidate sentence $s_c$ is obtained by combining its sentence embedding and the relational vector $e_r(s_c)$. Such a representation is generated for all unsorted sentences, and these are then packed into the matrix $Z_i$ for order prediction. During inference, we use beam search to select sentences sequentially.
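The following sketch outlines one decoding step of RPD under explicit assumptions: mean pooling aggregates the ordered and unordered relational embeddings into $e_l(s_c)$ and $e_g(s_c)$, and a linear layer fuses them with the sentence embedding to form the rows of $Z_i$; the paper does not specify these aggregation and fusion operators, so this is illustrative only.

```python
import torch
import torch.nn as nn

class RPDStep(nn.Module):
    """One decoding step of the Relational Pointer Decoder (illustrative).
    rel_dim = 770 assumes R_ij = [C_ij; P(r)] with 768 + 2 dimensions."""
    def __init__(self, dim: int = 768, rel_dim: int = 770):
        super().__init__()
        self.fuse = nn.Linear(dim + 2 * rel_dim, dim)   # builds rows of Z_i
        self.W_h = nn.Linear(dim, dim)
        self.W_z = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, h_i, cand_emb, rel_ordered, rel_unordered):
        # h_i: (dim,) LSTM hidden state at step i
        # cand_emb: (K, dim) embeddings of the K unsorted candidate sentences
        # rel_ordered: (K, n_prev, rel_dim) relational vectors w.r.t. sorted sentences
        # rel_unordered: (K, n_rest, rel_dim) relational vectors w.r.t. unsorted ones
        e_l = rel_ordered.mean(dim=1)      # local coherence representation e_l(s_c)
        e_g = rel_unordered.mean(dim=1)    # global orientation representation e_g(s_c)
        z_i = self.fuse(torch.cat([cand_emb, e_l, e_g], dim=-1))  # rows of Z_i
        scores = self.v(torch.tanh(self.W_h(h_i) + self.W_z(z_i))).squeeze(-1)
        return torch.softmax(scores, dim=-1)   # distribution over candidates
```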

Model Training
Assume that there are $Q$ paragraphs in the training set $\mathcal{Q} = \{(s, o)\}$. Following the existing ordering networks (Gong et al., 2016; Oh et al., 2019), the model is trained to maximize the coherence probability by minimizing the corresponding negative log-likelihood loss, where $\theta$ denotes all the trainable parameters.
To further exploit the correct relative order information, we add the Pairwise Ordering Prediction Loss (Ploss) as an auxiliary objective $L_p$. It is defined as the cross-entropy loss obtained by minimizing the negative log-likelihood of each pair's ground-truth relative ordering label $y_{ij} \in \{0, 1\}$ given the network's prediction. The final training objective of our model combines the coherence loss and $L_p$, where $\alpha$ is a coefficient that balances the influence of the two loss functions.
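The combined objective can be sketched as follows; the tensor layouts and the exact form of the pointer loss are assumptions for illustration, consistent with the description above.

```python
import torch
import torch.nn.functional as F

def berson_loss(pointer_logits, gold_positions, pair_logits, pair_labels, alpha):
    """Sketch of the training objective: pointer (coherence) loss plus the
    Pairwise Ordering Prediction Loss weighted by alpha."""
    # pointer_logits: (steps, K) scores over candidates at each decoding step
    # gold_positions: (steps,)   index of the correct next sentence
    l_coherence = F.cross_entropy(pointer_logits, gold_positions)
    # pair_logits: (num_pairs, 2) before/after scores; pair_labels: (num_pairs,)
    l_p = F.cross_entropy(pair_logits, pair_labels)
    return l_coherence + alpha * l_p
```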

Experiments
In this section, we empirically evaluate the effectiveness of BERSON in the sentence ordering task.

Datasets
The experiments are conducted on six public datasets from different domains.

NIPS abstract, AAN abstract, NSF abstract, arXiv abstract: These datasets contain abstracts of research papers. NIPS abstract is from NIPS conference papers, where papers from the years 2005-2013, 2014, and 2015 are used for training, validation, and testing respectively (Logeswaran et al., 2018). AAN abstract (Logeswaran et al., 2018) contains abstracts of papers from the ACL Anthology Network, with its own training/validation/testing split. ArXiv abstract (Gong et al., 2016) is from the arXiv website. The validation and test sets of this dataset are the first and last 10% of abstracts from the shuffled data, and the remaining data are used for training.

SIND, ROCStory: SIND is a visual storytelling dataset, which is released with a training/validation/testing split of 8:1:1. ROCStory is a commonsense story dataset (Wang and Wan, 2019). It is randomly split 8:1:1 into training/validation/test sets. Both datasets contain five sentences per story. Table 2 shows the details of all the datasets.

Evaluation Metrics
Following the existing work (Oh et al., 2019), we employ the three most commonly used metrics in this task to assess the model performance:

Accuracy (Acc): This metric calculates the ratio of sentences whose absolute positions are correctly predicted (Logeswaran et al., 2018).

Perfect Match Ratio (PMR): This metric calculates the ratio of paragraphs for which the predicted order exactly matches the gold order.

Kendall's Tau ($\tau$): It is computed as $\tau = 1 - 2 \cdot (\#\,\text{inversions}) / \binom{N}{2}$, where $\#$ inversions denotes the number of pairs in the predicted sequence with the incorrect relative order (Lapata, 2003). The score ranges from -1 (the worst) to 1 (the best).

A higher score indicates a better performance for all the metrics.
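For concreteness, the metrics can be computed as in the following sketch, where orders are represented as lists of sentence indices; this is an illustrative implementation, not the evaluation script used in the paper.

```python
from itertools import combinations

def accuracy(pred, gold):
    """Ratio of sentences placed at their correct absolute position."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def pmr(pred_orders, gold_orders):
    """Perfect Match Ratio: fraction of paragraphs ordered exactly right."""
    return sum(p == g for p, g in zip(pred_orders, gold_orders)) / len(gold_orders)

def kendall_tau(pred, gold):
    """tau = 1 - 2 * inversions / C(N, 2) (Lapata, 2003)."""
    n = len(gold)
    rank = {s: i for i, s in enumerate(pred)}          # position of each sentence
    inversions = sum(1 for a, b in combinations(gold, 2) if rank[a] > rank[b])
    return 1 - 2 * inversions / (n * (n - 1) / 2)
```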

Experimental Setup
We adopt BERT-base in the experiments and fine-tune it on each dataset. The paragraph encoder has 2 self-attention layers with 8 heads. The hidden size is 768 and the beam size is 16. Adam is employed as the optimizer. To search for the optimal hyper-parameters, we adopt the grid search strategy for the learning rate from {2e-5, 5e-5}, the batch size from {8, 16, 32}, the number of epochs from {5, 10, 20}, and the coefficient α in the loss function from {0.2, 0.4, 0.6, 0.8, 1.0}. The model with the best performance on the validation set is selected for each setting. The recommended hyper-parameter configuration of the model on each dataset is presented in Table 3. To diminish the effects of randomness in training, the results of our model are averaged over 5 random initializations. For data preprocessing, we use the tokenizer from BERT to preprocess the sentences. The experiments are conducted on a GeForce GTX 1080Ti GPU with the PyTorch framework.
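The grid search described above can be enumerated as in this small sketch; the training and validation loops are omitted, and the selection criterion is validation performance as stated.

```python
from itertools import product

# Search space from the text; each configuration would be trained and the one
# with the best validation performance kept.
grid = {
    "learning_rate": [2e-5, 5e-5],
    "batch_size": [8, 16, 32],
    "epochs": [5, 10, 20],
    "alpha": [0.2, 0.4, 0.6, 0.8, 1.0],
}
configs = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]
print(len(configs))  # 2 * 3 * 3 * 5 = 90 candidate configurations
```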

Baselines
To demonstrate that BERSON truly improves the sentence ordering performance, we compare it with the state-of-the-art methods in this task, which can be categorized into two classes: (1) Ranking or sorting frameworks: Pairwise Model; RankTxNet (Kumar et al., 2020); B-TSort (Prabhumoye et al., 2020). (2) Pointer network based models, including HAN (Wang and Wan, 2019), TGCM (Oh et al., 2019), and FUDecoder (Yin et al., 2020).
In addition to the above existing approaches, we also investigate three variants of BERSON. BertSenPD: This model replaces the ranking module in RankTxNet with the traditional pointer network decoder (PD). Please note that it uses a single sentence rather than a sentence pair as the input to BERT to obtain the sentence vector. BertPairPD, HRSEPD: These two models employ the sentence pair encoding strategy with BERT and utilize PD instead of our RPD as the decoder. HRSEPD adopts the proposed Hierarchical Relational Sentence Encoder (HRSE), while BertPairPD does not have the two-level attention layers in the encoder. They aim to investigate the impact of both the Hierarchical Relational Sentence Encoder and the Relational Pointer Decoder.

Table 4: Comparison results for different models on the sentence ordering task. The best and second-best results are in bold and underlined respectively.

Main Results
The experimental results are reported in Table 4.
As we see, BERSON achieves the state-of-the-art performance on all the datasets and under all the evaluation metrics.
The results show that BERSON significantly outperforms all the existing methods by a large margin. BERSON shows remarkable improvements over the existing best systems of 12.39% and 16.77% in accuracy score on the NIPS and arXiv datasets, and gains of 15.42%, 11.37%, and even 22.23% in PMR score on the NIPS, SIND, and ROCStory datasets respectively, which strongly demonstrates the effectiveness of our model.
Compared with the existing ranking approaches, our BertSenPD baseline performs much better than RankTxNet with stable improvements, which confirms the superiority of the traditional pointer network over the ranking module used in their model. This could be because RankTxNet only computes a score for each sentence in parallel, which overlooks the coherence of the whole predicted sequence and may have trouble generating a more coherent order assignment. Besides, although B-TSort outperforms RankTxNet with clear improvements, it only considers the sentence-pair interactions and does not take the entire paragraph into account. Therefore, B-TSort is limited by the lack of a global structure and falls behind other baselines in Acc and PMR scores on the large NSF abstract dataset. In contrast, BERSON not only captures the local coherence between every two sentences but also obtains the paragraph-level contextual information for the global dependency, hence being more competitive in sentence ordering.
In addition, among the pointer network based ordering models, FUDecoder exhibits a better performance. However, the ordering prediction modules of FUDecoder are built on two nonlinear layers trained from scratch, with the learned sentence vectors as the input, which makes it difficult to fully explore the latent dependency among sentences. When these modules are not sufficiently trained, especially on small datasets, they may mislead the decoder with wrong relative orientation information. Our BERSON overcomes this limitation of FUDecoder by utilizing the BERT model as the main building block of our DRM to improve the pairwise ordering strategy. As shown in Table 4, BERSON outperforms FUDecoder with significant improvements of about 14.32% and 22.23% in PMR score on SIND and ROCStory respectively, which demonstrates the promise of incorporating a more reliable ordering module into the decoder to ensure more accurate relative ordering information.
Moreover, for the variants of our model, BertPairPD and HRSEPD perform better than BertSenPD on all the datasets. This shows that with a sentence pair instead of a single sentence as the input to BERT, the model directly builds the interactions between words from different sentences, which captures richer contextual information for each sentence and is more beneficial for modeling the relations among sentences. Besides, HRSEPD outperforms BertPairPD with stable improvements, which reflects the strength of our Hierarchical Relational Sentence Encoder. Furthermore, by adopting our Relational Pointer Decoder in place of the traditional pointer network, BERSON achieves further improvements across the datasets, which demonstrates the advantage of enhancing the pointer network with DRM to reach a superior performance.

Table 5: Ablation studies on the arXiv and ROCStory datasets. We remove various modules and explore their influence on our model.

Ablation Study
Further, to better understand the contributions of different components in our Relational Pointer Decoder, we conduct ablation studies on the arXiv and ROCStory datasets, which are the largest datasets in their respective domains and thus provide a more reliable analysis. The results are reported in Table 5.
Effect of Ordered and Unordered Modules: It is observed that the removal of either module hurts the model performance dramatically, though these variants still outperform our baseline models BertPairPD and HRSEPD. Compared with removing the Ordered Module, the lack of the Unordered Module leads to more noticeable drops, which indicates that the relative orientations between unsorted sentences are more important for order predictions. The superior performance of BERSON over these two variants shows the necessity of having both modules in RPD to leverage the global orientation and local dependency information simultaneously for better coherence modeling.

Table 6: Accuracy of predicting the first and the last sentences on the arXiv and SIND datasets.

Effect of Pairwise Ordering Prediction Loss:
As shown in Table 5, removing the Pairwise Ordering Prediction Loss (Ploss) in the training procedure causes a performance degradation on both datasets. This proves the benefit of encouraging accurate relative ordering information through the loss function. As the coefficient α in Equation 16 directly controls the impact of Ploss, we further study how the value of this coefficient affects the performance of BERSON. Figure 3 shows the accuracy scores on the arXiv and ROCStory datasets. It is shown that α = 0.8 and α = 0.6 are superior to the other settings for arXiv and ROCStory respectively. Thus, it is essential to choose an appropriate value to balance the importance of Ploss and the original training objective for BERSON.

Analysis
In this section, we delve into further analysis to investigate the stability and adaptability of the proposed model.

Prediction of First and Last Sentences
Previous studies (Oh et al., 2019; Yin et al., 2019) have mentioned that both the first and last sentences play crucial roles in a paragraph due to their special positions. Thus, we also report the performance of our models in correctly predicting these two sentences on the arXiv and SIND datasets. As summarized in Table 6, both of the two variants BertSenPD and BertPairPD outperform the existing state-of-the-art methods. BERSON achieves further boosts and reaches the best performance on both datasets. For identifying the last sentence, BERSON obtains significant improvements over RankTxNet, with gains of 7.56% and 5.19% on the arXiv and SIND datasets respectively, which also indicates the benefits of the proposed model.

Sentence Displacement Analysis
Additionally, we analyze the displacement of sentences in the predicted orders by calculating the percentage of sentences whose predicted location is within one, two, or three positions of their original location (Prabhumoye et al., 2020). A higher score is better, denoting less displacement of sentences. As summarized in Table 7, BERSON also achieves a better performance than B-TSort across the datasets and for all window sizes, especially the smaller ones. BERSON even reaches 99% when the window size is 3 on the NIPS, AAN, and SIND datasets, which clearly demonstrates the promise of BERSON.
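This displacement measure can be computed as in the following sketch (orders are lists of sentence indices; an illustrative implementation, not the paper's evaluation code):

```python
def displacement_within(pred, gold, window):
    """Fraction of sentences whose predicted position is within `window`
    positions of their gold position (Prabhumoye et al., 2020)."""
    gold_pos = {s: i for i, s in enumerate(gold)}
    hits = sum(1 for i, s in enumerate(pred) if abs(i - gold_pos[s]) <= window)
    return hits / len(gold)
```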

Performance on Longer Paragraphs
Following the prior approach (Prabhumoye et al., 2020), we also evaluate the model performance on paragraphs longer than 10 sentences, which are much more challenging for order prediction. In addition to the three metrics adopted in the previous sections, here we also utilize two other metrics, Longest Common Subsequence (LCS) and Rouge-S, for a more comprehensive comparison. Table 8 reports the results on the NIPS and AAN datasets. BERSON significantly outperforms B-TSort on all the metrics, showing more than a 10% gain in accuracy score on both datasets. Besides, the results of the PMR score indicate that it is difficult for B-TSort to exactly match the orders of all the sentences, while our BERSON consistently shows good potential on these longer paragraphs, which demonstrates the stronger ability of BERSON in modeling long-range dependencies across sentences.

Conclusion
In this work, we develop a new BERT-enhanced Relational Sentence Ordering Network (BERSON) by integrating BERT with the pointer network for better coherence modeling. In particular, a novel Relational Pointer Decoder is developed to incorporate the relative ordering information into the pointer network with a Deep Relational Module, which leverages BERT to fully exploit the pairwise relationships between sentences to help generate an ordered sequence. The experiments on six datasets demonstrate the superiority of BERSON over the baselines; it achieves state-of-the-art performance across the datasets.
the order predictions. As we see, in the first example, BERSON exactly matches the orders of all the sentences while all the baseline methods make some incorrect order predictions. For the second input paragraph, BERSON correctly predicts the order of most of the sentences, which also shows a better performance than the competing models.
A.2 Two other metrics used in the Analysis
Longest Common Subsequence (LCS): It calculates the percentage of the longest correct subsequence shared between the predicted order and the gold order (Gong et al., 2016). Consecutiveness is not required. Rouge-S: This metric (Gong et al., 2016) measures the fraction of pairs of sentences whose predicted relative order is the same as in the ground-truth order. It allows arbitrary gaps between two sentences as long as their relative order is correctly identified. A higher score is better for both metrics.
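The two metrics can be sketched as follows (orders are lists of sentence indices; an illustrative implementation consistent with the definitions above, not the paper's evaluation code):

```python
from itertools import combinations

def lcs_ratio(pred, gold):
    """Longest common (not necessarily consecutive) subsequence between the
    predicted and gold orders, as a fraction of paragraph length."""
    n, m = len(pred), len(gold)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if pred[i - 1] == gold[j - 1] \
                       else max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m] / m

def rouge_s(pred, gold):
    """Fraction of sentence pairs (skip-bigrams) whose relative order in the
    prediction matches the gold order, allowing arbitrary gaps."""
    pred_rank = {s: i for i, s in enumerate(pred)}
    pairs = list(combinations(gold, 2))
    correct = sum(1 for a, b in pairs if pred_rank[a] < pred_rank[b])
    return correct / len(pairs)
```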

A.3 Discussion of Topic Shift Problem
BERSON captures both global and local coherence among sentences, which is effective in reorganizing texts with multiple topics. In particular, the paragraph encoder is able to model the global topic information for all the sentences, which helps guide the order prediction process for the decoder. In addition, building upon the Next Sentence Prediction pre-training objective of BERT, the Deep Relational Module captures the local dependency relationship between each pair of sentences and identifies the tight semantic connections for sentence ordering, especially for sentences that contain topic shift clues and act as a link between the preceding and the following topics. Further, the Relational Pointer Decoder leverages the topical context flowing both from the previously predicted sequence and from the unsorted sentences to generate an accurate order prediction for these topic-linking sentences and their neighbors. Therefore, BERSON is capable of generating a logically consistent output sequence for texts, including those with topic shifts.

A.4 Further Experimental Results
For the more detailed experimental results,