Topic-Guided Coherence Modeling for Sentence Ordering by Preserving Global and Local Information

We propose a novel topic-guided coherence modeling (TGCM) approach for sentence ordering. Our attention-based pointer decoder directly utilizes sentence vectors in a permutation-invariant manner, without compressing them into a single fixed-length vector as the paragraph representation. Thus, TGCM can improve global dependencies among sentences and preserve relatively informative paragraph-level semantics. Moreover, to predict the next sentence, we capture topic-enhanced sentence-pair interactions between the current predicted sentence and each next-sentence candidate. With coherent topical context matching, we promote local dependencies that help identify the tight semantic connections for sentence ordering. The experimental results show that TGCM outperforms state-of-the-art models from various perspectives.


Introduction
Modeling the coherence among sentences to compute their gold order is one of the fundamental tasks in Natural Language Processing (NLP), with many applications such as document modeling (Narayan et al., 2018a), extractive document summarization (Jadhav and Rajan, 2018; Nallapati et al., 2017), question answering (Liu et al., 2017), conversational analysis (Zeng et al., 2018), automated text generation (Guo et al., 2018), and image captioning (Anderson et al., 2018). Coherence helps readers improve reading comprehension and better understand the intent of a document. Sentence ordering is a set-to-sequence problem, which aims to identify the correct order of a sentence set. To do this, various studies on sentence ordering typically combine coherent features extracted from sentences.
In recent years, most of the traditional approaches to sentence ordering are designed based on a pairwise strategy (Li and Jurafsky, 2016). The sentence pair ordering (SPO) models determine the relative order within a sentence pair via neural network based semantic matching, which computes the relevance between the two sentence vectors. However, such models suffer from combinatorial optimization problems because search algorithms (e.g., beam search) are necessary to find the optimal permutation. Since the SPO models only focus on sentence-pair interactions (i.e., local dependency), they have trouble capturing the interactions among three or more input sentences (i.e., global dependency) in an entire paragraph.
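To make the combinatorial search concrete, the following sketch assembles an order from pairwise scores with a beam search. Here `order_by_pairwise_scores` and `toy_score` are hypothetical stand-ins for a trained SPO scoring model, not any published system.

```python
def order_by_pairwise_scores(n, pair_score, beam_width=3):
    """Beam search over orderings, scoring each appended sentence
    against its predecessor with a pairwise coherence function."""
    beams = [((i,), 0.0) for i in range(n)]  # (partial order, score)
    for _ in range(n - 1):
        candidates = []
        for order, score in beams:
            for j in range(n):
                if j not in order:
                    candidates.append((order + (j,),
                                       score + pair_score(order[-1], j)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the top partial orders
    return list(beams[0][0])

# Toy pairwise scores that prefer the order 0 -> 1 -> 2 -> 3.
def toy_score(a, b):
    return 1.0 if b == a + 1 else 0.0
```

Even with a beam, the search space grows factorially with the number of sentences, which is exactly the combinatorial burden the SSO models described next avoid.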
More recently, state-of-the-art models aim to put randomly sorted sentences into a coherent paragraph with the correct order so that the sentence sequence as a whole has the highest coherence probability (Vinyals et al., 2015a; Gong et al., 2016; Logeswaran et al., 2018; Cui et al., 2018). Unlike the SPO models, these models can perform the sentence set ordering (SSO) task for an entire paragraph based on the pointer network (Vinyals et al., 2015b). Hierarchical RNN networks consisting of sentence and paragraph encoders take unordered sentences as input and build a paragraph-level vector representation, which serves as a semantic summary of the input sentences. Then, a pointer network based decoder fetches the paragraph vector and iteratively outputs sentences in the correct order. Since the output sentences are taken from the input sentences, this avoids the combinatorial optimization problems that the SPO models suffer from. Also, the SSO models can capture the global dependency among input sentences via the paragraph vector.
Despite the successes of the SSO models, there still exist severe limitations as shown in Figure 1.
• [L1] The conventional pointer decoders for the SSO task only utilize the last hidden state at the end of the paragraph encoders. The encoder always compresses the sentence information into a single fixed-length paragraph vector, no matter how many sentences are in the paragraph. Thus, they may struggle with the bottleneck problem, where important information between the encoder and the decoder is shrunk. Especially, the more sentences there are, the more difficult it is to preserve the global information of the paragraph.
• [L2] The attention layer repeatedly decides the next sentence from sentence-pair interactions between the current predicted sentence and each paragraph-independent candidate {s_1, ..., s_5}. Thus, they do not elaborately utilize the context of the previously predicted sentences {s_1, s_2, s_3} or the context of s_4 conditioned on the context of the paragraph (i.e., the shuffled sentences {s_2, s_4, s_5, s_1, s_3}), where these coherent contexts constitute the local information that helps determine the tight s_3-s_4 interactions.
To address the above limitations, this paper proposes a novel Topic-Guided Coherence Modeling (TGCM) for sentence ordering by capturing local and global dependencies among sentences. Specifically, TGCM is composed of two major components: topic-sensitive sentence encoder and attentive pointer decoder.
To complement the structural limitations of existing RNN-based pointer decoders, which sequentially decode one paragraph vector, our attentive pointer decoder relies entirely on the attention mechanism (Vaswani et al., 2017), without any recurrent units or convolutions. Since our decoder directly receives the set of sentence vectors regardless of their input order via attention, our encoder is free from the constraint that all sentences must be compressed into a single fixed-length vector as a paragraph representation. As a result, the preservation of global information improves the global dependencies among sentences and provides relatively informative paragraph-level semantics, which addresses the bottleneck problem. Moreover, TGCM can better preserve the fine-grained word/sentence-level semantics from the encoder to the decoder.
Figure 1: Pointer decoder for sentence ordering. Based on the paragraph vector encoded from the shuffled sentences, the content-based attention layer selects the next sentence s_4 following s_3 from sentence-pair interactions between the previously predicted s_3 and each candidate {s_1, ..., s_5}.

Instead of paragraph-independent next-sentence candidates, the topic-sensitive sentence encoder enriches each candidate sentence with its topical context conditioned on the paragraph context via topic modeling based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003). For each position of the decoder, TGCM also incorporates the topical context flow of the previously predicted sentence sequence into the current predicted sentence. It then predicts the next sentence from topic-enhanced sentence-pair interactions with the coherent topical context as local information. Consequently, the preservation of local information promotes the local dependencies between the current predicted sentence and each next-sentence candidate, resulting in identifying the strongest semantic connections for ordering a shuffled paragraph.

Related Work
The key idea of sentence modeling is to embed each sentence into a continuous vector space by combining word vectors with recurrent neural networks (RNNs) (Mikolov et al., 2010), which capture long-term dependencies between words, or convolutional neural networks (CNNs) (Kim, 2014), which capture important locally invariant context.

Sentence Pair Modeling: To identify the relationship between two given sentences, sentence pair modeling learns a neural network based semantic matching function, which extracts their task-specific features. As a famous example, question answering measures the relevance between question-answer pairs with a matching function and ranks candidate answer sentences (Severyn and Moschitti, 2016; Liu and Huang, 2016). Recent models further measure the semantic similarity more elaborately by using multi-attention mechanisms (Tan et al., 2018; Tay et al., 2018) and a word-level similarity matrix (Pang et al., 2016).
Another example is the SPO task. Some studies compute a similarity score for each sentence pair independently and, after pairwise scoring, use a beam search algorithm to find an optimal order for the input sentences. Others develop a pairwise scoring model to organize unordered image-caption pairs. However, such models face the combinatorial optimization problem of finding the optimal permutation. Moreover, they cannot capture paragraph-level contextual information, which indicates the global dependency of an entire paragraph.
Sentence Set Modeling: To solve the above issues, the pointer network (Vinyals et al., 2015b) is proposed based on the sequence-to-sequence model (Sutskever et al., 2014). The pointer network consists of encoding input tokens to a summary vector, decoding next token vectors iteratively via content-based attention over input token vectors, and producing the output token sequence from the output token vectors. Inspired by the pointer network, state-of-the-art SSO models (Vinyals et al., 2015a;Gong et al., 2016;Logeswaran et al., 2018) usually employ hierarchical RNN-based encoders to produce a paragraph vector from unordered sentences. Then, the pointer network based decoders predict the correct sentence sequence. However, the paragraph vector depends on the permutation of input sentences.
To address this issue, ATTOrderNet (Cui et al., 2018) employs self-attention at the encoder to capture global dependencies regardless of the input sentence order. However, similar to traditional models, it also compresses the sentence vectors into a single fixed-length vector via average pooling. While our TGCM is also permutation-agnostic to the input sentences, we feed the sentence vector set directly into the attentive pointer decoder and capture the coherent topical context in sentence-pair interactions via topic modeling. Thus, unlike ATTOrderNet, TGCM can simultaneously improve both local and global dependencies.
Topic-Aware Sentence Modeling: The commonly used topic model is based on LDA, which extracts the latent topic vectors (i.e., topic distributions) of words, sentences, and paragraphs from a training corpus. Given sentences in a paragraph, their topic latent vectors help in capturing global topical context for the paragraph and local topical context for each sentence. Therefore, topic modeling is used in various research fields requiring sentence modeling (Narayan et al., 2018b;Dieng et al., 2016;Gong et al., 2018). LTMF (Jin et al., 2018) is a context-aware recommender system, which combines LSTMs and topic modeling before applying matrix factorization. They extract the global context information related to words in a user review via topic distributions.

The Proposed Model
In this section, we present TGCM, a novel topic-guided coherence modeling approach for SSO. TGCM can address the above-mentioned limitations [L1] and [L2] simultaneously. We first build a topic-distribution generating function via topic modeling. Then, we describe the two major components: the topic-sensitive sentence encoder and the attentive pointer decoder. The encoder leverages the topic distributions of a paragraph and its sentences. Then, the decoder directly utilizes them in a permutation-agnostic manner and determines the correct order of the randomly sorted sentences.

Problem Definition
The primary goal of sentence set ordering is to put an unordered set of sentences into a coherent paragraph in the correct order. Specifically, given a paragraph p, the correct sentence sequence and its order are denoted by S_p = [s_1, s_2, ..., s_{|p|}] and O_p = [o_{s_1}, o_{s_2}, ..., o_{s_{|p|}}], respectively, where |p| denotes the number of sentences in p.
Given a shuffled sentence sequence, our TGCM outputs a sentence sequence Ŝ_p whose order is denoted by Ô_p. The objective of coherence modeling is to make the coherence probability for the predicted order Ô_p approximate the coherence probability for the correct order O_p:

P(Ô_p | p) ≈ P(O_p | p) = max_O P(O | p),

where P(O_p | p) and P(Ô_p | p) denote the coherence probabilities for O_p and Ô_p, respectively.

Topic Latent Vectors
Given the number of hidden topics, denoted by d_t, the preprocessing of TGCM trains a topic model on a given corpus, as shown in Figure 2. The topic model builds a generating function topicDistribution(·) based on the probability distribution over topics for each word. At test time, topicDistribution(·) infers the d_t-dimensional topic latent vector t_doc (i.e., topic distribution) of a new, unseen document doc, such as a sentence or a paragraph. To do this, we utilize LDA (Blei et al., 2003), the simplest and most popular algorithm for topic modeling. The topic latent vectors of sentences and paragraphs indicate the sentence-level (local) and the paragraph-level (global) topical context via hidden topics, respectively.
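A minimal sketch of the topicDistribution(·) interface follows; the word-topic table and the averaging rule are toy stand-ins for a trained LDA model and its posterior inference (a real pipeline would use an LDA implementation such as Gensim's).

```python
import numpy as np

# Toy word-topic probabilities standing in for a trained LDA model
# (d_t = 2 hidden topics). Purely illustrative, not trained values.
WORD_TOPIC = {
    "neural":  np.array([0.9, 0.1]),
    "network": np.array([0.8, 0.2]),
    "protein": np.array([0.1, 0.9]),
    "cell":    np.array([0.2, 0.8]),
}

def topic_distribution(doc_tokens, d_t=2):
    """Infer a d_t-dimensional topic latent vector for a new document
    by averaging its words' topic probabilities -- a crude stand-in
    for LDA posterior inference over an unseen document."""
    known = [WORD_TOPIC[w] for w in doc_tokens if w in WORD_TOPIC]
    if not known:
        return np.full(d_t, 1.0 / d_t)   # uniform fallback for empty docs
    t_doc = np.mean(known, axis=0)
    return t_doc / t_doc.sum()           # normalize to a distribution

t_sentence = topic_distribution(["neural", "network"])
```

The same function serves both sentences and paragraphs: a paragraph's tokens simply span all of its sentences, which is how the global (t_p) and local (t_{s_i}) topic vectors share one inference routine.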

Topic-Sensitive Sentence Encoder
Attention-Based bi-LSTM Layer. We leverage an extended version of LSTMs (Hochreiter and Schmidhuber, 1997) as our base model in Figure 3, which overcomes the vanishing-gradient problem of standard RNNs. In particular, attention-based bidirectional LSTMs (Att-BLSTMs) (Zhou et al., 2016) are widely used in sentence modeling.
In this layer, we aim to produce the sentence vector s from a given sentence s consisting of a word sequence [w_1, w_2, ..., w_n]. Each word w_i is encoded into a word embedding w_i ∈ R^{d_w}. Att-BLSTMs consist of two sub-networks with forward and backward LSTMs, which take the word embeddings [w_1, w_2, ..., w_n] and output sequences of forward and backward hidden states:

→h_i = LSTM_fw(w_i, →h_{i−1}; M_t), ←h_i = LSTM_bw(w_i, ←h_{i+1}; M_t), (1)

where M_t is a set of learnable parameters. Then, we combine the forward and backward hidden states by an element-wise sum:

h_i = →h_i + ←h_i. (2)

Thus, the hidden state vectors H = [h_1, h_2, ..., h_n] ∈ R^{d_w×n} are obtained from the previous LSTM layers and fed into an attention layer:

α = softmax(m_1^T tanh(H)), (3)
s = H α^T, (4)

where m_1 ∈ R^{d_w} and α ∈ R^n denote a weight vector and the resulting attention weights, respectively. s ∈ R^{d_w} is the final sentence vector of s, computed as a weighted sum of the hidden state vectors with the attention weights.
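Assuming the bi-LSTM hidden states H are already computed, the attention pooling that produces the sentence vector can be sketched as follows; the dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, m1):
    """Pool hidden states H (d_w x n) into a single sentence vector:
    score each word position with the learnable vector m1 (d_w,),
    normalize with softmax, and take the weighted sum of columns."""
    alpha = softmax(m1 @ np.tanh(H))   # attention weights over n words
    return H @ alpha                   # sentence vector s in R^{d_w}

rng = np.random.default_rng(0)
H = rng.standard_normal((200, 8))      # d_w = 200, n = 8 word positions
m1 = rng.standard_normal(200)
s = attention_pool(H, m1)
```

Note that the pooled vector stays in R^{d_w}, so it can be mixed directly with the topic-derived features introduced in the next subsection.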
Topic-Sensitive Sentence Vectors. As shown in Figure 3, our topic-sensitive sentence encoder takes a shuffled paragraph p as input, which consists of unordered sentences {s_1, ..., s_{|p|}}. We create sentence vectors s_i for all sentences s_i with the attention-based bi-LSTMs. In addition, topic latent vectors t_p ∈ R^{d_t} and t_{s_i} ∈ R^{d_t} are generated by topicDistribution(·) for p and s_i, where d_t denotes the number of latent topics.
During the encoding of TGCM, we multiply t_{s_i} and t_p for each sentence s_i to weight the coherent latent topics that appear simultaneously in both the sentence and the paragraph. The topic latent vector t_{s_i} captures how topical a sentence s_i is in itself (local context), whereas the topic latent vector t_p represents the overall theme of a paragraph p (global context). Thus, the encoder can extract the paragraph-level context relating to each sentence by enriching the context of the sentence with its topical relevance to the paragraph as follows:

(t_p ⊗ t_{s_i}) ∈ R^{d_t}; s_i ∈ {s_1, s_2, ..., s_{|p|}}, (5)

where the operation ⊗ indicates an element-wise multiplication. To combine t_p ⊗ t_{s_i} with the sentence vector s_i, we apply two linear transformations with a ReLU activation to each t_p ⊗ t_{s_i} separately. Then, the transformed vectors are added to the corresponding sentence vectors as follows:

s̃_i = s_i + W_2 ReLU(W_1 (t_p ⊗ t_{s_i})), (6)

where s̃_i ∈ R^{d_w} denotes a topic-sensitive sentence vector. Thus, the global topical context relating to the local topical context of each sentence is incorporated into the corresponding sentence vector. This helps our pointer decoder guide the semantic connection between a predicted sentence and its next sentence by considering the coherent topical context flow.
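The topic-sensitive enrichment can be sketched as below; the weight shapes W_1 ∈ R^{d_w×d_t} and W_2 ∈ R^{d_w×d_w}, and the ReLU placed between the two linear transformations, are illustrative assumptions.

```python
import numpy as np

d_t, d_w = 4, 6                        # toy topic and sentence dimensions
rng = np.random.default_rng(1)
W1 = rng.standard_normal((d_w, d_t))   # first linear transformation
W2 = rng.standard_normal((d_w, d_w))   # second linear transformation

def topic_sensitive(s_vec, t_p, t_s):
    """Enrich a sentence vector with coherent topical context: weight
    the topics shared by sentence and paragraph (element-wise product),
    project them to d_w with two linear layers and a ReLU, and add the
    result to the original sentence vector."""
    shared = t_p * t_s                           # element-wise topic product
    projected = W2 @ np.maximum(W1 @ shared, 0.0)
    return s_vec + projected                     # topic-sensitive vector

s_vec = rng.standard_normal(d_w)
t_p = np.array([0.4, 0.3, 0.2, 0.1])   # paragraph topic distribution
t_s = np.array([0.7, 0.1, 0.1, 0.1])   # sentence topic distribution
s_tilde = topic_sensitive(s_vec, t_p, t_s)
```

A useful sanity check of the residual form: when the sentence and paragraph share no topic mass, the product is zero and the sentence vector passes through unchanged.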

Attentive Pointer Decoder
As shown in Figure 4, we build the attentive pointer decoder based on the decoder of Transformer (Vaswani et al., 2017) and the pointer network (Vinyals et al., 2015b). The pointer network determines the next token vectors iteratively via content-based attention over the input token vectors. For coherent topical context matching, our decoder is mainly composed of a stack of n attention modules relying entirely on attention, which has attracted interest as a promising technique in many sequence-based tasks (Bahdanau et al., 2014).
Existing SSO models that utilize RNN-based decoders cannot pass the encoded sentence vectors along, because their decoders sequentially decode a single paragraph vector. Attention mechanisms allow each sentence in a different position to link to other sentences regardless of the order and number of input sentences. Thus, our attention-based pointer decoder directly utilizes the topic-sensitive sentence vectors {s̃_2, s̃_{|p|}, s̃_3, ..., s̃_1} in a permutation-agnostic manner (global information). This allows our model to avoid the bottleneck problem, which dilutes word/sentence semantics between the encoder and the decoder, and to fully utilize those semantics for sentence ordering.
Moreover, our attentive pointer decoder can capture topic-enhanced sentence-pair interactions between the current predicted sentence and each of the encoder outputs {s̃_2, s̃_{|p|}, s̃_3, ..., s̃_1} as next-sentence candidates. With the coherent topical context (local information) matching, we promote local dependencies for identifying the tight semantic connections for sentence ordering.

Figure 4: Architecture of the attentive pointer decoder. The decoder fetches permutation-agnostic topic-sensitive sentence vectors {s̃_2, s̃_{|p|}, s̃_3, ..., s̃_1} from the encoder. For coherent topical context modeling, at the 3rd step, a stack of n attention modules position-wisely takes {→, s_1, s_2} and newly created topic latent vectors {t_∅, t_{s_1}, t_{s_1,s_2}}, where t_∅ denotes a zero vector. The topic vectors are generated by incrementally including the previously predicted {s_1, s_2}. The attention layer outputs a probabilistic distribution over the encoder outputs {s̃_2, s̃_{|p|}, s̃_3, ..., s̃_1} and selects the next sentence s_3.
Coherent Topical Context Matching. Our attentive pointer decoder employs a stack of attention modules identical to the decoder of Transformer (Vaswani et al., 2017) for the coherent topical context matching. Given query Q ∈ R^{n×d_m}, key K ∈ R^{n×d_m}, and value V ∈ R^{n×d_m}, the attention mechanism computes the output matrix Out_att ∈ R^{n×d_m} from the value matrix V with an attention weight α as follows:

Out_att = αV, α = softmax(QK^T / √d_m),

where α is calculated from the query-key pair by the scaled dot product. Note that self-attention is the case where the query, key, and value matrices are all the same. Masked attention, a variant of attention, masks out all positions after the current position by setting a large negative value (−∞) inside the softmax function. First, for the coherent topical context matching, we position-wisely feed the previously predicted sentence vectors and their newly created topic latent vectors to the masked multi-head self-attention sub-layer. Here, each topic latent vector is generated by topicDistribution(·) from the predicted sentences at all positions before the current position. Thus, we incorporate the topical context flow of the previously predicted sentence sequence into the corresponding predicted sentence vector.
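A sketch of the masked scaled dot-product self-attention described above (single head, Q = K = V = X, with −∞ approximated by a large negative constant); the dimensions are illustrative.

```python
import numpy as np

def masked_self_attention(X, d_m):
    """Scaled dot-product self-attention with a causal mask:
    position i may only attend to positions <= i."""
    n = X.shape[0]
    scores = (X @ X.T) / np.sqrt(d_m)            # query-key scaled dot product
    mask = np.triu(np.ones((n, n)), k=1).astype(bool)
    scores[mask] = -1e9                          # mask out future positions
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X, weights                  # Out_att and alpha

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 16))                 # 5 decoder positions, d_m = 16
out, w = masked_self_attention(X, 16)
```

In the full model this would be multi-head with learned projections for Q, K, and V; the sketch keeps only the masking and scaling that distinguish the decoder's first sub-layer.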
After a residual connection and layer normalization (Lei Ba et al., 2016), the resulting vectors are injected as the query matrix Q into the multi-head attention sub-layer. The sub-layer also takes the permutation-agnostic topic-sensitive sentence vectors obtained from our encoder as the key and value matrices K and V. We follow Transformer for the remaining coherent topical context matching process, including the multi-head attention strategy. As a result, with the multiple attention modules, TGCM draws global dependencies among sentences by attending over the topic-sensitive sentence vectors repeatedly.
Probabilistic Distribution for Ordering. At the ith step, the attention module of the decoder takes the previously predicted sentences {s̃_1, ..., s̃_{i−1}} and s̃_0 for the token "→" as inputs. After repeating the attention module n times, we then utilize the d_w-dimensional output vector c_i corresponding to the last decoder position. As shown in Figure 4, at the 3rd step, the output vector c_3 of the 3rd position is fed into the final attention layer, as in conventional pointer decoders.
In the final attention layer, for the output vector c_i of the attention modules, TGCM produces an output distribution over the topic-sensitive sentence vectors {s̃_1, s̃_2, ..., s̃_{|p|}} obtained from the encoder via content-based attention as follows:

u^i_j = v_a^T tanh(W_a [s̃_j ; c_i]), P(ô_i = j | ô_1, ..., ô_{i−1}, p) = softmax(u^i)_j,

where ô_i denotes the order of the ith position, and W_a ∈ R^{d×2d} and v_a ∈ R^d denote a weight matrix and a weight vector, respectively, which are shared across all positions. Thus, we can select the correct next sentence s_i that yields the highest probability from the output distribution. Finally, the selected sentence vector s̃_i is fed into the decoder along with the previously selected sentences {s̃_1, s̃_2, ..., s̃_{i−1}} and the newly created topic latent vectors {t_∅, t_{s_1}, t_{s_1,s_2}, ..., t_{s_1,...,s_i}} as inputs at the (i+1)th step.
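The content-based pointer attention over the candidate sentence vectors can be sketched as follows; all weights are random placeholders and the dimensions are illustrative.

```python
import numpy as np

def pointer_distribution(c_i, S_tilde, W_a, v_a):
    """Content-based pointer attention: score each candidate sentence
    vector against the decoder output c_i with a tanh-bilinear form,
    then normalize the scores into a distribution over sentences."""
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_j, c_i]))
                       for s_j in S_tilde])
    e = np.exp(scores - scores.max())
    return e / e.sum()                       # softmax over |p| candidates

rng = np.random.default_rng(3)
d = 8
S_tilde = rng.standard_normal((5, d))        # 5 topic-sensitive sentence vectors
c_i = rng.standard_normal(d)                 # decoder output at step i
W_a = rng.standard_normal((d, 2 * d))        # shared across positions
v_a = rng.standard_normal(d)
probs = pointer_distribution(c_i, S_tilde, W_a, v_a)
next_idx = int(np.argmax(probs))             # pointer to the next sentence
```

Because the output is an index into the input set, the decoder can only ever emit sentences it was given, which is what frees SSO models from a combinatorial search over external permutations.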
Training. For each correct sentence sequence, we sample 5 shuffled sentence sequences at the training phase. Following typical sentence ordering models, we train the parameters of our model to maximize the coherence probability by minimizing the negative log-likelihood loss:

L(Θ) = − Σ_{p∈D} Σ_{i=1}^{|p|} log P(ô_i = o_{s_i} | ô_1, ..., ô_{i−1}, p; Θ) + λ‖Θ‖²,

where Θ and λ denote the set of trainable parameters and a regularization parameter, respectively.

Experimental Setup
The parameters of TGCM were tuned by the RMSProp optimizer, which adaptively adjusts the learning rate for each parameter and resolves the problem of Adagrad (Duchi et al., 2011), where the learning rate decreases or increases radically. All experiments were implemented in Python using TensorFlow (Abadi et al., 2016), which supports GPU-accelerated deep learning. We also utilized the Natural Language Toolkit (NLTK) (Loper and Bird, 2002) for sentence tokenization and data preprocessing, and Gensim for LDA-based topic modeling. The word embeddings were initialized with pre-trained 200-dimensional GloVe vectors (Pennington et al., 2014), i.e., d_w = 200. We trained our LDA-based topic model for each dataset on its training, validation, and test corpora.

Datasets
We first conducted our experiments on 4 commonly used abstract datasets: NIPS, AAN, NSF, and arXiv, which contain abstracts of research papers collected from NIPS papers, the ACL Anthology Network (AAN) corpus, NSF Research Award abstracts, and the arXiv website, respectively. Abstracts have high-quality logical consistency, which helps coherence modeling for sentence ordering. We further considered the Sequential Image Narrative Dataset (SIND), which consists of personal multimodal stories of five sentences each. We did not utilize the accident and earthquake datasets (Barzilay and Lapata, 2008) because they are too small and do not provide a validation set. Following the setup in (Logeswaran et al., 2018), we split the undivided datasets into training, validation, and test sets. Table 1 shows the details of the 5 datasets used in our experiments.

Evaluation Metrics
Following the evaluation metrics widely used in previous SPO and SSO models, we adopted three metrics to assess sentence ordering performance: Kendall's tau, perfect match ratio, and positional accuracy.
Kendall's tau (τ): Kendall's tau (Lapata, 2003, 2006) measures the ordinal association between two sequences to automatically evaluate coherence modeling as follows:

τ = (1 / |D|) Σ_{p_i ∈ D} ( 1 − 2 · inversions(O_{p_i}, Ô_{p_i}) / ( |p_i|(|p_i| − 1) / 2 ) ),

where |D| and |p_i| denote the total number of paragraphs in a test dataset D and the number of sentences in a paragraph p_i, respectively. The function inversions(O_{p_i}, Ô_{p_i}) returns the number of sentence-pair interchanges needed to reconstruct the correct order O_{p_i} from the predicted order Ô_{p_i}. The value ranges from −1 to 1, and a higher value indicates better performance. This metric closely correlates with user ratings and reading times, which relate to readability.
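Per paragraph, Kendall's tau can be computed from the inversion count as follows (the dataset-level score is simply the average over paragraphs).

```python
def kendall_tau(correct, predicted):
    """Kendall's tau between a correct and a predicted sentence order,
    computed from the number of pairwise inversions."""
    n = len(correct)
    rank = {s: i for i, s in enumerate(correct)}   # correct position per item
    seq = [rank[s] for s in predicted]
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if seq[i] > seq[j])            # out-of-order pairs
    return 1.0 - 2.0 * inversions / (n * (n - 1) / 2)
```

The quadratic inversion count is fine at paragraph scale (a handful of sentences); a merge-sort based count would bring it to O(n log n) for longer sequences.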
Perfect Match Ratio (PMR in %): The perfect match ratio is the ratio of exactly matching orders across all predicted paragraphs as follows:

PMR = (1 / |D|) Σ_{p_i ∈ D} I(O_{p_i} = Ô_{p_i}) × 100,

where I(O_{p_i} = Ô_{p_i}) denotes the indicator function, which returns 1 if the correct sentence order O_{p_i} and the predicted sentence order Ô_{p_i} are identical and 0 otherwise.
Positional Accuracy (PAcc in %): The positional accuracy is defined as the ratio of matched sentence-level absolute positions between the predicted and correct orders as follows:

PAcc = ( Σ_{p_i ∈ D} Σ_{s_j ∈ p_i} I(o_{s_j} = ô_{s_j}) / Σ_{p_i ∈ D} |p_i| ) × 100,

where o_{s_j} denotes the absolute position of sentence s_j in the correct sequence S_{p_i}, and ô_{s_j} denotes its absolute position in the predicted sequence Ŝ_{p_i}. I(o_{s_j} = ô_{s_j}) denotes the indicator function equal to 1 if o_{s_j} = ô_{s_j} and 0 otherwise.
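Both ratio metrics can be sketched as below; the two-paragraph inputs are toy examples.

```python
def perfect_match_ratio(correct_orders, predicted_orders):
    """Percentage of paragraphs whose predicted order matches exactly."""
    matches = sum(1 for o, o_hat in zip(correct_orders, predicted_orders)
                  if o == o_hat)
    return 100.0 * matches / len(correct_orders)

def positional_accuracy(correct_orders, predicted_orders):
    """Percentage of sentences placed at their correct absolute position,
    pooled over all paragraphs."""
    hits = total = 0
    for o, o_hat in zip(correct_orders, predicted_orders):
        hits += sum(1 for a, b in zip(o, o_hat) if a == b)
        total += len(o)
    return 100.0 * hits / total

correct = [[0, 1, 2], [0, 1, 2, 3]]
predicted = [[0, 1, 2], [0, 2, 1, 3]]
pmr = perfect_match_ratio(correct, predicted)    # one of two paragraphs exact
pacc = positional_accuracy(correct, predicted)   # 5 of 7 positions correct
```

PMR is the strictest of the three metrics (one misplaced sentence zeroes a paragraph), while PAcc and Kendall's tau give partial credit, which is why the three are reported together.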

Hyperparameters
For LDA-based topic modeling, we decided the hyperparameters with a grid search on each dataset. Following Griffiths and Steyvers (2004), we kept α = 0.1 and β = 50/d_t constant and obtained the best results with d_t = 300 for all abstract datasets and d_t = 200 for SIND. We initially configured d_w = 200 with the GloVe vectors and updated the word vectors during training. For our attentive pointer decoder, we followed the same hyperparameters as in Transformer and used a learning rate of 0.01. Other parameters were initialized randomly based on He et al. (2015).

Experimental Results
This section reports experimental results on the sentence ordering task, i.e., determining a coherent order for a given set of sentences. The proposed TGCM was compared with state-of-the-art baselines: the Pairwise Model, Seq2seq (Logeswaran et al., 2018), RNN Decoder (Logeswaran et al., 2018), V-LSTM+PtrNet (Logeswaran et al., 2018), CNN+PtrNet (Gong et al., 2016), LSTM+PtrNet (Gong et al., 2016), and ATTOrderNet (Cui et al., 2018). Except for the random model, all of the baselines are based on neural networks, which are typically more competitive than traditional approaches (e.g., those utilizing handcrafted features). For the LDA-based topic model in our preprocessing, we obtained the topic distributions of words by learning their relative importance for each topic, and inferred the topic latent vectors of new documents at test time. Given a shuffled sentence sequence, the main goal is to find the most coherent sentence sequence, i.e., to make the coherence probability of the predicted sentence sequence approximate that of the correct sentence sequence. In Tables 2-3, several values are taken directly from (Cui et al., 2018); we ran the remaining models using their public code under the same experimental setup, and reimplemented the methods whose code is not released. For the latest model, ATTOrderNet, we only utilized the highest-performing of its three versions.
We discuss the reason for our highest performance among previous models by classifying the factors affecting performance into four categories: sentence set modeling, topic latent vectors, permutation invariance, and attention at decoder.

Impact of Sentence Set Modeling
This factor relates to the global dependencies of an entire paragraph. We observed that the performances of the random model and the pairwise model were the worst. The pairwise model only learns the relative order from sentence-pair interactions. Since the two models do not consider all sentences at once, they cannot leverage paragraph-level information; in other words, they cannot capture the global dependencies that help sentence ordering.

Impact of Topic Latent Vectors
Furthermore, we evaluated a variant of TGCM, referred to as TGCM-S, whose encoder does not combine topic latent vectors with the original sentence vectors and whose decoder does not take the newly created topic latent vectors. This allows us to explore whether the topic latent vectors actually help the sentence ordering task. Since TGCM-S cannot perform coherent topical context matching, it does not elaborately capture the local dependencies that help identify tight semantic connections between the current predicted sentence and each next-sentence candidate.
As shown in Tables 2-3, the full TGCM outperformed TGCM-S, which employs only attention-based bi-LSTMs without topic modeling. Although TGCM-S does not utilize topical context features at the encoder and decoder (such as t_p and t_s), it still employs the Transformer-based attentive pointer decoder. Thus, we found that topic modeling additionally contributes to modeling sentences through the global topical context relating to the local topical context of each sentence, thereby improving the sentence ordering performance.

Impact of Permutation Invariance
Compared to ATTOrderNet and our TGCM, the performance of other models was relatively low.
The conventional models typically adopt hierarchical RNN-based encoders, which combine sentence vectors sequentially and generate a paragraph-level representation as a single vector. Since these encoders depend on the permutation of the input sentences, their paragraph-level representations are not reliable.
On the other hand, the self-attention-based ATTOrderNet (on the encoder side) and our attention-based TGCM/TGCM-S (on the decoder side) use a set of sentence representations in a permutation-invariant manner rather than a single vector to represent a paragraph. Thus, these models can capture global dependencies regardless of the order of the input sentences, and their encoder outputs are more informative and reliable for improving sentence ordering performance.

Impact of Attention at Decoder
This factor relates to solving the bottleneck problem by enhancing the expressiveness of the decoder's input and capturing the global and local dependencies among input sentences simultaneously. Tables 2-3 show that ATTOrderNet and TGCM, which utilize an attention mechanism, perform much better than all the other models. With an encoder employing a self-attention mechanism, ATTOrderNet can capture global dependencies with a reliable paragraph-level representation regardless of the input order. However, since ATTOrderNet also compresses the sentence vectors into a single fixed-length paragraph vector via an average pooling layer, the word/sentence-level semantics are diluted between the encoder and the decoder.
In contrast to ATTOrderNet, TGCM feeds the sentence vectors directly into a decoder without compressing them into a single vector. Then, the attention mechanism allows our decoder to receive and utilize the sentence vectors, regardless of the order and number of sentences. As a result, since our TGCM could elaborately capture both local and global dependencies simultaneously, TGCM showed the best sentence ordering performance among all comparative models on all datasets.

Conclusions
We propose TGCM, which better preserves local and global information for sentence ordering. Our attentive pointer decoder fully utilizes the semantics of sentence vectors without compressing them into a paragraph vector. Our sentence encoder produces topic-sensitive sentence vectors via topic modeling. With the coherent topical context matching between the current predicted sentence and each next-sentence candidate, we promote local dependencies that help identify the tight semantic connections. The empirical results on sentence ordering demonstrate that TGCM outperforms state-of-the-art models.