Context Tracking Network: Graph-based Context Modeling for Implicit Discourse Relation Recognition

Implicit discourse relation recognition (IDRR) aims to identify the logical relation between two adjacent sentences in a discourse. Existing models fail to fully utilize the contextual information, which plays an important role in interpreting each local sentence. In this paper, we therefore propose a novel graph-based Context Tracking Network (CT-Net) to model the discourse context for IDRR. The CT-Net first converts the discourse into a paragraph association graph (PAG), in which each sentence tracks its closely related context in the intricate discourse through different types of edges. Then, the CT-Net extracts contextual representations from the PAG through a specially designed cross-grained updating mechanism, which can effectively integrate both sentence-level and token-level contextual semantics. Experiments on PDTB 2.0 show that the CT-Net gains better performance than models that roughly model the context.


Introduction
Implicit discourse relation recognition (IDRR), one of the major challenges in discourse parsing, aims to identify the logical relation between two adjacent sentences in a discourse without the guidance of connectives (e.g., because, but). With the rise of deep learning, many sentence-modeling based methods (Liu and Li, 2016; Rönnqvist et al., 2017; Bai and Zhao, 2018; Xu et al., 2019; Shi and Demberg, 2019) have emerged in the field of IDRR. These methods typically focus on modeling the local semantics of the two target sentences, without considering the wider discourse context.
Contextual information plays an important role in understanding sentences. Take the paragraph $P = \{S_1, S_2, S_3, S_4\}$ in Figure 1 as an example, where the ground-truth relation between $S_3$ and $S_4$ is "Comparison":

$S_1$: The manufacturer went public at $15.75 a share in August 1987.
$S_2$: Mr. Sim's goal then was a $29 per-share price.
$S_3$: Strong earnings growth helped achieve that price far ahead of schedule.
$S_4$: The stock has since softened, trading around $25 a share.

Combining the contextual information carried by $S_1$ and $S_2$, we can more easily interpret "achieve that price" (rising: "$15.75 a share" to "$29 per-share") and "softened" (falling: "$29 per-share" to "$25 a share"), and thus identify the "Comparison" relation.

Dai and Huang (2018) take a step toward utilizing the wider discourse context: they use a hierarchical BiLSTM (H-LSTM) to model the whole paragraph rather than only the two target sentences, obtaining context-aware sentence representations. However, their model still has two limitations. First, it roughly merges all the information in the paragraph, which dilutes the role of the key context that is closely related to the current sentence. Second, the H-LSTM suffers from the long-distance forgetting problem and may fail to model long-distance and non-continuous dependencies across multiple sentences (the green lines in Figure 1).

To overcome these limitations, we propose a novel Context Tracking Network (CT-Net), which can track the essential context of each sentence in the intricate discourse, without being affected by spatial distance. The CT-Net computes contextual representations in two main steps. First, it converts the paragraph into a paragraph association graph (PAG) (Figure 1), which contains three types of edges between sentences: (1) adjacency edges (black lines), connecting adjacent sentences; (2) co-reference edges (purple lines), connecting sentences with co-reference associations; and (3) lexical chain edges (green lines), connecting sentences that contain related words. Each sentence can track its closely related context along these edges, including long-distance sentences involving the same object or topic.

Figure 2: The overall architecture of the CT-Net. Given a paragraph $P = (S_1, S_2, S_3, S_4)$, it converts $P$ into the PAG $G$, then employs the cross-grained updating mechanism on $G$ to get contextual representations for classification.

Second, the CT-Net extracts contextual representations over the PAG. To effectively incorporate the fine-grained information carried by tokens, we propose the cross-grained updating mechanism, which is executed for multiple recurrent rounds. At each round, it performs semantic exchange via three processes:

• Token-to-Sentence Updating: updating the sentence representation with its tokens to grasp fine-grained semantics.
• Sentence-to-Sentence Updating: performing interaction between sentences on the PAG to get context-aware sentence representation.
• Sentence-to-Token Updating: using the context-aware sentence representation to update tokens, so that each token can also incorporate contextual information. The obtained context-aware token representation will be used for the computation of the next round.
After multiple rounds, the CT-Net obtains contextual representations that fully combine sentence-level and token-level contextual semantics. Our main contributions are twofold.[1] First, we propose a novel CT-Net for IDRR, which builds the PAG to track the closely related context of each sentence in the intricate discourse, and incorporates multi-grained contextual semantics via the cross-grained updating mechanism. Second, experiments on PDTB 2.0 demonstrate that the CT-Net gains better performance than a variety of approaches that roughly model the discourse context.

[1] Code is available at: https://github.com/yxuezhang/CTNet

Model
The input of the CT-Net is a paragraph $P = (S_1, S_2, \dots, S_{n-1}, S_n)$. Here, $S_{n-1}$ and $S_n$ are the adjacent sentences to be classified, while $S_1, \dots, S_{n-2}$ provide context with background information. Our goal is to identify the relation between $S_{n-1}$ and $S_n$. We first build a paragraph association graph (PAG) for $P$ (Section 2.1), then employ the cross-grained updating mechanism on the PAG to extract the contextual representations of $S_{n-1}$ and $S_n$ (Section 2.2). These contextual representations are then used for the final classification (Section 2.3).

Paragraph Association Graph
The CT-Net first converts $P$ into a PAG $G = (V, E)$, where $V$ and $E$ are the sets of nodes and edges, respectively. As shown in Figure 2, the PAG contains sentence nodes (blue) and token nodes (orange). Each token node is connected with its corresponding sentence node. We carefully design the edges between sentence nodes so that each sentence only connects to the ones that are closely related to it. Specifically, there are three types of edges between sentence nodes in the PAG: • Adjacency Edge (black edges). Adjacent sentences tend to carry important contextual information. Therefore, we add adjacency edges between neighboring sentences in the discourse.
• Co-reference Edge (purple edges). Sentences with co-reference associations tend to involve the same object and be highly related, so we add a co-reference edge between them.
• Lexical Chain Edge (green edges). A lexical chain tracks related words that run through the whole paragraph. Sentences containing the same words or synonyms (excluding stop words) tend to involve the same topic; therefore, we add a lexical chain edge between them.
We give more details of the PAG in Section 3.2; a minimal construction sketch is shown below.
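For concreteness, the following Python sketch shows one way to build the sentence-level edges of a PAG with networkx. It is a minimal illustration, not our released implementation: token nodes are omitted, the stop-word list is a toy one, and coref_pairs and related are hypothetical stand-ins for the outputs of a co-reference resolver and a WordNet-based synonym test.

import itertools
import networkx as nx

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and"}  # toy list

def build_pag(sentences, coref_pairs, related):
    """sentences: list of token lists; coref_pairs: pairs (i, j) of sentence
    indices sharing a co-reference chain; related(a, b): True if two words
    match or are synonyms (e.g., via WordNet)."""
    g = nx.MultiGraph()
    g.add_nodes_from(range(len(sentences)), kind="sentence")

    # (1) adjacency edges between neighbouring sentences
    for i in range(len(sentences) - 1):
        g.add_edge(i, i + 1, rel="adjacency")

    # (2) co-reference edges between sentences sharing a co-reference chain
    for i, j in coref_pairs:
        g.add_edge(i, j, rel="coreference")

    # (3) lexical chain edges between sentences with shared/synonymous words
    for i, j in itertools.combinations(range(len(sentences)), 2):
        content_i = [w for w in sentences[i] if w.lower() not in STOP_WORDS]
        content_j = [w for w in sentences[j] if w.lower() not in STOP_WORDS]
        if any(related(a, b) for a in content_i for b in content_j):
            g.add_edge(i, j, rel="lexical_chain")
    return g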

Cross-Grained Updating Mechanism
The CT-Net then extracts the contextual representations of $S_{n-1}$ and $S_n$ from the PAG $G$ through the cross-grained updating mechanism, which is executed for $T$ rounds. At the $t$-th round, we denote the state of the $i$-th sentence node as $g^t_i$, and the state of the $j$-th token node of the $i$-th sentence as $h^t_{i,j}$. The state transition from the $(t-1)$-th to the $t$-th round consists of three computation processes: token-to-sentence updating, sentence-to-sentence updating and sentence-to-token updating. The first two processes are responsible for updating sentence nodes, while the last one updates token nodes.
Node Initialization. When $t = 0$, we initialize each token node with the concatenation of its char, GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) embeddings, and reduce the dimension:

$x_{i,j} = W \, [\, e^{char}_{i,j};\, e^{glove}_{i,j};\, e^{elmo}_{i,j} \,] + b$

where $W$, $b$ are parameters. The sentence node $g^0_i$ is initialized as the average of its token nodes.

Token-to-Sentence Updating. This process updates the sentence state $g^t_i$ with the token states of the last round $h^{t-1}_{i,j}$. We employ the Sentence-state LSTM (SLSTM) (Zhang et al., 2018) to achieve this. The SLSTM is a graph RNN that converts a sentence into a graph with one global sentence node and several local word nodes, just like the sub-graph in the PAG (inside the dotted ellipse in Figure 2). At the $t$-th round, the hidden state of the $i$-th sentence $g^t_i$ is computed as follows:

$g^t_i = \mathrm{SLSTM}_{h \to g}\big(g^{t-1}_i,\, h^{t-1}_{i,0}, \dots, h^{t-1}_{i,|S_i|}\big)$

where $\mathrm{SLSTM}_{h \to g}$ represents the process of updating the sentence state with token states via the SLSTM, and its detailed equations are shown in Appendix A. $|S_i|$ is the number of tokens in $S_i$.
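As a minimal PyTorch sketch of the initialization step (the embedding sizes here are illustrative, not the paper's settings, and padding is ignored for brevity):

import torch
import torch.nn as nn

class NodeInit(nn.Module):
    """Initialize token nodes from concatenated char/GloVe/ELMo embeddings
    and sentence nodes as the average of their token nodes."""
    def __init__(self, char_dim=64, glove_dim=300, elmo_dim=1024, hidden=256):
        super().__init__()
        # dimension reduction: x_{i,j} = W [char; glove; elmo] + b
        self.proj = nn.Linear(char_dim + glove_dim + elmo_dim, hidden)

    def forward(self, char_emb, glove_emb, elmo_emb):
        # each input: (num_sents, max_len, dim); x holds token states x_{i,j}
        x = self.proj(torch.cat([char_emb, glove_emb, elmo_emb], dim=-1))
        g0 = x.mean(dim=1)  # sentence states g^0_i: average of token states
        return x, g0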
Sentence-to-Sentence Updating. After merging token semantics, sentences further grasp sentence-level contextual semantics through the interaction between sentence nodes on the PAG. Since there are three types of edges, we employ a Multi-Relational GCN (Schlichtkrull et al., 2018) to get the contextual sentence representation $c^t_i$ of $S_i$:

$c^t_i = \sigma\Big( W_g\, g^t_i + \sum_{r \in R} \sum_{j \in N^r_i} \frac{1}{|N^r_i|}\, W_r\, g^t_j \Big)$

where $W_g$, $W_r$ are model parameters, $R$ is the set of edge types between sentence nodes, $N^r_i$ denotes the neighbours of the $i$-th sentence node under relation $r \in R$, and $\sigma$ is the ReLU function.
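The following sketch expresses this update in PyTorch; the relation-wise adjacency matrices and the mean-over-neighbours normalisation are assumptions of the sketch rather than details taken from our released code.

import torch
import torch.nn as nn

class SentenceRGCN(nn.Module):
    """One multi-relational GCN step over the sentence nodes of the PAG."""
    def __init__(self, hidden, num_rels=3):
        super().__init__()
        self.w_self = nn.Linear(hidden, hidden, bias=False)  # W_g (self loop)
        self.w_rel = nn.ModuleList(
            [nn.Linear(hidden, hidden, bias=False) for _ in range(num_rels)]
        )                                                    # one W_r per edge type

    def forward(self, g, adj):
        # g: (num_sents, hidden); adj[r]: (num_sents, num_sents),
        # row-normalised adjacency for relation r (the 1/|N^r_i| weighting)
        c = self.w_self(g)
        for r, w_r in enumerate(self.w_rel):
            c = c + adj[r] @ w_r(g)
        return torch.relu(c)  # sigma = ReLU, as in the paper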
Sentence-to-Token Updating. This process updates the token states. It conveys the sentence-level contextual information $c^{t-1}_i$ to the tokens, which is also achieved by the SLSTM. At the $t$-th round, the hidden state of each token $h^t_{i,j}$ is computed as follows:

$h^t_{i,j} = \mathrm{SLSTM}_{g \to h}\big(x_{i,j},\, h^{t-1}_{i,j-1},\, h^{t-1}_{i,j},\, h^{t-1}_{i,j+1},\, c^{t-1}_i\big)$

where $x_{i,j}$ is the initial token embedding. We show the detailed equations of $\mathrm{SLSTM}_{g \to h}$ in Appendix A. The obtained $h^t_{i,j}$ is then used for the token-to-sentence updating of the next round.
After $T$ rounds, we take $c^T_{n-1}$ and $c^T_n$ as the final contextual representations of $S_{n-1}$ and $S_n$, respectively, which fully combine token-level and sentence-level contextual semantics.
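Putting the three processes together, one round of the mechanism can be sketched as follows. The module interfaces are assumptions made for illustration, $T = 3$ is a placeholder rather than the tuned value, and the initialization of $c^0$ is our assumption.

def cross_grained_update(x, g0, slstm_h2g, rgcn, slstm_g2h, adj, T=3):
    """Run T rounds of cross-grained updating over the PAG."""
    h, g, c = x, g0, g0  # c^0: sentence states before any interaction
    for _ in range(T):
        g = slstm_h2g(g, h)       # token-to-sentence: g^t from h^{t-1}
        c_new = rgcn(g, adj)      # sentence-to-sentence: c^t on the PAG
        h = slstm_g2h(x, h, c)    # sentence-to-token: h^t uses c^{t-1}
        c = c_new
    return c                      # rows c[-2], c[-1] are c^T_{n-1}, c^T_n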

Classification Layer
After obtaining the global contextual representations $c^T_{n-1}$ and $c^T_n$, we use a one-layer BiLSTM (Hochreiter and Schmidhuber, 1997) to encode $S_{n-1}$ into $l_{n-1}$ by concatenating the last hidden states of the two directions, and encode $S_n$ into $l_n$ in the same way. $l_{n-1}$ and $l_n$ are local representations that do not consider the wider context. We then concatenate the global and local features:

$X_{cls} = [\, c^T_{n-1};\, c^T_n;\, l_{n-1};\, l_n \,]$

$X_{cls}$ is then fed into a two-layer MLP (a fully-connected layer with ReLU activation followed by a softmax output layer) for classification.
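A minimal sketch of this layer, assuming all four feature vectors share the same size (the true dimensions and concatenation order are not fixed by the description above):

import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Concatenate global (c) and local (l) features; two-layer MLP."""
    def __init__(self, hidden, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, c_prev, c_last, l_prev, l_last):
        x_cls = torch.cat([c_prev, c_last, l_prev, l_last], dim=-1)
        return self.mlp(x_cls)  # logits; softmax is folded into the loss below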
Multi-Task Training. Following previous works (Dai and Huang, 2018; Nguyen et al., 2019), we apply multi-task learning to improve the performance. The main task is implicit discourse relation recognition (IDRR), while the auxiliary tasks are explicit discourse relation recognition (EDRR) and connective prediction (CP). The three tasks share the same encoder but use three different MLPs. The objective function is:

$\mathcal{L} = -\alpha \sum_{k=1}^{C_{idrr}} y^k_{idrr} \log \hat{y}^k_{idrr} \;-\; \beta \sum_{k=1}^{C_{edrr}} y^k_{edrr} \log \hat{y}^k_{edrr} \;-\; \gamma \sum_{k=1}^{C_{cp}} y^k_{cp} \log \hat{y}^k_{cp}$

where $\alpha$, $\beta$, $\gamma$ are adjustable hyper-parameters; $y_{idrr}$, $y_{edrr}$ and $y_{cp}$ are the ground-truth labels of IDRR, EDRR and CP, respectively, while $\hat{y}_{idrr}$, $\hat{y}_{edrr}$ and $\hat{y}_{cp}$ are the corresponding predictions; $C_{idrr}$, $C_{edrr}$ and $C_{cp}$ represent the numbers of classes of IDRR, EDRR and CP, respectively. The evaluation metric is the F1 score; for 4-way classification, we report the macro-averaged F1 score.
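The weighted objective is straightforward to express in PyTorch; the weight values below are placeholders, since the paper treats alpha, beta and gamma as tunable hyper-parameters.

import torch.nn.functional as F

def multitask_loss(logits_idrr, logits_edrr, logits_cp,
                   y_idrr, y_edrr, y_cp,
                   alpha=1.0, beta=0.5, gamma=0.5):
    """Weighted sum of the three cross-entropy terms (IDRR, EDRR, CP)."""
    return (alpha * F.cross_entropy(logits_idrr, y_idrr)
            + beta * F.cross_entropy(logits_edrr, y_edrr)
            + gamma * F.cross_entropy(logits_cp, y_cp))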

Implementation Details
Details of the PAG. We build each PAG over 6 sentences, and use zero padding when the text contains fewer than 6 sentences. When building the PAG, we employ spaCy (https://spacy.io/) to identify co-reference chains, use simple string matching to recognize the same words, and use WordNet (Miller, 1995) to identify synonyms for the lexical chain edges.
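As an illustration of the lexical relatedness test, the following sketch uses NLTK's WordNet interface; it is a hypothetical stand-in for the paper's exact procedure and requires running nltk.download("wordnet") once.

from nltk.corpus import wordnet as wn

def related(w1, w2):
    """True if two (non-stop) words are identical or share a WordNet synset,
    i.e., are synonyms; used to decide lexical chain edges."""
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2:
        return True
    synsets_w1 = set(wn.synsets(w1))
    return any(s in synsets_w1 for s in wn.synsets(w2))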

Results and Discussion
Main Results (Table 1). We carefully design four baselines with different paragraph encoders for a full comparison: (1) "NoContext", the model only using BiLSTM to get local features without considering wider context. (2) "BiLSTM", the model using BiLSTM to encode the paragraph.
(3) "H-LSTM", the model using hierarchical BiL-STM as paragraph encoder. (4) "FCG-Net", the model replacing the PAG in the CT-Net with a fully-connected graph (FCG). Except for the way of encoding paragraph, the other settings of these models are the same as the CT-Net. We can draw the following three conclusions. First, "NoContext" obtains the worst performance in most cases, demonstrating the necessity of using contextual representations. Second, the CT-Net gains better performance than models with sequential paragraph encoders "BiLSTM" and "H-LSTM", which proves the superiority of our graph-based CT-Net. The reason is that the CT-Net can track and model closely related context for sentences including longdistance ones. Third, replacing the PAG in the CT-Net with the FCG (FCG-Net) brings a quality drop, which proves the PAG effectively pick out appropriate context that benefits on sentence understanding. We also performed paired t-test between CT-Net and these 4 baselines. The CT-Net is significantly   better than all these baselines with p < 0.05. Analysis of the PAG ( Table 2). The PAG contains three types of edges: adjacency edge (Adj.), co-reference edge (Coref.) and lexical chain edge (Lex.). To understand the impact of these edges, we conduct ablation experiments on 4-way classification. Rows 1-3 report the results of removing "Adj.", "Coref.", and "Lex." respectively. Removing "Adj." brings the biggest drop (0.97%), which reflects that the adjacency edge plays the most important role in the PAG. We also explore the impact of the number of sentences in the PAG. Rows 4-6 report the results. The CT-Net gains the best performance when the PAG contains 6 sentences, and modeling a longer paragraph of 8 sentences causes a decline. We hypothesize that modeling a paragraph this is too long may introduce some irrelevant context, resulting in a reduction in performance.
Comparison with Existing Systems (Table 3). Table 3 shows the comparison with existing systems. Our method outperforms other models on 4-way classification, and also gains the best performance on the binary classifications of temporal (Temp.) and expansion (Exp.).
Ablation Study of Multi-task Learning (Table 4). Following Dai and Huang (2018) and Nguyen et al. (2019), we utilize explicit discourse relation recognition (EDRR) and connective prediction (CP) as auxiliary tasks to help implicit discourse relation recognition (IDRR). We conduct ablation experiments on the two auxiliary tasks in the 4-way classification setting (Table 4) to show their impact. Row 1 is the performance of the CT-Net. Rows 2-3 report the performance after removing each auxiliary task. As expected, EDRR contributes more to IDRR than CP does, because EDRR is a task more similar to IDRR.

Conclusion
We propose a novel graph-based Context Tracking Network (CT-Net) to model the context for implicit discourse relation recognition. The CT-Net first converts the paragraph into a paragraph association graph (PAG), where each sentence tracks its appropriate context through different types of edges, then employs the cross-grained updating mechanism to combine sentence-level and token-level contextual information. Experiments on PDTB 2.0 demonstrate that the CT-Net captures more effective contextual information than carefully designed baselines with different context encoders.
Appendix A: Detailed Equations of the SLSTM

$\mathrm{SLSTM}_{h \to g}$. At the $t$-th round, the sentence state $g^t_i$ is computed from the token states of the last round:

$\bar{h}^{t-1}_i = \mathrm{avg}\big(h^{t-1}_{i,0}, \dots, h^{t-1}_{i,|S_i|}\big)$

$\hat{f}^t_{g_i} = \sigma\big(W_g\, g^{t-1}_i + U_g\, \bar{h}^{t-1}_i + b_g\big)$

$\hat{f}^t_{i,j} = \sigma\big(W_f\, g^{t-1}_i + U_f\, h^{t-1}_{i,j} + b_f\big)$

$o^t_i = \sigma\big(W_o\, g^{t-1}_i + U_o\, \bar{h}^{t-1}_i + b_o\big)$

$f^t_{i,0}, \dots, f^t_{i,|S_i|}, f^t_{g_i} = F_s\big(\hat{f}^t_{i,0}, \dots, \hat{f}^t_{i,|S_i|}, \hat{f}^t_{g_i}\big)$

$v^t_{g_i} = f^t_{g_i} \odot v^{t-1}_{g_i} + \sum_{j} f^t_{i,j} \odot v^{t-1}_{i,j}$

$g^t_i = o^t_i \odot \tanh\big(v^t_{g_i}\big)$

where $W_*$, $U_*$ and $b_*$ are model parameters, with $* \in \{g, f, o\}$. $|S_i|$ is the number of tokens of the $i$-th sentence. $f^t_{i,0}, \dots, f^t_{i,|S_i|}$ and $f^t_{g_i}$ are gates controlling information from $v^{t-1}_{i,0}, \dots, v^{t-1}_{i,|S_i|}$ and $v^{t-1}_{g_i}$, respectively. $o^t_i$ is an output gate from the recurrent cell $v^t_{g_i}$ to $g^t_i$. $F_s$ represents the softmax function.

$\mathrm{SLSTM}_{g \to h}$. At the $t$-th round, the hidden state of each token $h^t_{i,j}$ is computed based on the initial input $x_{i,j}$, its hidden state of the last round $h^{t-1}_{i,j}$, the hidden states of its neighbours of the last round $h^{t-1}_{i,j-1}$, $h^{t-1}_{i,j+1}$, and the contextual representation $c^{t-1}_i$.
Here $\varepsilon^t_{i,j} = [\, h^{t-1}_{i,j-1};\, h^{t-1}_{i,j};\, h^{t-1}_{i,j+1};\, c^{t-1}_i \,]$ denotes the context of the $j$-th token, and $W_*$, $U_*$ and $b_*$ are model parameters, with $* \in \{i, l, r, f, s, o\}$. $F_s$ represents the softmax function, and $\sigma$ represents the sigmoid function. $i^t_{i,j}$, $l^t_{i,j}$, $r^t_{i,j}$, $s^t_{i,j}$ and $f^t_{i,j}$ are gates conveying information from $\varepsilon^t_{i,j}$ and $x_{i,j}$ to the cell state $v^t_{i,j}$, and are normalised. $o^t_{i,j}$ is an output gate from the cell $v^t_{i,j}$ to the hidden state $h^t_{i,j}$.