Recognizing Implicit Discourse Relations via Repeated Reading: Neural Networks with Multi-Level Attention

Recognizing implicit discourse relations is a challenging but important task in the field of Natural Language Processing. For such a complex text processing task, different from previous studies, we argue that it is necessary to repeatedly read the arguments and dynamically exploit the features that are useful for recognizing discourse relations. To mimic the repeated reading strategy, we propose neural networks with multi-level attention (NNMA), which combine the attention mechanism and external memories to gradually fix attention on the specific words that are helpful for judging the discourse relations. Experiments on the PDTB dataset show that our proposed method achieves state-of-the-art results. The visualization of the attention weights also illustrates how our model examines the arguments at each level and progressively locates the important words.


Introduction
Discourse relations (e.g., contrast and causality) support a set of sentences to form a coherent text. Automatically recognizing discourse relations can help many downstream tasks such as question answering and automatic summarization. Despite great progress in classifying explicit discourse relations, where discourse connectives (e.g., "because", "but") explicitly exist in the text, implicit discourse relation recognition remains a challenge due to the absence of discourse connectives. Previous research has mainly focused on exploring various kinds of efficient features and machine learning models to classify implicit discourse relations (Soricut and Marcu, 2003; Baldridge and Lascarides, 2005; Subba and Di Eugenio, 2009; Hernault et al., 2010; Pitler et al., 2009; Joty et al., 2012). To some extent, these methods simulate the single-pass reading process in which a person quickly skims the text in one pass and directly collects important clues for understanding it. Although single-pass reading plays a crucial role when we just want the general meaning and do not necessarily need to understand every single point of the text, it is not enough for tackling tasks that need a deep analysis of the text. In contrast with single-pass reading, repeated reading involves a process where learners repeatedly read the text in detail with specific learning aims, and it has the potential to improve readers' reading fluency and comprehension of the text (National Institute of Child Health and Human Development, 2000; LaBerge and Samuels, 1974). Therefore, for the task of discourse parsing, repeated reading is necessary, as it is difficult to determine which words are really useful on the first try, and efficient features should be dynamically exploited through several passes of reading. Now, let us examine one real example to elaborate the necessity of using repeated reading in discourse parsing.
Arg-1: the use of 900 toll numbers has been expanding rapidly in recent years

Arg-2: for a while, high-cost pornography lines and services that tempt children to dial (and redial) movie or music information earned the service a somewhat sleazy image

(Comparison - wsj 2100)

To identify the "Comparison" relation between the two arguments Arg-1 and Arg-2, the most crucial clues mainly lie in content like "expanding rapidly" in Arg-1 and "earned the service a somewhat sleazy image" in Arg-2, since there exists a contrast between the semantic meanings of these two text spans. However, it is difficult to obtain sufficient information for pinpointing these words by scanning the argument pair left to right in one pass. In such a case, we follow the repeated reading strategy: we obtain the general meaning by reading the arguments for the first time, re-read them later, and gradually pay close attention to the key content.
Recently, some approaches simulating repeated reading have seen success in different tasks. These models mostly combine the attention mechanism, originally designed to solve the alignment problem in machine translation (Bahdanau et al., 2014), with an external memory that can be read and written while processing the input (Sukhbaatar et al., 2015). For example, Kumar et al. (2015) drew attention to specific facts of the input sequence and processed the sequence via multiple hops to generate an answer. In computer vision, Yang et al. (2015) pointed out that repeatedly attending to different regions of an image could gradually lead to more precise image representations.
Inspired by this recent work, for discourse parsing, we propose a model that aims to repeatedly read an argument pair and gradually focus on more fine-grained parts after grasping the global information. Specifically, we design the Neural Networks with Multi-Level Attention (NNMA), consisting of one general level and several attention levels.
At the general level, we capture the general representation of each argument based on two bidirectional long short-term memory (LSTM) models. At each attention level, NNMA generates a weight vector over the argument pair to locate the parts important to the discourse relation, and an external short-term memory is designed to store the information exploited at previous levels and to help update the argument representations. We stack this structure in a recurrent manner, mimicking the process of reading the arguments multiple times. Finally, we use the representation output by the highest attention level to identify the discourse relation. Experiments on the PDTB dataset show that our proposed model achieves state-of-the-art results.

Repeated Reading Neural Network with Multi-Level Attention
In this section, we describe how we use neural networks with multi-level attention to repeatedly read the argument pairs and recognize implicit discourse relations. First, we get a general understanding of the arguments by skimming them. To implement this, we adopt the bidirectional Long Short-Term Memory neural network (bi-LSTM) to model each argument, as the bi-LSTM is good at modeling a sequence of words and can represent each word with consideration of more contextual information. Then, several attention levels are designed to simulate the subsequent multiple passes of reading. At each attention level, an external short-term memory is used to store what has been learned from previous passes and to guide which words should be focused on. To pinpoint the useful parts of the arguments, the attention mechanism is used to predict a probability distribution over the words, indicating to what degree each word should be attended to. The overall architecture of our model is shown in Figure 1. For clarity, we only illustrate two attention levels in the figure; note that our model can easily be extended to more attention levels.

Representing Arguments with LSTM
The Long Short-Term Memory (LSTM) neural network is a variant of the recurrent neural network that is commonly used for modeling a sequence. In our model, we adopt two LSTM networks to respectively model the two arguments: the left argument Arg-1 and the right argument Arg-2.
First of all, we associate each word w in our vocabulary with a vector representation x_w ∈ R^{D_e}. Here we adopt the pre-trained vectors provided by GloVe (Pennington et al., 2014). Since an argument can be viewed as a sequence of word vectors, let x^1_i (x^2_i) be the i-th word vector in argument Arg-1 (Arg-2); the two arguments can then be represented as

Arg-1: [x^1_1, x^1_2, ..., x^1_{L_1}],    Arg-2: [x^2_1, x^2_2, ..., x^2_{L_2}],

where Arg-1 has L_1 words and Arg-2 has L_2 words. Before modeling the two arguments, we briefly introduce how an LSTM network models a sequence of words. At the i-th time step, the model reads the i-th word x_i as input and updates the output vector h_i as follows (Zaremba and Sutskever, 2014):

[i; f; o; l] = [σ; σ; σ; tanh](W [x_i; h_{i−1}] + b)
c_i = f ⊙ c_{i−1} + i ⊙ l
h_i = o ⊙ tanh(c_i)        (1)

where [ ] denotes the concatenation of several vectors. i, f, o and c denote the input gate, forget gate, output gate and memory cell respectively in the LSTM architecture. The input gate i determines how much the input x_i updates the memory cell. The output gate o controls how much the memory cell influences the output. The forget gate f controls how the past memory c_{i−1} affects the current state.
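The LSTM step of Equation (1) can be sketched in NumPy as follows (a minimal illustration with the four gate blocks stacked in one weight matrix W; the function and variable names are ours, not from the paper):

```python
import numpy as np

def lstm_step(x_i, h_prev, c_prev, W, b):
    """One LSTM step in the style of Zaremba and Sutskever (2014).
    x_i: (d_e,) input word vector; h_prev, c_prev: (d,) previous output
    and memory cell; W: (4d, d_e + d) stacked gate weights; b: (4d,)."""
    d = h_prev.shape[0]
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = W @ np.concatenate([x_i, h_prev]) + b
    i = sigmoid(z[:d])        # input gate: how much x_i updates the cell
    f = sigmoid(z[d:2*d])     # forget gate: how much of c_{i-1} is kept
    o = sigmoid(z[2*d:3*d])   # output gate: how much the cell leaks out
    l = np.tanh(z[3*d:])      # candidate cell content
    c_i = f * c_prev + i * l
    h_i = o * np.tanh(c_i)    # output vector for time step i
    return h_i, c_i
```

Running this function once per word, left to right, yields the sequence of output vectors h_1, ..., h_L for one direction of the model.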
Referring to the work of Wang and Nyberg (2015), we implement the bidirectional version of the LSTM network to model each argument sequence. Besides processing the sequence in the forward direction, the bidirectional LSTM (bi-LSTM) also processes it in the reverse direction. As shown in Figure 1, using two bi-LSTM networks we obtain

h^1 = [h^1_1, h^1_2, ..., h^1_{L_1}] ∈ R^{2d×L_1},    h^2 = [h^2_1, h^2_2, ..., h^2_{L_2}] ∈ R^{2d×L_2},

where each column concatenates the forward and backward output vectors of one word. Next, to get the general-level representations of the arguments, we apply a mean pooling operation over the bi-LSTM outputs and obtain two vectors R^1_0 and R^2_0, which reflect the global information of the argument pair.
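The bi-LSTM output collection and mean pooling can be sketched as follows (a NumPy illustration under the assumption that the forward and backward passes have already produced per-word outputs; names are illustrative):

```python
import numpy as np

def general_representation(h_fwd, h_bwd):
    """h_fwd, h_bwd: (L, d) per-word outputs of the forward and backward
    LSTM passes over one argument. Returns the (L, 2d) bi-LSTM output
    matrix h and the general-level representation R_0 (mean over words)."""
    h = np.concatenate([h_fwd, h_bwd], axis=1)  # concatenate per word -> (L, 2d)
    R_0 = h.mean(axis=0)                        # mean pooling over the L words
    return h, R_0
```

The matrix h is kept around because the attention levels below re-weight its columns; R_0 is the global summary used at the general level.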

Tuning Attention via Repeated Reading
After obtaining the general-level representations by treating each word equally, we simulate repeated reading and design multiple attention levels to gradually pinpoint those words particularly useful for discourse relation recognition. At each attention level, we adopt the attention mechanism to determine which words should be focused on. An external short-term memory is designed to remember what has been seen at the prior levels and to guide the attention tuning process at the current level. Specifically, at the first attention level, we concatenate R^1_0, R^2_0 and R^1_0 − R^2_0 and apply a non-linear transformation over the concatenation to capture the general understanding of the argument pair. The use of R^1_0 − R^2_0 takes a cue from the difference between two vector representations, which has been found explainable and meaningful in many applications (Mikolov et al., 2013). Then, we get the memory vector M_1 ∈ R^{d_m} of the first attention level as

M_1 = tanh(W_{m,1} [R^1_0; R^2_0; R^1_0 − R^2_0]),

where W_{m,1} ∈ R^{d_m×6d} is the weight matrix.
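The first-level memory vector can be sketched in NumPy as follows (dimensions are illustrative; the concatenation of two 2d vectors and their difference gives the 6d input expected by W_{m,1}):

```python
import numpy as np

def first_memory(R1_0, R2_0, W_m1):
    """M_1 = tanh(W_{m,1} [R_0^1; R_0^2; R_0^1 - R_0^2]).
    R1_0, R2_0: (2d,) general-level representations; W_m1: (d_m, 6d)."""
    feat = np.concatenate([R1_0, R2_0, R1_0 - R2_0])  # (6d,) pair features
    return np.tanh(W_m1 @ feat)                       # (d_m,) memory vector
```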
With M_1 recording the general meaning of the argument pair, our model re-calculates the importance of each word. We assign each word a weight measuring to what degree our model should pay attention to it; these weights are the so-called "attention" in our paper. This process is designed to simulate re-reading the arguments and paying more attention to some specific words, given an overall understanding derived from the first-pass reading. Formally, for Arg-1, we use the memory vector M_1 to update the representation of each word with a non-linear transformation. According to the updated word representations o^1_1, we get the attention vector a^1_1:
o^1_1 = tanh(W^1_{a,1} h^1 + W^1_{b,1} (M_1 ⊗ e))
a^1_1 = softmax(W^1_{s,1} o^1_1)

where h^1 ∈ R^{2d×L_1} is the concatenation of all LSTM output vectors of Arg-1, and e ∈ R^{L_1} is a vector of 1s; the M_1 ⊗ e operation repeats the vector M_1 L_1 times, generating a d_m × L_1 matrix. The attention vector a^1_1 ∈ R^{L_1} is obtained by applying a softmax operation over o^1_1. W^1_{a,1} ∈ R^{2d×2d}, W^1_{b,1} ∈ R^{2d×d_m} and W^1_{s,1} ∈ R^{1×2d} are the transformation weights. Note that the subscripts denote the current attention level and the superscripts denote the corresponding argument. In the same way, we can get the attention vector a^2_1 for Arg-2. Then, according to a^1_1 and a^2_1, our model re-reads the arguments and gets the new representations R^1_1 and R^2_1 for the first attention level as the attention-weighted sums

R^1_1 = h^1 a^1_1,    R^2_1 = h^2 a^2_1.
Next, we iterate the "memory-attention-representation" process and design more attention levels, giving NNMA the ability to gradually infer more precise attention vectors. The processing at the second and higher attention levels is slightly different from that at the first level, as we update the memory vector in a recurrent way. Formally, for the k-th attention level (k ≥ 2), we use the following formulae for Arg-1:

M_k = tanh(W_{m,k} [R^1_{k−1}; R^2_{k−1}; R^1_{k−1} − R^2_{k−1}; M_{k−1}])
o^1_k = tanh(W^1_{a,k} h^1 + W^1_{b,k} (M_k ⊗ e))
a^1_k = softmax(W^1_{s,k} o^1_k)
R^1_k = h^1 a^1_k
In the same way, we can compute o^2_k, a^2_k and R^2_k for Arg-2. Finally, we use the newest representations derived from the top attention level to recognize the discourse relation. Supposing there are K attention levels in total and n relation types, the predicted discourse relation distribution P ∈ R^n is calculated as

P = softmax(W_p [R^1_K; R^2_K; R^1_K − R^2_K] + b_p),

where W_p ∈ R^{n×6d} and b_p ∈ R^n are the transformation weights.
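The final classification step can be sketched as follows (a minimal NumPy version with the softmax written out explicitly; dimensions are illustrative):

```python
import numpy as np

def predict_relation(R1_K, R2_K, W_p, b_p):
    """P = softmax(W_p [R_K^1; R_K^2; R_K^1 - R_K^2] + b_p).
    R1_K, R2_K: (2d,) top-level representations; W_p: (n, 6d); b_p: (n,).
    Returns a probability distribution over the n relation types."""
    feat = np.concatenate([R1_K, R2_K, R1_K - R2_K])  # (6d,)
    z = W_p @ feat + b_p
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()
```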

Model Training
To train our model, the training objective is defined as the cross-entropy loss between the outputs of the softmax layer and the ground-truth class labels. We use stochastic gradient descent (SGD) with momentum to train the neural networks.
To avoid over-fitting, a dropout operation is applied to the top feature vector before the softmax layer. Also, following (Ji and Eisenstein, 2015), we use different learning rates λ and λ_e to train the neural network parameters Θ and the word embeddings Θ_e. λ_e is set to a small value to prevent over-fitting on this task. The settings of the hyper-parameters are given in the experimental part.
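The two-learning-rate update can be sketched as follows (an illustrative NumPy version of SGD with momentum; the 'emb' name prefix used to single out embedding parameters is our own convention, not the paper's):

```python
import numpy as np

def sgd_momentum_step(params, grads, vel, lr_net, lr_emb, momentum):
    """One SGD-with-momentum update using learning rate lr_net for the
    network parameters Theta and a smaller lr_emb for the word
    embeddings Theta_e. All dict arguments map parameter name -> array."""
    for name in params:
        lr = lr_emb if name.startswith('emb') else lr_net
        vel[name] = momentum * vel[name] - lr * grads[name]  # update velocity
        params[name] = params[name] + vel[name]              # apply step
    return params, vel
```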

Preparation
We evaluate our model on the Penn Discourse Treebank (PDTB) (Prasad et al., 2008). In our work, we experiment on the four top-level classes in this corpus, as in previous work (Rutherford and Xue, 2015). We extract all the implicit relations of PDTB and follow the setup of (Rutherford and Xue, 2015), splitting the data into a training set (Sections 2-20), a development set (Sections 0-1), and a test set (Sections 21-22); the statistics are shown in Table 1. We first convert the tokens in PDTB to lowercase. The word embeddings used for initializing the word representations are provided by GloVe (Pennington et al., 2014), and the dimension of the embeddings D_e is 50. The hyper-parameters, including the momentum δ, the two learning rates λ and λ_e, the dropout rate q, the dimension d of the LSTM output vector, and the dimension d_m of the memory vector, are all set according to the performance on the development set. Due to space limitations, we do not present the details of tuning the hyper-parameters and only give their final settings, as shown in Table 2. To evaluate our model, we adopt two experiment settings. The first is the four-way classification task, and the second is the binary classification task, where we build a one-vs-other classifier for each class. For the second setting, to address the problem of unbalanced classes in the training data, we follow the reweighting method of (Rutherford and Xue, 2015) and reweight the training instances according to the size of each relation class. We also use visualization methods to analyze how multi-level attention helps our model.

Results
First, we design experiments to evaluate the effectiveness of attention levels and how many attention levels are appropriate. To this end, we implement a baseline model (LSTM with no attention) which directly applies the mean pooling operation over LSTM output vectors of two arguments without any attention mechanism.
Then we consider different numbers of attention levels, including one-level, two-level and three-level variants. The detailed results are shown in Table 3. For four-way classification, macro-averaged F_1 and Accuracy are used as evaluation metrics. For binary classification, F_1 is adopted to evaluate the performance on each class. From Table 3, we can see that the basic LSTM model performs the worst. With attention levels added, our NNMA model performs much better. This confirms the observation above that one-pass reading is not enough for identifying the discourse relations. With respect to the four-way F_1 measure, using NNMA with one attention level produces a 4% improvement over the baseline system with no attention. Adding the second attention level gives another 2.8% improvement. We perform significance tests for these two improvements, and both are significant under a one-tailed t-test (p < 0.05). However, when adding the third attention level, the performance does not improve much and almost reaches a plateau: three-level NNMA shows a decrease in F_1 and a slight increase in Accuracy compared to two-level NNMA. The results imply that with more attention levels, our model may perform slightly better, but it may also incur over-fitting due to the additional parameters. With respect to the binary classification F_1 measures, we can see that the "Comparison" relation needs more passes of reading than the other three relations. The reason may be that identifying "Comparison" depends more on deep analysis such as semantic parsing, according to (Zhou et al., 2010). Next, we compare our models with six state-of-the-art baseline approaches, as shown in Table 4. The six baselines are introduced as follows.
• P&C2012: Park and Cardie (2012) designed a feature-based method and promoted the performance through optimizing the feature set.
• J&E2015: Ji and Eisenstein (2015) used two recursive neural networks on the syntactic parse tree to induce the representation of the arguments and the entity spans.
• Zhang2015: Zhang et al. (2015) proposed to use shallow convolutional neural networks to model two arguments respectively. We replicated their model since they used a different setting in preprocessing PDTB.
• R&X2014, R&X2015: Rutherford and Xue (2014) selected lexical features, production rules, and Brown cluster pairs, and fed them into a maximum entropy classifier. Rutherford and Xue (2015) further proposed to gather extra weakly labeled data based on the discourse connectives for the classifier.
• B&D2015: Braud and Denis (2015) combined several hand-crafted lexical features and word embeddings to train a max-entropy classifier.
• Liu2016: Liu et al. (2016) proposed to better classify the discourse relations by learning from other discourse-related tasks with a multi-task neural network.
• Ji2016: Ji et al. (2016) proposed a neural language model over sequences of words and used the discourse relations as latent variables to connect the adjacent sequences.
It is noted that P&C2012 and J&E2015 merged the "EntRel" relation into the "Expansion" relation. For a comprehensive comparison, we also evaluate our model with an additional Expansion+EntRel vs Other classification. Our NNMA model with two attention levels exhibits obvious advantages over the six baseline methods on the whole. It is worth noting that NNMA is even better than the R&X2015 approach, which employs extra data.
As for the performance on each discourse relation, with respect to the F_1 measure, our NNMA model achieves the best results on the "Expansion", "Expansion+EntRel" and "Temporal" relations and competitive results on the "Contingency" relation. The performance on the "Comparison" relation is only worse than R&X2014 and R&X2015. As Rutherford and Xue (2014) stated, the "Comparison" relation is closely related to the constituent parse features of the text, such as production rules. How to represent and exploit this information in our model will be our next research focus.

Analysis of Attention Levels
The multiple attention levels in our model greatly boost the performance of classifying implicit discourse relations. In this subsection, we perform both qualitative and quantitative analysis on the attention levels.
First, we take a three-level NNMA model as an example and analyze its attention distributions at different attention levels by calculating the mean Kullback-Leibler (KL) divergence between any two levels on the training set. In Figure 3, we use kl_{ij} to denote the KL divergence between the i-th and j-th attention levels and kl_{ui} to denote the KL divergence between the uniform distribution and the i-th attention level. We can see that each attention level forms a different attention distribution, and the differences increase at higher levels. It can be inferred that the 2nd and 3rd levels of NNMA gradually neglect some words and pay more attention to others in the arguments. One point worth mentioning is that Arg-2 tends to have more non-uniform attention weights, since kl_{u2} and kl_{u3} of Arg-2 are much larger than those of Arg-1. Also, the changes between attention levels are more obvious for Arg-2, as observed from the values of kl_{12}, kl_{13} and kl_{23}. The reason may be that Arg-2 contains more information related to the discourse relation, and some words in it tend to require focused attention, as Arg-2 is syntactically bound to the implicit connective.
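The mean KL divergence between two attention levels can be computed as follows (a NumPy sketch; the small epsilon added for numerical safety is our own choice):

```python
import numpy as np

def mean_kl(attn_i, attn_j, eps=1e-12):
    """Mean KL(a_i || a_j) over a set of arguments.
    attn_i, attn_j: lists of (L,) attention distributions, paired by
    argument (L may differ across arguments, but matches within a pair)."""
    kls = [np.sum(p * np.log((p + eps) / (q + eps)))  # KL for one argument
           for p, q in zip(attn_i, attn_j)]
    return float(np.mean(kls))
```

Comparing each level against a uniform distribution of the same length gives the kl_{ui} quantities in the same way.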
At the same time, we visualize the attention levels of some example argument pairs analyzed by the three-level NNMA. To illustrate the k-th attention level, we take its attention weights a^1_k and a^2_k, which reflect the contribution of each word, and depict them as a row of color-shaded grids in Figure 2.
We can see that the NNMA model focuses on different words at different attention levels. Interestingly, from Figure 2, we find that the 1st and 3rd attention levels focus on some similar words, while the 2nd level is relatively different from them. It seems that NNMA tries to find some clues (e.g., "moscow could be suspended" in Arg-2a; "won the business" in Arg-1b; "with great aplomb he considers not only" in Arg-2c) for recognizing the discourse relation at the 1st level, looks closely at other words (e.g., "misuse of psychiatry against dissenters" in Arg-2a; "a third party that" in Arg-1b; "and support of hitler" in Arg-2c) at the 2nd level, and then reconsiders the arguments, focuses on some specific words (e.g., "moscow could be suspended" in Arg-2a; "has not only hurt" in Arg-2b) and makes the final decision at the last level.

Implicit Discourse Relation Classification
The Penn Discourse Treebank (PDTB) (Prasad et al., 2008), known as the largest discourse corpus, is composed of 2159 Wall Street Journal articles. Each document is annotated with predicate-argument structures, where the predicate is the discourse connective (e.g., while) and the arguments are two text spans around the connective. The discourse connective can be either explicit or implicit. In PDTB, a hierarchy of relation tags is provided for annotation. In our study, we use the four top-level tags: Temporal, Contingency, Comparison and Expansion. These four core relations allow us to be theory-neutral, since they are included in almost all discourse theories, sometimes under different names (Wang et al., 2012).
Implicit discourse relation recognition is often treated as a classification problem. The first work to tackle this task on PDTB was (Pitler et al., 2009), which selected several surface features to train four binary classifiers, each for one of the top-level PDTB relation classes. Extending this work, Lin et al. (2009) further identified four different feature types representing the context, the constituent parse trees, the dependency parse trees and the raw text respectively. Rutherford and Xue (2014) used Brown clusters to replace the word pair features, addressing the sparsity problem. Ji and Eisenstein (2015) adopted two recursive neural networks to exploit the representations of arguments and entity spans. Very recently, Liu et al. (2016) proposed a two-dimensional convolutional neural network (CNN) to model the argument pairs and employed a multi-task learning framework to boost the performance by learning from other discourse-related tasks. Ji et al. (2016) proposed a neural language model over sequences of words that treats the discourse relations as latent variables connecting the adjacent sequences.

Neural Networks and Attention Mechanism
Recently, neural network-based methods have gained prominence in the field of natural language processing (Kim, 2014). Such methods are primarily based on learning a distributed representation for each word, also called a word embedding (Collobert et al., 2011). The attention mechanism was first introduced into neural models to solve the alignment problem between different modalities.
Graves (2013) designed a neural network to generate handwriting from text; it assigned a window over the input text at each step and generated characters based on the content within the window. Bahdanau et al. (2014) introduced this idea into machine translation, where their model computed a probability distribution over the input sequence when generating each target word. Tan et al. (2015) proposed an attention-based neural network to model both questions and sentences for selecting appropriate non-factoid answers.
In parallel, the idea of equipping a neural model with an external memory has gained increasing attention recently. A memory can remember what the model has learned and guide its subsequent actions. Sukhbaatar et al. (2015) presented a neural network that reads and updates an external memory in a recurrent manner under the guidance of a question embedding. Kumar et al. (2015) proposed a similar model where a memory was designed to change the gate of the gated recurrent unit at each iteration.

Conclusion
As a complex text processing task, implicit discourse relation recognition needs a deep analysis of the arguments. To this end, we propose, for the first time, to imitate the repeated reading strategy and dynamically exploit efficient features through several passes of reading. Following this idea, we design neural networks with multiple levels of attention (NNMA), where the general level and the attention levels represent the first and subsequent passes of reading. With the help of external short-term memories, NNMA can gradually update the argument representations at each attention level and fix attention on the specific words that provide effective clues for discourse relation recognition. We conducted experiments on PDTB, and the evaluation results show that our model achieves state-of-the-art performance on recognizing implicit discourse relations.