A Recurrent Neural Model with Attention for the Recognition of Chinese Implicit Discourse Relations

We introduce an attention-based Bi-LSTM for Chinese implicit discourse relations and demonstrate that modeling argument pairs as a joint sequence can outperform word order-agnostic approaches. Our model benefits from a partial sampling scheme and is conceptually simple, yet achieves state-of-the-art performance on the Chinese Discourse Treebank. We also visualize its attention activity to illustrate the model’s ability to selectively focus on the relevant parts of an input sequence.


Introduction
True text understanding is one of the key goals in Natural Language Processing and requires capabilities beyond the lexical semantics of individual words or phrases. Natural language descriptions are typically driven by an inter-sentential coherent structure, exhibiting specific discourse properties, which in turn contribute significantly to the global meaning of a text. Automatically detecting how meaning units are organized benefits practical downstream applications, such as question answering (Sun and Chai, 2007), recognizing textual entailment (Hickl, 2008), sentiment analysis (Trivedi and Eisenstein, 2013), or text summarization (Hirao et al., 2013).
* Both first authors contributed equally to this work.
Various formalisms in terms of semantic coherence frameworks have been proposed to account for these contextual assumptions (Mann and Thompson, 1988; Lascarides and Asher, 1993; Webber, 2004). The annotation schemata of the Penn Discourse Treebank (Prasad et al., 2008, PDTB) and the Chinese Discourse Treebank (Zhou and Xue, 2012, CDTB), for instance, define discourse units as syntactically motivated character spans in the text, augmented with relations pointing from the second argument (Arg2, prototypically, a discourse unit associated with an explicit discourse marker) to its antecedent, i.e., the discourse unit Arg1. Relations are labeled with a relation type (its sense) and the associated discourse marker. Both PDTB and CDTB distinguish explicit from implicit relations depending on the presence of such a marker (e.g., because/因). Sense classification for implicit relations is by far more challenging because the argument pairs lack the marker as an important feature. Consider, for instance, the following example from the CDTB, annotated as an implicit CONJUNCTION:
Arg1: 会谈就一些原则和具体问题进行了深入讨论，达成了一些谅解
In the talks, they discussed some principles and specific questions in depth, and reached some understandings
Arg2: 双方一致认为会谈具有积极成果
Both sides agree that the talks have positive results
Motivation: Previous work on implicit sense labeling is heavily feature-rich and requires domain-specific, semantic lexicons (Pitler et al., 2009; Feng and Hirst, 2012; Huang and Chen, 2011). Only recently have resource-lean architectures been proposed. These promising neural methods attempt to infer latent representations appropriate for implicit relation classification (Zhang et al., 2015; Ji et al., 2016; Chen et al., 2016). So far, unfortunately, these models have been evaluated only on four top-level senses, sometimes even with inconsistent evaluation setups.
Furthermore, most systems have initially been designed for the English PDTB and involve complex, task-specific architectures (Liu and Li, 2016), while discourse modeling techniques for Chinese have received very little attention in the literature and are still seriously underrepresented in terms of publicly available systems. What is more, over 80% of all words in Chinese discourse relations are implicit, compared to only 52% in English (Zhou and Xue, 2012).
Recently, in the context of the CoNLL 2016 shared task, a first independent evaluation platform beyond class level has been established. Surprisingly, the best performing neural architectures to date are standard feedforward networks, cf. Wang and Lan (2016); Schenk et al. (2016); Qin et al. (2016). Even though these specific models completely ignore word order within arguments, such feedforward architectures have been claimed to generally outperform any thoroughly-tuned recurrent architecture.
Our Contribution: In this work, we release the first attention-based recurrent neural sense classifier, specifically developed for Chinese implicit discourse relations. Inspired by Zhou et al. (2016), our system is a practical adaptation of the recent advances in relation modeling extended by a novel sampling scheme.
Contrary to previous assertions, our model demonstrates superior performance over traditional bag-of-words approaches with feedforward networks by treating discourse arguments as a joint sequence. We evaluate our method within an independent framework and show that it performs very well beyond standard class-level predictions, achieving state-of-the-art accuracy on the CDTB test set.
We illustrate how our model's attention mechanism provides means to highlight those parts of an input sequence that are relevant for the classification decision, and thus, it may enable a better understanding of the implicit discourse parsing problem. Our proposed network architecture is flexible and largely language-independent as it operates only on word embeddings. It stands out due to its structural simplicity and builds a solid ground for further development towards other textual domains.

Approach
We propose the use of an attention-based bidirectional Long Short-Term Memory (Hochreiter and Schmidhuber, 1997, LSTM) network to predict senses of discourse relations. The model draws upon previous work on LSTM, in particular its bidirectional mode of operation (Graves and Schmidhuber, 2005), attention mechanisms for recurrent models (Bahdanau et al., 2014; Hermann et al., 2015), and the combined use of these techniques for entity relation recognition in annotated sequences (Zhou et al., 2016). More specifically, our model is a flexible recurrent neural network with capabilities to sequentially inspect tokens and to highlight which parts of the input sequence are most informative for the discourse relation recognition task, using the weighting provided by the attention mechanism. Furthermore, the model benefits from a novel sampling scheme for arguments, as elaborated below. The system is learned in an end-to-end manner and consists of multiple layers, which are illustrated in Figure 1.
Figure 1: The attention-based bidirectional LSTM network for the task of modeling argument pairs for Chinese implicit discourse relations.
First, token sequences are taken as input and special markers (<ARG1>, </ARG1>, etc.) are inserted into the corresponding positions to inform the model on the start and end points of argument spans. This way, we can ensure a general flexibility in modeling discourse units and could easily extend them with additional context, for instance. In our experiments on implicit arguments, only the tokens in the respective spans are considered. Note that, unlike previous works, our approach models Arg1-Arg2 pairs as a joint sequence and does not first compute intermediate representations of arguments separately.
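The marker insertion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_joint_sequence` is hypothetical, the `<ARG2>`/`</ARG2>` markers are assumed by analogy with the `<ARG1>` markers named in the text, and the word segmentation of the example is illustrative only.

```python
from typing import List

# Boundary markers; the ARG2 pair is an assumption by analogy with <ARG1>, </ARG1>.
ARG1_OPEN, ARG1_CLOSE = "<ARG1>", "</ARG1>"
ARG2_OPEN, ARG2_CLOSE = "<ARG2>", "</ARG2>"


def build_joint_sequence(arg1: List[str], arg2: List[str]) -> List[str]:
    """Wrap each argument span in its markers and concatenate both spans
    into one joint token sequence (no separate per-argument encoding)."""
    return [ARG1_OPEN, *arg1, ARG1_CLOSE, ARG2_OPEN, *arg2, ARG2_CLOSE]


if __name__ == "__main__":
    # Illustrative segmentation of the Arg2 example from the Introduction.
    arg1_tokens = ["会谈", "达成", "了", "一些", "谅解"]
    arg2_tokens = ["双方", "一致", "认为", "会谈", "具有", "积极", "成果"]
    print(build_joint_sequence(arg1_tokens, arg2_tokens))
```

Because the markers are ordinary tokens, extending a span with additional context only requires changing which tokens are placed between them.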
Second, an input layer encodes tokens using one-hot vector representations ($t_i$ for tokens at positions $i \in [1, k]$), and a subsequent embedding layer provides a dense representation ($e_i$) to serve as input for the recurrent layers. The embedding layer is initialized using pre-trained word vectors, in our case 300-dimensional Chinese Gigaword vectors (Graff and Chen, 2005).³ These embeddings are further tuned as the network is trained towards the prediction task. Embeddings for unknown tokens, e.g., markers, are trained by backpropagation only. Note that tokens, markers, and the pre-trained vectors represent the only source of information for the prediction task.
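The initialization of the embedding layer might look like the sketch below, assuming the pre-trained vectors are available as a token-to-vector mapping. The function name `init_embedding_matrix` and the small uniform range used for unknown tokens are assumptions made for illustration; the text only states that such embeddings are learned by backpropagation.

```python
import random
from typing import Dict, List, Sequence


def init_embedding_matrix(
    vocab: Sequence[str],
    pretrained: Dict[str, List[float]],
    dim: int = 300,
    seed: int = 0,
) -> List[List[float]]:
    """Return one embedding row per vocabulary entry.

    Known tokens copy their pre-trained vector; unknown tokens (e.g. the
    argument markers) get a small random vector that training must shape.
    """
    rng = random.Random(seed)
    matrix = []
    for token in vocab:
        vector = pretrained.get(token)
        if vector is not None:
            if len(vector) != dim:
                raise ValueError(f"expected {dim} dimensions for {token!r}")
            matrix.append(list(vector))
        else:
            # Assumed initialization range; not specified in the text.
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix
```

All rows, whether copied or randomly initialized, would remain trainable so that they can be tuned towards the prediction task.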
For the recurrent setup, we use a layer of LSTM networks in a bidirectional manner, in order to better capture dependencies between parts of the input sequence by inspection of both left- and right-hand-side contexts at each time step. The LSTM holds a state representation as a continuous vector passed to the subsequent time step, and it is capable of modeling long-range dependencies due to its gated memory. The forward ($\overrightarrow{A}$) and backward ($\overleftarrow{A}$) LSTMs traverse the sequence $e_i$, producing sequences of vectors $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ respectively, which are then summed together (indicated by ⊕ in Figure 1).
The resulting sequence of vectors $h_i$ is reduced into a single vector and fed to the final softmax output layer in order to classify the sense label $y$ of the discourse relation. This vector may be obtained either as the final vector $h$ produced by an LSTM, or through pooling of all $h_i$, or by using attention, i.e., as a weighted sum over $h_i$. While the model may be somewhat more difficult to optimize using attention, it provides the added benefit of interpretability, as the weights highlight to what extent the classifier considers the LSTM state vectors at each token during modeling. This is particularly interesting for discourse parsing, as most previous approaches have provided little support for pinpointing the driving features in each argument span.
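The three reduction strategies named above can be compared in a few lines of plain Python. This is a sketch under stated assumptions: the state sequence is stored as a list of vectors (one per time step), the attention scores take the form softmax(w^T tanh(H)) following Zhou et al. (2016), and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def last_state(H: List[Vector]) -> Vector:
    """Final-vector reduction (simplified: last position of the summed states)."""
    return H[-1]


def mean_pool(H: List[Vector]) -> Vector:
    """Average pooling over all time steps."""
    return [sum(h[d] for h in H) / len(H) for d in range(len(H[0]))]


def attention_pool(H: List[Vector], w: Vector) -> Tuple[Vector, Vector]:
    """Weighted sum of the states; returns (r, alpha).

    Assumed scoring: alpha = softmax(w . tanh(h_i)) over time steps i.
    """
    scores = [sum(wd * math.tanh(hd) for wd, hd in zip(w, h)) for h in H]
    alpha = softmax(scores)
    r = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return r, alpha
```

With an all-zero `w`, every time step receives the same weight and the attention output coincides with mean pooling; training `w` is what lets the weights become selective, and the returned `alpha` is the quantity one would inspect for interpretability.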
Finally, the attention layer contains the trainable vector $w$ (of the same dimensionality as vectors $h_i$) which is used to dynamically produce a weight vector $\alpha$ over time steps $i$ by:
$$\alpha = \mathrm{softmax}(w^T \tanh(H))$$
where $H$ is a matrix consisting of vectors $h_i$. The output layer $r$ is the weighted sum of vectors in $H$:
$$r = H\alpha^T$$
³ http://www.cs.brandeis.edu/~clp/conll16st/dataset.html
Partial Argument Sampling: For the purpose of enlarging the instance space of training items in the CDTB, and thus, in order to improve the predictive performance of the model, we propose a novel partial sampling scheme of arguments, whereby the model is trained and validated on sequences containing both arguments, as well as single arguments. A data point $(a_1, a_2, y)$, with $a_i$ being the token sequence of argument $i$, is expanded into $\{(a_1, a_2, y), (a_1, a_2, y), (a_1, y), (a_2, y)\}$. We duplicate bi-argument samples $(a_1, a_2, y)$ (in training and development data only) to balance their frequencies against single-argument samples. Two lines of motivation support the inclusion of single-argument training examples, grounded in linguistics and machine learning, respectively. First, it has been shown that single arguments in isolation can evoke a strong expectation towards a certain implicit discourse relation, cf. Asr and Demberg (2015) and, in particular, Rohde and Horton (2010) in their psycholinguistic study on implicit causality verbs. Second, the procedure may encourage the model to learn better representations of individual argument spans in support of modeling of arguments in composition, cf. LeCun et al. (2015). Due to these aspects, we believe this data augmentation technique to be effective in reinforcing the overall robustness of our model.
Implementational Details: We train the model using fixed-length sequences of 256 tokens, with zero padding at the beginning of shorter sequences and truncation of longer ones. Each LSTM has a vector dimensionality of 300, matching the embedding size.
The model is regularized by a dropout rate of 0.5 between the layers and weight decay ($2.5 \times 10^{-6}$) on the LSTM inputs. We employ Adam optimization (Kingma and Ba, 2014) using the cross-entropy loss function with a mini-batch size of 80.

Rank  CDTB Development Set                  CDTB Test Set
      System                  % accuracy    System                  % accuracy
1     Wang and Lan (2016)     73.53         Wang and Lan (2016)     72.42
2     Qin et al. (2016)       71.57         Schenk et al. (2016)    71.87
3     Schenk et al. (2016)    70.59         [...]                   70.47
4     [...]                   68.30         Qin et al. (2016)       67.41
5     Weiss and Bajec (2016)  66.67         Weiss and Bajec (2016)  64.07
6     Weiss and Bajec (2016)  61.44         Weiss and Bajec (2016)  63.51
7     Jian et al. (2016)      21.90         Jian et al. (2016)      [...]
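The partial argument sampling scheme and the fixed-length input preparation described in this section can be sketched as follows. The function names are hypothetical, a missing argument is represented as `None` for illustration, and the side on which long sequences are truncated is an assumption, since the text does not specify it.

```python
from typing import List, Optional, Tuple

Sample = Tuple[Optional[List[str]], Optional[List[str]], str]


def expand_partial(
    arg1: List[str], arg2: List[str], sense: str, duplicate_pair: bool = True
) -> List[Sample]:
    """Expand one data point (a1, a2, y) into bi- and single-argument samples.

    The bi-argument sample is duplicated (training and development data only)
    to balance its frequency against the two single-argument samples.
    """
    pair: Sample = (arg1, arg2, sense)
    samples: List[Sample] = [pair]
    if duplicate_pair:
        samples.append(pair)
    samples.append((arg1, None, sense))  # Arg1 in isolation
    samples.append((None, arg2, sense))  # Arg2 in isolation
    return samples


def pad_or_truncate(ids: List[int], length: int = 256, pad_id: int = 0) -> List[int]:
    """Left-pad short sequences with zeros; cut long ones to `length`."""
    if len(ids) >= length:
        # Assumption: keep the first `length` tokens when truncating.
        return ids[:length]
    return [pad_id] * (length - len(ids)) + ids
```

Under this scheme each annotated relation contributes four training items instead of one, two of which present the full argument pair.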

Evaluation
We evaluate our recurrent model on the CoNLL 2016 shared task data, which include the official training, development and test sets of the CDTB; cf. Table 2 for an overview of the implicit sense distribution.⁶ In accordance with previous setups, we treat entity relations (ENTREL) as implicit and exclude ALTLEX relations. In the evaluation, we focus on the sense-only track, the subtask for which gold arguments are provided and a system is supposed to label a given argument pair with the correct sense. The results are shown in Table 1.
With our proposed architecture it is possible to correctly label 257/352 (73.01%) of implicit relations on the test set, outperforming the best feedforward system of Wang and Lan (2016) and all other word order-agnostic approaches. Development and test set performances suggest the robustness of our approach and its ability to generalize to unseen data.
⁶ Note that, in the CDTB, implicit relations appear almost three times more often than explicit relations. Out of these, 65% appear within the same sentence. Finally, 25 relations in the training set have two labels.
Ablation Study: We perform an ablation study to quantitatively assess the contribution of two of the characteristic aspects of our model. First, we compare the use of the attention mechanism against the simpler alternative of feeding the final LSTM hidden vectors (the final forward state $\overrightarrow{h}_k$ and the final backward state $\overleftarrow{h}_1$) directly to the output layer. When attention is turned off, this yields an absolute decrease in performance of 2.70% on the test set, which is substantial and significant according to a Welch two-sample t-test (p < .001). Second, we independently compare the use of the partial sampling scheme against training on the standard argument pairs in the CDTB. Here, the absence of the partial sampling scheme yields an absolute decrease in accuracy of 5.74% (p < .001), which demonstrates its importance for achieving competitive performance on the task.
Performance on the PDTB: As a side experiment, we investigate the model's language independence by applying it to the implicit argument pairs of the English PDTB. Due to computational time constraints we do not optimize hyperparameters, but instead train the model using the same settings as for Chinese, which is expected to lead to suboptimal performance on the evaluation data. Nevertheless, we measure 27.09% accuracy on the PDTB test set (surpassing the majority class baseline of 22.01%), which shows that the model has potential to generalize across implicit discourse relations in a different language.
Figure 2: Visualization of attention weights for Chinese characters with high (dark blue) and low (light blue) intensities. The underlined English phrases are semantically structure-shared by the two arguments.
CONJUNCTION:
In the talks, they discussed some principles and specific questions in depth, and reached some understandings
Both sides agree that the talks have positive results

He said: We hope that the Macao government will continue to pay attention to these three issues, in order to find a final proper solution
Peng Li said, Governor Liqi Wei has done a lot of useful work for the smooth settlement of the Macao question, we appreciate that
Visualizing Attention Weights: Finally, in Figure 2, we illustrate the learned attention weights, which pinpoint important subcomponents within a given implicit discourse relation. For the implicit CONJUNCTION relation, the weights peak at the transition across the argument boundary, establishing a connection between the semantically related terms understandings and agree. Most ENTRELs show the opposite trend: here, second arguments exhibit larger intensities than Arg1, as most entity relations follow the characteristic writing style of newspapers, in which further information is added by reference to the same entity.
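One simple way to turn attention weights into the light-to-dark shading used in such a visualization is to rescale them per sequence and bucket them into a few intensity levels. The sketch below is an assumption about how this could be done, not a description of how Figure 2 was produced; the function name and the number of levels are hypothetical.

```python
from typing import List


def weights_to_intensity(weights: List[float], levels: int = 5) -> List[int]:
    """Map attention weights to integer shades 0 (lightest) .. levels-1 (darkest).

    Weights are min-max rescaled within the sequence, so shades are relative
    to the other tokens of the same argument pair.
    """
    low, high = min(weights), max(weights)
    span = high - low
    if span == 0:
        return [0] * len(weights)  # uniform attention: no token stands out
    return [min(levels - 1, int((w - low) / span * levels)) for w in weights]
```

Each shade index can then be mapped to a color of the plotting tool of choice, one token or character at a time.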

Summary & Outlook
In this work, we have presented the first attention-based recurrent neural sense labeler specifically developed for Chinese implicit discourse relations. Its ability to model discourse units sequentially and jointly has been shown to be highly beneficial, both in terms of state-of-the-art performance on the CDTB (outperforming word order-agnostic feedforward approaches), and also in terms of insightful observations into the inner workings of the model through its attention mechanism. The architecture is structurally simple, benefits from partial argument sampling, and can be easily adapted to similar relation recognition tasks. In future work, we intend to extend our approach to different languages and domains, e.g., to the recent data sets on narrative story understanding or question answering (Mostafazadeh et al., 2016; Feng et al., 2015). We believe that recurrent modeling of implicit discourse information can be a driving force in successfully handling such complex semantic processing tasks.