Do We Really Need All Those Rich Linguistic Features? A Neural Network-Based Approach to Implicit Sense Labeling

We describe our contribution to the CoNLL 2016 Shared Task on shallow discourse parsing. Our system extends the two best parsers from the previous year's competition by integrating a novel implicit sense labeling component. It is grounded on a highly generic, language-independent feedforward neural network architecture incorporating weighted word embeddings for argument spans, which obviates the need for (traditional) hand-crafted features. Despite its simplicity, our system overall outperforms all results from 2015 on 5 out of 6 evaluation sets for English and achieves an absolute improvement in F1-score of 3.2% on the PDTB test section for non-explicit sense classification.


Introduction
Text comprehension is an essential part of Natural Language Understanding and requires capabilities beyond capturing the lexical semantics of individual words or phrases. In order to understand how meaning is established, altered and transferred across words and sentences, a model is needed to account for contextual information as a semantically coherent representation of the logical discourse structure of a text. Different formalisms and frameworks have been proposed to realize this assumption (Mann and Thompson, 1988; Lascarides and Asher, 1993; Webber, 2004).
In a more applied NLP context, shallow discourse parsing (SDP) aims to automatically detect relevant discourse units and to label the relations that hold between them. Unlike deep discourse parsing, it requires neither a stringent logical formalization nor the establishment of a global data structure such as a tree.
With the release of the Penn Discourse Treebank (Prasad et al., 2008, PDTB) and the Chinese Discourse Treebank (Zhou and Xue, 2012, CDTB), annotated training data for SDP has become available and, as a consequence, the field has attracted considerable attention from researchers in the NLP and IR communities. Informally, the PDTB annotation scheme describes a discourse unit as a syntactically motivated character span in the text, augmented with relations pointing from the second argument (Arg2, prototypically a discourse unit associated with an explicit discourse marker) to its antecedent, i.e., the discourse unit Arg1. Relations are labeled with a relation type (its sense) and the associated discourse marker (either as found in the text or as inferred by the annotator). PDTB distinguishes explicit and implicit relations depending on whether or not such a connector or cue phrase (e.g., because) is present. As an illustrative example without such a marker, consider the following two adjacent sentences from the PDTB:
Arg1: The real culprits are computer makers such as IBM that have jumped the gun to unveil 486-based products.
Arg2: The reason this is getting so much visibility is that some started shipping and announced early availability.
In this implicit relation, Arg1 and Arg2 are directly related. The discourse relation type is Expansion.Restatement, one out of roughly twenty fine-grained tags marking the sense relation between any given argument pair in the PDTB.
Our Contribution: We participate in the CoNLL 2016 Shared Task on SDP (Xue et al., 2016; Potthast et al., 2014) and propose a novel, neural network-based approach for implicit sense labeling. Its system architecture is modular, highly generic and mostly language-independent, and it leverages pre-trained word embeddings for the SDP sense classification task. Our parser performs well on both English and Chinese data and is highly competitive with the state-of-the-art, yet it does not require manual feature engineering as employed in most prior work on implicit SDP, but rather relies extensively on features learned from data.

Related Work
Most of the literature on automated discourse parsing has focused on specialized subtasks such as:
1. Argument identification (Ghosh et al., 2012; Kong et al., 2014)
2. Explicit sense classification
3. Implicit sense classification (Marcu and Echihabi, 2002; Lin et al., 2009; Zhou et al., 2010; Park and Cardie, 2012; Biran and McKeown, 2013; Rutherford and Xue, 2014)
A minimal requirement for any full-fledged end-to-end discourse parser is to integrate at least these three processes into a sequential pipeline. However, until recently, only a handful of such parsers have existed (Lin et al., 2014; Biran and McKeown, 2015; duVerle and Prendinger, 2009; Feng and Hirst, 2012). It has been enormously difficult to evaluate the performance of these systems against one another, and also to compare the efficiency of their individual components with other competing methods, as (i) those systems rely on different theories of discourse, e.g., PDTB or RST; and (ii) different (sub)modules involve custom settings and feature- and tool-specific parameters (especially for the most challenging task of implicit sense labeling). Furthermore, (iii) most previous works are not directly comparable in terms of overall accuracies, as their underlying evaluation data suffers from inconsistent label set sizes across studies (e.g., full sense inventory vs. simplified 1- or 2-level classes, cf. Huang and Chen (2011)).
Fortunately, with the first edition of the shared task on SDP, Xue et al. (2015) established a unified framework and made an independent evaluation possible. The best performing participating systems, most notably those by Wang and Lan (2015) and Stepanov et al. (2015), reimplemented well-established techniques, for example the one by Lin et al. (2014).

Deep Learning Approaches to SDP
In last year's shared task, deep learning approaches attracted a first surge of interest:  and  proposed a recurrent neural network for argument identification and a paragraph vector model for sense classification. Distributed representations for both arguments were obtained by vector concatenation of embeddings.
An earlier attempt in a similar direction of representation learning (Bengio et al., 2013) was made by Ji and Eisenstein (2014). The authors successfully demonstrated how to discriminatively learn a latent, low-dimensional feature representation for RST-style discourse parsing, which has the benefit of capturing the underlying meaning of elementary discourse units without suffering from the data sparsity of the original high-dimensional input data.
Closely related, Li et al. (2014) introduced a recursive neural network for discourse parsing which jointly models distributed representations for sentences based on words and syntactic information. The approach is motivated by Socher et al. (2013) and uses the root embedding of a discourse unit, obtained from its parts by an iterative process, to represent the whole unit. Their system is made up of a binary structure classifier and a multi-class relation classifier and achieves performance similar to that of Ji and Eisenstein (2014).
Very recently, Liu et al. (2016) and  have successfully applied convolutional neural networks to model implicit relations within the PDTB framework. Along these lines and inspired by the work of Weiss (2015), we also see great potential in the use of neural network-based techniques for SDP. Similarly, our approach trains a modular component for shallow discourse parsing which incorporates distributed word representations for argument spans by abstraction from surface-level (token) information. Crucially, our approach replaces the traditional sparse and hand-crafted features from the literature with a minimalist but, at the same time, general (latent) representation of the discourse units. In the next sections, we elaborate on our novel neural network-based approach for implicit sense labeling and on how it fits into the overall system architecture of the parser.

A Neural Sense Labeler for Implicit and Entity Relations
We construct a neural network-based module for the classification of senses for both implicit and entity (EntRel) relations. As a very general and highly data-driven approach to modeling discourse relations, our classifier incorporates only word embeddings and basic syntactic dependency information. Also, in order to keep the setup easily adaptable to new data and other languages, we avoid the use of very specific and costly hand-crafted features (such as sentiment polarities, word-pair features, cue phrases, modality, production rules, highly specific semantic information from external ontologies such as VerbNet, etc.), which have been the main focus in traditional approaches to SDP (Huang and Chen, 2011; Park and Cardie, 2012; Feng and Hirst, 2012). Instead, we substitute the (sparse) tokens in the argument spans with dense, distributed representations, i.e., word embeddings, as the main source of information for the sense classification component. Closely related,  have explored a similar approach of constructing argument vectors by applying a set of aggregation functions on their token vectors, however, without the use of additional (syntactic) information, and feeding their vectors into a single-layer neural network only.
In our experiments, we used the pre-trained GoogleNews vectors (for English) and the Gigaword-induced vectors (for Chinese) provided by the shared task as a starting point. We further trained the word vectors on the raw Wall Street Journal texts, thus tuning the embeddings toward the data at hand, with the goal of considerably improving their predictive power in the sense classification task. Specifically, the pre-trained vectors of size 300 were updated by the skip-gram method (Mikolov et al., 2013) 5 in multiple passes over the newswire texts with a decreasing learning rate. This procedure is intended to improve both the quality of the embeddings and their coverage.
Our new word vector model provides general vector representations for each token in the two argument spans, 6 which form the basis for producing compositional vectors to represent the two spans. Compositional vectors that introduce a fixed-length representation of a variable-length span of tokens are practical features for feedforward neural networks. Thus, we may combine the token vectors of each span by simply averaging vectors, or, following Mitchell and Lapata (2008), by calculating an aggregated argument vector $v^{(j)}$:

$$v^{(j)} = \frac{1}{k^{(j)}} \sum_{i=1}^{k^{(j)}} V_i^{(j)} + \bigodot_{i=1}^{k^{(j)}} V_i^{(j)} \qquad (1)$$

for arguments $j \in \{1, 2\}$, where $k^{(j)} = |t^{(j)}|$ defines their lengths in number of tokens and $\odot$ applies the pointwise product over the token vectors in $V^{(j)}$.
Both procedures produce rather simple argument representations that do not account for word order variation or any other sentence structure information, yet they serve as decent features for discourse parsing and other related tasks. By introducing pointwise multiplication of the token vectors, the elements that represent assumed independent, latent semantic dimensions are not merely lumped together across vectors, but are allowed to scale according to their mutual relevance. 7
Improving upon the compositional representation produced by Equation 1, we incorporate additional syntactic dependency information: for each token in an argument span, we calculate the depth d from the corresponding sentence's root node and weight the token vector by 1/2^d before applying the aggregating operators. 8 The bottom of Figure 1 illustrates the first step of the process, i.e., mapping tokens to their corresponding vectors based on the updated word vector model, as well as the token depth weighting. Secondly, the aggregation operators are applied, i.e., the sum (+) of the pointwise product (⊙) and average (avg) of the vectors. Finally, the compositional vectors for each of the arguments are concatenated (⊕) and serve as input to a feedforward neural network.
5 We found a window size of 8 and min term count = 3 to be optimal. Neural networks were trained using the gensim package: http://radimrehurek.com/gensim/.
6 We ignore unknown tokens for which no vectors exist.
7 In our experiments, Equation 1 outperformed simpler strategies of either average or multiplication alone. This also indicates that it is beneficial to not completely suppress dimensions with near-zero values for single tokens.
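The composition steps described above (depth weighting, average plus pointwise product, concatenation) can be sketched in a few lines of plain Python. This is a minimal illustration, not the original implementation: the function names are hypothetical, the root token is assumed to have depth d = 0, and the toy 2-dimensional vectors stand in for the 300-dimensional embeddings.

```python
from math import prod


def depth_weight(depth):
    """Weight 1 / 2**d for a token at dependency depth d (assumption: root has d = 0)."""
    return 1.0 / (2 ** depth)


def compose_argument(token_vectors, depths, missing_weight=0.25):
    """Return the sum of the average and the pointwise product of the
    depth-weighted token vectors of one argument span.

    depths[i] is None for tokens that are absent from the parse tree;
    such tokens receive the fixed weight `missing_weight`.
    """
    if not token_vectors:
        raise ValueError("argument span contains no known tokens")
    weighted = []
    for vec, depth in zip(token_vectors, depths):
        w = missing_weight if depth is None else depth_weight(depth)
        weighted.append([w * x for x in vec])
    k = len(weighted)
    dims = len(weighted[0])
    average = [sum(v[i] for v in weighted) / k for i in range(dims)]
    product = [prod(v[i] for v in weighted) for i in range(dims)]
    return [a + p for a, p in zip(average, product)]


def compose_pair(arg1, arg2):
    """Concatenate the composed vectors of Arg1 and Arg2 into one input vector."""
    return compose_argument(*arg1) + compose_argument(*arg2)


# Toy example: Arg1 has two tokens (depths 0 and 1), Arg2 has one unparsed token.
arg1 = ([[1.0, 2.0], [4.0, 2.0]], [0, 1])
arg2 = ([[2.0, 4.0]], [None])
features = compose_pair(arg1, arg2)
```

With 300-dimensional embeddings, the concatenated vector has 600 dimensions regardless of the span lengths, which is what makes it usable as a fixed-size network input.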
Given the composed argument vectors, we set up a network with one hidden layer and a softmax output layer to classify among 20 implicit senses for English and 9 for Chinese, plus an additional EntRel label. Other relations, such as AltLex, are not modeled. We trained the network using Nesterov's Accelerated Gradient (Nesterov, 1983) and optimized all hyper-parameters on the development set. Best results were achieved with rectified linear activation with learnable leak rate and gain. 9
8 Tokens that are missing in the parse tree, such as punctuation symbols, are weighted by 0.25 in our optimal setting.
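To make the shape of the classifier concrete, the following sketch runs a single forward pass through a network with one hidden layer and a softmax output. It is an illustration under stated assumptions rather than the trained model: the weights are random, the leak rate is a fixed argument instead of a learned parameter, the gain is omitted, training with Nesterov's Accelerated Gradient is not shown, and all function names are hypothetical. The layer sizes follow the English setting (two concatenated 300-dimensional argument vectors, 60 hidden nodes, 20 senses plus EntRel).

```python
import math
import random


def leaky_relu(x, leak):
    """Rectified linear unit with a leak rate for negative inputs."""
    return x if x > 0 else leak * x


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, hidden_layer, output_layer, leak=0.01):
    """Input -> hidden layer (leaky ReLU) -> output layer (softmax)."""
    (w1, b1), (w2, b2) = hidden_layer, output_layer
    h = [leaky_relu(sum(w * xi for w, xi in zip(row, x)) + b, leak)
         for row, b in zip(w1, b1)]
    scores = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]
    return softmax(scores)


rng = random.Random(0)
N_INPUT = 600    # Arg1 and Arg2 vectors of size 300, concatenated
N_HIDDEN = 60    # hidden nodes reported for the English tasks
N_CLASSES = 21   # 20 implicit senses plus EntRel
hidden_layer = init_layer(N_INPUT, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_CLASSES, rng)
probs = forward([rng.gauss(0, 1) for _ in range(N_INPUT)], hidden_layer, output_layer)
predicted_class = max(range(N_CLASSES), key=lambda c: probs[c])
```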

The Competition Tasks & Pipelines
We participate in the closed track of the shared task, specifically in both full and supplementary tasks (sense-only) on English and Chinese texts. Full tasks require a participant's system to identify argument pairs and to label the sense relation that holds between them. In each supplementary task, gold arguments are provided so that the performance of sense labeling does not suffer from error propagation due to incorrectly detected argument spans.
We combine different existing modules to address the specific settings and classification needs of both full and supplementary tasks for both languages. The modules and their combination with our implicit neural sense classifier will be outlined in the following sections.
9 The learning rate was set to 0.0001. Momentum of 0.35-0.6 and 60 hidden nodes performed well for the English tasks, and momentum of 0.85 and 40 hidden nodes for Chinese (with fewer output nodes). Good results were also obtained by Parametric Rectified Linear Unit (prelu) activation, as well as the combination of larger hidden layer and stronger regularization (e.g., L1 regularization of 0.1 on 100 nodes).

English Full Task Pipeline (EFTP)
For the full task, we exploit the high-quality argument extraction modules of the two best-performing systems from last year's competition, by Wang and Lan (2015, W&L) and Stepanov et al. (2015), re-using their original implementations. Specifically, we initially run both systems for all explicit relations only, and keep those predicted arguments and sense labels, from either of the two systems, which maximize F1-score on the development set. With this simple heuristic, we hope to improve upon the best results from W&L, as, for instance, Stepanov et al. (2015) perform particularly well on all temporal relations, while W&L's tool handles the majority of other senses well.
For all implicit and EntRel relations, we keep the exact argument spans obtained from the W&L system and reject all sense labels. In a second step, we re-classify all these implicit relations with our neural network-based architecture described in Section 3, given only the tokens and their dependencies in both argument spans. Finally, we merge the combined explicit and the re-classified implicit relations into the final set for evaluation.
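One possible reading of this pipeline can be sketched as follows. The text does not spell out how the choice between the two systems is made, so the sketch assumes a per-sense selection based on development-set F1; the function names, dictionary keys, and toy scores are all hypothetical, and conflicts between overlapping predictions of the two systems are not resolved here.

```python
def pick_system_per_sense(dev_f1_by_system):
    """Map each sense to the system with the highest development-set F1 (assumed criterion)."""
    senses = set()
    for scores in dev_f1_by_system.values():
        senses.update(scores)
    return {
        sense: max(dev_f1_by_system,
                   key=lambda name: dev_f1_by_system[name].get(sense, 0.0))
        for sense in senses
    }


def merge_pipeline(explicit_by_system, preferred, implicit_spans, classify_implicit):
    """Keep explicit relations from the preferred system per sense, then
    re-label the implicit argument spans and append them."""
    merged = []
    for name, relations in explicit_by_system.items():
        merged.extend(r for r in relations if preferred.get(r["sense"]) == name)
    for span in implicit_spans:
        relabeled = dict(span)  # keep the argument spans, replace the sense
        relabeled["sense"] = classify_implicit(span["arg1"], span["arg2"])
        merged.append(relabeled)
    return merged


# Toy development scores (made up for illustration only).
dev_f1 = {
    "wang_lan": {"Comparison.Contrast": 0.9, "Temporal.Synchrony": 0.6},
    "stepanov": {"Comparison.Contrast": 0.8, "Temporal.Synchrony": 0.7},
}
preferred = pick_system_per_sense(dev_f1)

explicit = {
    "wang_lan": [{"id": 1, "sense": "Comparison.Contrast"},
                 {"id": 2, "sense": "Temporal.Synchrony"}],
    "stepanov": [{"id": 3, "sense": "Temporal.Synchrony"}],
}
implicit = [{"id": 4, "arg1": ["tokens"], "arg2": ["tokens"], "sense": "discarded"}]
final = merge_pipeline(explicit, preferred, implicit,
                       lambda arg1, arg2: "Expansion.Restatement")
```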

English Supplementary Task Pipeline (ESTP)
We make use of the system by Stepanov et al. (2015) to label all explicit relation senses, and classify all other relations with an empty token list for connectors (i.e., implicit and EntRels) by our neural network architecture from Section 3.
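The routing described here amounts to a check on the connective token list. The sketch below assumes that each relation is a dictionary with a `Connective` entry holding a `TokenList`, and that both labelers are callables returning a sense string; these field and function names are assumptions made for illustration.

```python
def is_explicit(relation):
    """A relation counts as explicit when its connective token list is non-empty."""
    return bool(relation.get("Connective", {}).get("TokenList"))


def label_senses(relations, explicit_labeler, implicit_labeler):
    """Send each gold-argument relation to the matching sense labeler."""
    labeled = []
    for relation in relations:
        labeler = explicit_labeler if is_explicit(relation) else implicit_labeler
        labeled.append({**relation, "Sense": [labeler(relation)]})
    return labeled


# Stub labelers stand in for the explicit system and the neural classifier.
relations = [
    {"ID": 1, "Connective": {"TokenList": [12]}},
    {"ID": 2, "Connective": {"TokenList": []}},
]
result = label_senses(relations,
                      explicit_labeler=lambda r: "explicit-sense",
                      implicit_labeler=lambda r: "implicit-or-EntRel-sense")
```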

Chinese Full Task Pipeline (CFTP)
Since no reusable argument extraction tools were available for the Chinese full task, we have set up a minimalist (baseline) implementation whose individual steps we sketch briefly:
1. Connective detection is realized by means of a sequence labeling/CRF model. 10 Features are unigram and bigram information from the tokens, their parts-of-speech, dependency head, dependency chain, whether the token is found as a connector in the training set, and its relative position within the sentence.
2. Argument extraction is based on the output of predicted connectives for both inter- and intra-sentence relations. As an additional feature, we found the IOB chain for the syntactic path of a token to be useful.
3. We heuristically post-process the CRF-labeled argument tokens in order to assign connectors to same-sentence or separate-sentence Arg1 and Arg2 spans.
4. The so-obtained explicit argument pairs are sense labeled by a (linear-kernel) SVM classifier with the connector word as the only feature, following the minimalist setting in Chiarcos and Schenk (2015).
5. As implicit relations we consider all intersentential relations which are not already part of an explicit relation. Same-sentence relations are ignored altogether.
10 https://taku910.github.io/crfpp/
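Step 5 of the list above can be illustrated with a short sketch. It assumes that implicit candidates are pairs of adjacent sentences and that explicit relations are given as pairs of sentence indices; both the representation and the function name are hypothetical, and paragraph boundaries are not taken into account.

```python
def implicit_candidates(num_sentences, explicit_pairs):
    """Return adjacent sentence pairs (i, i + 1) that are not already
    covered by an explicit relation.

    explicit_pairs holds (arg1_sentence, arg2_sentence) index pairs;
    same-sentence explicit relations never block a candidate.
    """
    covered = {tuple(sorted(pair)) for pair in explicit_pairs}
    return [(i, i + 1) for i in range(num_sentences - 1) if (i, i + 1) not in covered]


# Four sentences; sentences 1 and 2 are linked by an explicit relation,
# and sentence 3 contains a same-sentence explicit relation.
candidates = implicit_candidates(4, [(1, 2), (3, 3)])
```

Each remaining pair would then be passed to the implicit sense classifier with the first sentence as Arg1 and the second as Arg2.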

Chinese Supplementary Task Pipeline (CSTP)
For the provided argument pairs, we label explicit relations (i.e., those containing a non-empty connector) with the SVM classifier, which has been trained using only a single feature, the connector token. For all other relations, we again employ our neural network-based strategy described in Section 3. The overall architecture is exactly the same as for the English subtask; only the (hyper)parameters have been updated in accordance with the Chinese training data.

English Full Task
Table 1 shows the performance of our full-task pipeline (EFTP), which integrates our novel feedforward neural network architecture for implicit sense labeling. The figures suggest that our minimalist approach is highly competitive and can even outperform the best results from last year's competition in terms of F1-scores on two out of three evaluation sets (cf. last implicit column). Overall, with the integration of the combined systems by W&L and Stepanov et al. (2015), we can improve upon the state-of-the-art by an absolute increase in F1-score of 0.5% on the blind test set. This gain is marginal, but it is due solely to the successful re-classification of the already-provided (and therefore fixed) argument spans.
Measured on the development set, we found that the dependency depth weighting contributes an absolute improvement in accuracy of 1.5% for non-explicit relations.

English Supplementary Task
Without error propagation from argument identification, and with the gold arguments provided in the evaluation sets, the performance of our implicit sense labeling component is even better; cf. Table 2: on the two PDTB evaluation sets, F1-scores increase by 2.7% and 3.16% (absolute) and by 6.32% and up to 9.17% (relative) on the development and test section, respectively. Strikingly, however, the prediction quality on the blind test set is worse than expected. We assume that this is partly due to the (slightly) heterogeneous content of the annotated Wikinews, as opposed to the original Penn Discourse Treebank data, on which our system performs extraordinarily well.
Table 2: English sense-only task F1-scores.
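The relation between the absolute and relative figures can be made explicit with a short calculation. A relative gain is the absolute gain divided by the reference score, so the reference scores can be approximately recovered from the reported numbers; the values computed below are implied by the rounded figures in the text and are not taken from Table 2.

```python
def implied_reference(absolute_gain, relative_gain_percent):
    """Reference F1 implied by an absolute gain and the matching relative gain (in %)."""
    return absolute_gain / (relative_gain_percent / 100.0)


def relative_gain_percent(absolute_gain, reference):
    """Relative gain in percent for a given absolute gain over a reference score."""
    return 100.0 * absolute_gain / reference


dev_reference = implied_reference(2.7, 6.32)     # roughly 42.7 F1
test_reference = implied_reference(3.16, 9.17)   # roughly 34.5 F1
```

The same absolute gain therefore weighs more on the test section, where the implied reference score is lower.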

Chinese Full Task
This year's edition of the shared task has been the first to address shallow discourse parsing for Chinese newswire texts. Given no prior (directly comparable) results on Chinese SDP so far, we simply report the performance of our system on all evaluation sets in Table 3.
Table 3: Chinese full task F1-scores.

Chinese Supplementary Task
A final evaluation has been concerned with the sense-only labeling of gold-provided arguments for Chinese. We want to point out that the neural network architecture for implicit relations (with 70.59% F1-score on the dev set, cf.

Conclusion
In the context of the CoNLL 2016 Shared Task on shallow discourse parsing, we have described our participating system and its architecture. Specifically, we introduced a novel feedforward neural network-based component for implicit sense labeling whose only sources of information are pre-trained word embeddings and syntactic dependencies. Its highly generic and extremely simple design is the main advantage of this module. It has proven to be language-independent and easy to tune and optimize, and it does not require the use of hand-crafted, rich linguistic features. Still, its performance is highly competitive with the state-of-the-art on implicit sense labeling, and it builds a solid groundwork for future extensions.