Shallow Convolutional Neural Network for Implicit Discourse Relation Recognition

Implicit discourse relation recognition remains a serious challenge due to the absence of discourse connectives. In this paper, we propose a Shallow Convolutional Neural Network (SCNN) for implicit discourse relation recognition, which contains only one hidden layer but is effective in relation recognition. The shallow structure alleviates the overfitting problem, while the convolution and nonlinear operations help preserve the recognition and generalization ability of our model. Experiments on the benchmark data set show that our model achieves comparable or even better performance than current state-of-the-art systems.


Introduction
As a crucial task for discourse analysis, discourse relation recognition (DRR) aims to automatically identify the internal structure and logical relationship of coherent text (e.g., TEMPORAL, CONTINGENCY, EXPANSION, etc.). It provides important information to many other natural language processing systems, such as question answering (Verberne et al., 2007), information extraction (Cimiano et al., 2005), machine translation (Guzmán et al., 2014) and so on. Despite great progress in explicit DRR, where the discourse connectives (e.g., "because", "but", etc.) explicitly exist in the text (Miltsakaki et al., 2005; Pitler et al., 2008), implicit DRR remains a serious challenge because of the absence of discourse connectives (Prasad et al., 2008).
Conventional methods for implicit DRR rely directly on feature engineering, wherein researchers generally exploit various hand-crafted features, such as words, part-of-speech tags and production rules (Pitler et al., 2009; Lin et al., 2009; Louis et al., 2010; Wang et al., 2012; Park and Cardie, 2012; McKeown and Biran, 2013; Lan et al., 2013; Versley, 2013; Braud and Denis, 2014; Rutherford and Xue, 2014). Although these methods have proven successful, the manual features are labor-intensive to design and poor at capturing the intentional, semantic and syntactic aspects that govern discourse coherence (Li et al., 2014), which limits the effectiveness of these methods.
Recently, deep learning models have achieved remarkable results in natural language processing (Bengio et al., 2003; Bengio et al., 2006; Socher et al., 2011b; Socher et al., 2011a; Socher et al., 2013; Li et al., 2013; Kim, 2014). However, to the best of our knowledge, there is little deep learning work specifically for implicit DRR. The neglect of this important domain may be due to the following two reasons: (1) the distribution of discourse relations is rather unbalanced, so deep models generalize relatively poorly despite their powerful learning ability; (2) the training dataset for implicit DRR is relatively small, so overfitting problems become more prominent.
In this paper, we propose a Shallow Convolutional Neural Network (SCNN) for implicit DRR, with only one simple convolution layer on top of word vectors. On the one hand, the network structure is simple, so the overfitting issue can be alleviated; on the other hand, the convolution operation and nonlinear transformation help preserve the recognition ability of SCNN. This enables our model to generalize better on the test dataset. We evaluated our model on English implicit DRR using the PDTB-style corpus. Experimental results show that the proposed method obtains comparable or even better performance when compared against several baselines.

Model
In the Penn Discourse Treebank (PDTB) (Prasad et al., 2008), implicit discourse relations are annotated with connective expressions that best convey the implicit relations between two neighboring arguments, e.g.
Arg1: (But) our competitions say we overbid them
Arg2: who cares
Here the connective "But", which is annotated manually, is used to express the inferred COMPARISON relation.
We learn a classifier for implicit DRR based on a convolutional neural network. The overall model architecture is illustrated in Figure 1 (see footnote 1). In our model, each word in vocabulary $V$ corresponds to a $d$-dimensional dense, real-valued vector, and all words are stacked into a word embedding matrix $L \in \mathbb{R}^{d \times |V|}$, where $|V|$ is the vocabulary size.
Given an ordered list of $n$ words in an argument, we retrieve the $i$-th word representation $x_{v_i} \in \mathbb{R}^{d}$ from $L$ with its corresponding vocabulary index $v_i$. All word vectors in the argument produce the following output matrix:
$$X = (x_{v_1}, x_{v_2}, \ldots, x_{v_n}) \in \mathbb{R}^{d \times n}$$
Following previous work (Collobert et al., 2011; Socher et al., 2011a), for each row $r$ in $X$, we explore three convolutional operations to obtain three convolution features, average, min and max, as follows:
$$c^{avg}_r = \frac{1}{n} \sum_{i=1}^{n} X_{r,i}$$
$$c^{min}_r = \min(X_{r,1}, X_{r,2}, \ldots, X_{r,n})$$
$$c^{max}_r = \max(X_{r,1}, X_{r,2}, \ldots, X_{r,n})$$
1 For better illustration, we assume that the dimension of word vectors is 4 throughout this paper.
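The three row-wise operations described above can be sketched in plain Python. This is only an illustration and not the original implementation: the toy embedding table `L`, its values, and the function name `conv_features` are assumptions made for this example, and the 4-dimensional vectors simply mirror the illustrative dimension used in the paper.

```python
# Minimal sketch of the average/min/max features over one argument.
# The embedding values and all names below are illustrative assumptions.

def conv_features(word_vectors):
    """Return (c_avg, c_min, c_max), each of length d, for n d-dimensional vectors."""
    n = len(word_vectors)
    rows = list(zip(*word_vectors))      # d rows, row r holds X[r][0..n-1]
    c_avg = [sum(row) / n for row in rows]
    c_min = [min(row) for row in rows]
    c_max = [max(row) for row in rows]
    return c_avg, c_min, c_max

# Toy embedding table: word -> 4-dimensional vector (one column of L per word).
L = {
    "who":   [0.5, -1.0,  0.0, 2.0],
    "cares": [1.5,  1.0, -2.0, 0.0],
}

argument = ["who", "cares"]
X = [L[w] for w in argument]             # the n word vectors, i.e. the columns of X
c_avg, c_min, c_max = conv_features(X)
print(c_avg, c_min, c_max)
```

Each returned vector has length d no matter how many words the argument contains, which is what lets the model handle arguments of different lengths.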
In this way, SCNN is able to capture almost all important information inside $X$ (the highest, lowest and average values of each row). Besides, each convolution operation naturally deals with variable argument lengths (note that $c \in \mathbb{R}^{d}$ regardless of $n$). Back to Figure 1, we present $c^{avg}$, $c^{min}$ and $c^{max}$ in red, purple and green respectively. After obtaining the features of both arguments, we concatenate all of them into one vector $a \in \mathbb{R}^{6d}$, and then apply the tanh transformation and length normalization successively to generate the hidden layer:
$$h = \frac{\tanh(a)}{\|\tanh(a)\|}$$
where $h \in \mathbb{R}^{6d}$ is the hidden layer representation. The normalization operation scales the components of a feature vector to unit length. This, to some extent, eliminates the manifold differences among different features. Upon the hidden layer, we stack a Softmax layer for relation recognition,
$$y = f(Wh + b)$$
where $f$ is the softmax function, $W \in \mathbb{R}^{l \times 6d}$ is the parameter matrix, $b \in \mathbb{R}^{l}$ is the bias term, and $l$ is the number of relations.
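The hidden layer and the Softmax layer described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names are hypothetical, the order in which the six feature vectors are concatenated (Arg1 features first, then Arg2) is an assumption, and length normalization is taken to mean division by the Euclidean norm.

```python
import math

def hidden_layer(features_arg1, features_arg2):
    """Concatenate the six feature vectors, apply tanh, then scale to unit length.

    Each argument is a tuple (c_avg, c_min, c_max); the concatenation order
    used here is an assumption of this sketch.
    """
    a = [v for feats in (features_arg1, features_arg2) for vec in feats for v in vec]
    t = [math.tanh(v) for v in a]
    norm = math.sqrt(sum(v * v for v in t))
    return [v / norm for v in t] if norm > 0 else t

def softmax_layer(W, b, h):
    """Return softmax(W h + b) for an l x 6d matrix W and a length-l bias b."""
    z = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    z_max = max(z)                       # subtract the max for numerical stability
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example with d = 1, so h has 6 components, and l = 2 relations.
feats1 = ([0.5], [-1.0], [2.0])
feats2 = ([0.0], [-0.5], [1.0])
h = hidden_layer(feats1, feats2)
y = softmax_layer([[0.0] * 6, [0.0] * 6], [0.0, 0.0], h)
print(len(h), y)
```

With an all-zero `W` and `b`, the prediction is uniform over the relations, which is a convenient sanity check before training.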
To assess how well the predicted relation $y$ represents the real relation, we supervise it with the gold relation $g$ in the annotated training corpus using the traditional cross-entropy error,
$$E(y, g) = -\sum_{j=1}^{l} g_j \log y_j$$
Combined with the regularization error, the joint training objective function is
$$J(\theta) = \frac{1}{m} \sum_{t=1}^{m} E(y_t, g_t) + \frac{\lambda}{2} \|\theta\|^{2}$$
where $m$ is the number of training instances, $y_t$ is the $t$-th predicted distribution, $g_t$ is the corresponding gold relation, $\lambda$ is the regularization coefficient and $\theta$ denotes the parameters, including $L$, $W$ and $b$.
To train SCNN, we first employ the toolkit Word2Vec (Mikolov et al., 2013) to initialize the word embedding matrix $L$ using large-scale unlabeled data. Then, the L-BFGS algorithm is applied to fine-tune the parameters $\theta$.
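The training objective described above can be sketched in a few lines of Python. This is a minimal illustration rather than the original training code: the function names are hypothetical, the parameters are flattened into a single list `theta`, and the regularization term is taken to be an L2 penalty of the form (lambda / 2) * ||theta||^2, which is an assumption of this sketch.

```python
import math

def cross_entropy(y, g):
    """E(y, g) = -sum_j g_j * log(y_j) for a predicted distribution y and gold g."""
    return -sum(gj * math.log(yj) for yj, gj in zip(y, g) if gj > 0)

def joint_objective(predictions, golds, theta, lam):
    """Mean cross-entropy over m instances plus an assumed L2 penalty."""
    m = len(predictions)
    data_term = sum(cross_entropy(y, g) for y, g in zip(predictions, golds)) / m
    reg_term = 0.5 * lam * sum(p * p for p in theta)   # (lam / 2) * ||theta||^2
    return data_term + reg_term

# Two toy instances with uniform predictions over two relations.
predictions = [[0.5, 0.5], [0.5, 0.5]]
golds = [[1, 0], [0, 1]]
loss = joint_objective(predictions, golds, theta=[1.0, 2.0], lam=0.1)
print(loss)
```

In practice this scalar and its gradient would be handed to an L-BFGS optimizer; that step is omitted here to keep the sketch self-contained.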

Experiments
We conducted a series of experiments on the English implicit DRR task. After a brief description of the experimental setup and the baseline systems, we investigate the effectiveness of our method in detail.

Setup
For comparison with other systems, we formulated the task as four separate one-against-all binary classification problems: one for each top level sense of implicit discourse relations (Pitler et al., 2009).
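The one-against-all formulation above can be illustrated with a short sketch. The function name, the tuple layout of an instance, and the simplification that each instance carries a single gold sense are assumptions made for this example.

```python
# The four top-level senses named in this paper.
TOP_LEVEL_SENSES = ["COMPARISON", "CONTINGENCY", "EXPANSION", "TEMPORAL"]

def one_vs_all_tasks(instances):
    """Build one binary dataset per sense.

    `instances` is a list of (arg1, arg2, gold_sense) tuples; in each binary
    task the label is 1 if the gold sense matches the target sense, else 0.
    """
    tasks = {}
    for sense in TOP_LEVEL_SENSES:
        tasks[sense] = [(a1, a2, 1 if gold == sense else 0)
                        for a1, a2, gold in instances]
    return tasks

toy = [
    ("our competitions say we overbid them", "who cares", "COMPARISON"),
    ("a toy first argument", "a toy second argument", "TEMPORAL"),
]
tasks = one_vs_all_tasks(toy)
print([label for _, _, label in tasks["COMPARISON"]])
```

Each of the four resulting datasets is then used to train and evaluate a separate binary classifier.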

Baselines
We compared our model against the following baseline methods:
• SVM: This method learns a support vector machine (SVM) classifier with the labeled data.
• TSVM: This method learns a transductive SVM (TSVM) classifier given the labeled data and unlabeled data. We extracted unlabeled data from the above-mentioned 1.02M sentences. After filtering out the noisy ones, we finally obtained 0.11M unlabeled instances, each of which contains only two clauses.
• RAE: This method learns a recursive autoencoder (RAE) classifier with the labeled data. We first utilized standard RAEs to represent arguments, and then stacked a Softmax layer upon them. The hyperparameters were set as follows: word dimension 64, balance factor for reconstruction error 0.10282 and regularization factor 1e-5. Word embeddings are initialized via Word2Vec.
Rutherford and Xue (2014) show that the Brown cluster pair feature is very impactful in implicit DRR. This feature is superior to one-hot representations of the interactions between two arguments, such as the cross-argument word pair features in our baseline methods. We therefore conducted two additional experiments for comparison:
• Add-Bro: This method learns an SVM classifier using baseline system features along with the Brown cluster pair feature.
In addition, to further verify the effectiveness of normalization, we also compared against the SCNN model without normalization (SCNN-No-Norm).
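To make the Brown cluster pair feature mentioned above concrete, the following sketch forms all cluster pairs across the two arguments. It is an illustration only: the cluster bit strings in `word_to_cluster` are invented, the function name is hypothetical, and skipping words without a cluster assignment is an assumption.

```python
def brown_cluster_pairs(arg1_tokens, arg2_tokens, word_to_cluster):
    """Return the set of (cluster_in_arg1, cluster_in_arg2) pairs.

    Words missing from `word_to_cluster` are skipped (an assumption here).
    """
    c1 = {word_to_cluster[w] for w in arg1_tokens if w in word_to_cluster}
    c2 = {word_to_cluster[w] for w in arg2_tokens if w in word_to_cluster}
    return {(a, b) for a in c1 for b in c2}

# Hypothetical cluster assignments; real Brown clusters are induced from data.
word_to_cluster = {"overbid": "0110", "who": "1011", "cares": "1011"}

pairs = brown_cluster_pairs(["we", "overbid", "them"], ["who", "cares"],
                            word_to_cluster)
print(pairs)
```

Because many words share a cluster, the pair space is much smaller than the space of raw word pairs, which is the usual motivation for this feature.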
Throughout our experiments, we used the toolkit SVM-light (Joachims, 1999) in all the SVM-related experiments. Following previous work (Pitler et al., 2009; Lin et al., 2009), we adopted the following features for baseline methods:
• Bag of Words: Three binary features that check whether a word occurs in Arg1, Arg2 and both arguments.
• Cross-Argument Word Pairs: We group all words from Arg1 and Arg2 into two sets $W_1$ and $W_2$ respectively, then extract every possible word pair $(w_i, w_j)$ with $w_i \in W_1$, $w_j \in W_2$ as features.
• Polarity: The counts of positive, negated positive, negative and neutral words in Arg1 and Arg2 according to the MPQA corpus (English). Their cross products are used as features.
• First-Last, First3: The first and last words of each argument, the pair of the first words in the two arguments, the pair of the last words in the two arguments, and the first three words of each argument are used as features.
• Production Rules: We extract all production rules from the syntactic trees of arguments. We define three binary features for each rule to check whether this rule appears in Arg1, Arg2 and both arguments.
• Dependency Rules: We also extract all dependency rules from the dependency trees of arguments. Similarly, we define three binary features for each rule to check whether this rule appears in Arg1, Arg2 and both arguments.
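Two of the baseline features above, bag of words and cross-argument word pairs, can be sketched as set-valued feature extractors. The function names and the tuple encoding of features are assumptions of this example, and tokenization is reduced to pre-split token lists.

```python
def bag_of_words_features(arg1_tokens, arg2_tokens):
    """Binary features: word occurs in Arg1, in Arg2, and in both arguments."""
    w1, w2 = set(arg1_tokens), set(arg2_tokens)
    feats = set()
    for w in w1 | w2:
        if w in w1:
            feats.add(("arg1", w))
        if w in w2:
            feats.add(("arg2", w))
        if w in w1 and w in w2:
            feats.add(("both", w))
    return feats

def cross_argument_word_pairs(arg1_tokens, arg2_tokens):
    """All pairs (w_i, w_j) with w_i from Arg1 and w_j from Arg2."""
    return {(wi, wj) for wi in set(arg1_tokens) for wj in set(arg2_tokens)}

arg1 = ["we", "care"]
arg2 = ["who", "care"]
bow = bag_of_words_features(arg1, arg2)
word_pairs = cross_argument_word_pairs(arg1, arg2)
print(sorted(bow))
print(sorted(word_pairs))
```

A feature that is present in the returned set has value 1 for the instance; every other feature is implicitly 0.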
In order to collect bag of words, production rules, dependency rules, and cross-argument word pairs, we used a frequency cutoff of 5 to remove rare features, following Lin et al. (2009).
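The frequency cutoff can be sketched with a counter over the training instances. This is an illustration under assumptions: the function name is hypothetical, frequency is counted as the number of instances containing a feature, and a feature is kept when its count is at least the cutoff (the text does not say whether the boundary is inclusive).

```python
from collections import Counter

def apply_frequency_cutoff(instance_features, cutoff=5):
    """Drop features that occur in fewer than `cutoff` instances.

    `instance_features` is a list with one collection of features per instance.
    Keeping features with count >= cutoff is an assumption of this sketch.
    """
    counts = Counter(f for feats in instance_features for f in set(feats))
    kept = {f for f, c in counts.items() if c >= cutoff}
    return [sorted(f for f in set(feats) if f in kept)
            for feats in instance_features]

# Feature "a" occurs in 6 instances, "b" in 5, and "c" in only 1.
data = [["a", "b"]] * 5 + [["a", "c"]]
filtered = apply_frequency_cutoff(data, cutoff=5)
print(filtered)
```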

Results and Analysis
All models are evaluated by both accuracy and F1 score, on account of the imbalance in the test set. For better analysis, we also report precision and recall. Table 2 summarizes the performance of different models. On the whole, the F1 scores for implicit DRR are relatively low on average: about 32%, 50%, 65% and 28% for COMP., CONT., EXP. and TEMP. respectively. This illustrates the difficulty of implicit DRR. Although we expected the unlabeled data to bring improvement, we observed negative results for TSVM on two relations: COMP. and CONT. dropped by 1.14% and 0.79% respectively (see footnote 7). The F1 scores of TEMP. and EXP. improved (by 1.27% and 0.63% respectively). The main reason may be that our unlabeled data is not strictly from the discourse corpus.
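The evaluation metrics named above can be computed for a single binary task as in the following sketch. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def binary_metrics(gold, pred):
    """Return accuracy, precision, recall and F1 for 0/1 label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(gold)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# An imbalanced toy test set: predicting all negatives gives high accuracy
# but zero F1, which is why both numbers are reported.
gold = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10
print(binary_metrics(gold, all_negative))
```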
Incorporating Brown cluster pair features enhances the recognition of COMP. and CONT. In particular, No-Cro achieves the best result in COMP. (34.22%). However, we found no consistent improvement in EXP. and TEMP.: No-Cro loses 2.74% in TEMP.; Add-Bro loses 0.88% and 2.12% in EXP. and TEMP. respectively. This result is inconsistent with the finding of Rutherford and Xue (2014). The reason may lie in the training strategy: we used sampling to address the unbalanced dataset, whereas they reweighted training samples.
Compared with SVM-based models, RAE performs poorly on three relations, the exception being EXP., which has the largest training dataset. RAE may need more labeled training data for better results. In contrast, SCNN models perform remarkably well, producing comparable or even better results. Without normalization, SCNN-No-Norm gains 0.57%, 2.98% and 3.1% in F1 score for CONT., EXP. and TEMP. respectively, but loses 2.11% for COMP. We obtain further improvement using SCNN with normalization: 0.71%, 2.17%, 6.52% and 4.93% for COMP., CONT., EXP. and TEMP. respectively. This suggests that normalization is useful for the generalization of shallow models.
From Table 2, we found that our models do not achieve consistent improvements in precision, but benefit greatly from gains in recall. Besides, our model works quite well on small datasets (both accuracy and F1 score are improved in TEMP.). Together, these results suggest that our model is suitable for implicit DRR.

Conclusion and Future Work
In this paper, we have presented a convolutional neural network based approach to learning better DRR classifiers. The method is simple but effective for relation recognition. Experimental results show that our approach achieves satisfactory performance against the baseline models.
In the future, we will verify our model on other languages, for example, Chinese and Arabic. Besides, since our model is general to classification problems, we would like to investigate its effectiveness on other similar tasks, such as sentiment classification and movie review classification.
7 Unless otherwise specified, all improvements and declines are against SVM.