Discourse Relation Sense Classification Using Cross-argument Semantic Similarity Based on Word Embeddings

This paper describes our system for the CoNLL 2016 Shared Task's supplementary task on Discourse Relation Sense Classification. Our official submission employs a Logistic Regression classifier with several cross-argument similarity features based on word embeddings and performs with overall F-scores of 64.13 for the Dev set, 63.31 for the Test set and 54.69 for the Blind set, ranking first in the Overall ranking for the task. We compare the feature-based Logistic Regression classifier to different Convolutional Neural Network architectures. After the official submission we enriched our model for Non-Explicit relations by including similarities of explicit connectives with the relation arguments, and part of speech similarities based on modal verbs. This improved our Non-Explicit result by 1.46 points on the Dev set and by 0.36 points on the Blind set.


Introduction
The CoNLL 2016 Shared Task on Shallow Discourse Parsing (Xue et al., 2016) focuses on identifying individual discourse relations present in text. This year the shared task has a main track that requires end-to-end discourse relation parsing and a supplementary task that is restricted to discourse relation sense classification. For the main task, participants are required to build a system that, given raw text as input, identifies the arguments Arg1 and Arg2 that are related in the discourse and classifies the type of the relation, which can be Explicit, Implicit, AltLex or EntRel. A further attribute to be detected is the relation Sense, which can be one of 15 classes organized hierarchically under 4 parent classes. With this work we participate in the Supplementary Task on Discourse Relation Sense Classification in English. The task is to predict the discourse relation sense when the arguments Arg1 and Arg2 are given, as well as the Discourse Connective in case of explicit marking.
In our contribution we compare different approaches, including a Logistic Regression classifier using similarity features based on word embeddings and two Convolutional Neural Network architectures. We show that an approach using only word embeddings retrieved from word2vec (Mikolov et al., 2013) and cross-argument similarity features is simple and fast, and yields results that rank first in the Overall, second in the Explicit and fourth in the Non-Explicit sense classification task. Our system's code is publicly accessible.

Related Work
This year's CoNLL 2016 Shared Task on Shallow Discourse Parsing (Xue et al., 2016) is the second edition of the shared task, after the CoNLL 2015 Shared Task on Shallow Discourse Parsing. The difference from last year's task is that there is a new Supplementary Task on Discourse Relation Sense Classification, where participants are not required to build an end-to-end discourse relation parser but can participate with a sense classification system only. Discourse relations in the task are divided into two major types: Explicit and Non-Explicit (Implicit, EntRel and AltLex). Detecting the sense of Explicit relations is an easy task: given the discourse connective, the relation sense can be determined with very high accuracy (Pitler et al., 2008). A more challenging task is to detect the sense of Non-Explicit discourse relations, as they usually do not have a connective that can help to determine their sense. In last year's task, Non-Explicit relations were tackled with features based on Brown clusters (Chiarcos and Schenk, 2015; Wang and Lan, 2015; Stepanov et al., 2015), VerbNet classes (Kong et al., 2015; Lalitha Devi et al., 2015) and the MPQA polarity lexicon (Wang and Lan, 2015; Lalitha Devi et al., 2015). Earlier work (Rutherford and Xue, 2014) employed Brown clusters and coreference patterns to identify senses of implicit discourse relations in naturally occurring text. More recent work improved inference of implicit discourse relations via classifying explicit discourse connectives, extending prior research (Marcu and Echihabi, 2002; Sporleder and Lascarides, 2008). Several neural network approaches have been proposed, e.g., Multi-task Neural Networks (Liu et al., 2016) and Shallow Convolutional Neural Networks (Zhang et al., 2015). Braud and Denis (2015) compare word representations for implicit discourse relation classification and find that denser representations systematically outperform sparser ones.

Method
We divide the task into two subtasks and develop separate classifiers for Explicit and Non-Explicit discourse relation sense classification, as shown in Figure 1. We do so because the official evaluation is divided into Explicit and Non-Explicit (Implicit, AltLex, EntRel) relations and we want to be able to tune our system accordingly. During training, the relation type is provided in the data, and samples are processed by the respective classifier models in Process 1 (Non-Explicit) and Process 2 (Explicit). During testing the gold Type attribute is not provided, so we use a simple heuristic: we assume that Explicit relations have connectives and that Non-Explicit relations do not.
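The test-time routing heuristic can be sketched as follows. This is a minimal illustration: the helper name and the `Connective`/`RawText` field names are assumptions about the relation record layout, not details confirmed by the text above.

```python
def relation_family(relation: dict) -> str:
    """Guess Explicit vs. Non-Explicit from the presence of a connective.

    The "Connective" and "RawText" keys are assumed field names for the
    relation record; adapt them to the actual input format.
    """
    connective = relation.get("Connective") or {}
    connective_text = connective.get("RawText", "")
    # A non-empty connective string routes the sample to the Explicit model.
    return "Explicit" if connective_text.strip() else "Non-Explicit"
```

A relation routed as "Explicit" would then be passed to the Explicit sense classifier, and every other relation to the Non-Explicit one.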
As the task requires that the actual evaluation is executed on the provided server, we save the models so we can load them later during evaluation.
For classifying Explicit connectives we follow a feature-based approach, developing features based on word embeddings and semantic similarity measured between parts of the arguments Arg1 and Arg2 of the discourse relations. Classification is into one of the given fifteen classes of relation senses. For detecting Non-Explicit discourse relations we also make use of a feature-based approach, but in addition we experiment with two models based on Convolutional Neural Networks.
Figure 1: System architecture: Training and evaluating models for Explicit and Non-Explicit discourse relation sense classification.

Feature-based approach
For each relation, we extract features from Arg1, Arg2 and the Connective, in case the type of the relation is considered Explicit.

Semantic Features using Word Embeddings.
In our models we only develop features based on word embedding vectors. We use word2vec (Mikolov et al., 2013) word embeddings with vector size 300, pre-trained on Google News texts. For computing similarity between embedding representations, we employ cosine similarity:
$$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} \quad (1)$$
Embedding representations for Arguments and Connectives. For each argument Arg1, Arg2 and the Connective (for Explicit relations) we construct a centroid vector (2) from the embedding vectors $\vec{w}_i$ of all words $w_i$ in their respective surface yield:
$$centroid(w_1 \ldots w_n) = \frac{1}{n} \sum_{i=1}^{n} \vec{w}_i \quad (2)$$
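As a minimal sketch, cosine similarity and the centroid of an argument's word vectors can be computed as below. The function names and the toy two-dimensional vectors are illustrative, and skipping out-of-vocabulary tokens is an assumption rather than something stated in the text.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(tokens: Sequence[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Component-wise mean of the embeddings of all in-vocabulary tokens."""
    # Assumption: tokens without an embedding are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embeddings for illustration only (the paper uses 300-dimensional vectors).
toy_embeddings = {"rain": [1.0, 0.0], "wet": [0.0, 1.0]}
arg_centroid = centroid(["rain", "wet", "unknown"], toy_embeddings, dim=2)
```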
Cross-argument Semantic Vector Similarities.
We calculate various similarity features on the basis of the centroid word vectors for the arguments and the connective, as well as on parts of the arguments.
Arg1 to Arg2 similarity. We assume that for given arguments Arg1 and Arg2 that stand in a specific discourse relation sense, their centroid vectors should stand in a specific similarity relation to each other. We thus use their cosine similarity as a feature.
Maximized similarity. Here we rank each word in Arg2's text according to its similarity with the centroid vector of Arg1, and we compute the average similarity for the top-ranked N words. We chose the similarity scores of the top 1,2,3 and 5 words as features. The assumption is that the average similarity between the first argument (Arg1) and the top N most similar words in the second argument (Arg2) might imply a specific sense.
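The maximized similarity features could be computed along these lines. This is a sketch under stated assumptions: the function name is hypothetical, and averaging over all available words when Arg2 has fewer than N words is an illustrative choice.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def maximized_similarity_features(
    arg1_centroid: Sequence[float],
    arg2_vectors: Sequence[Sequence[float]],
    top_ns: Sequence[int] = (1, 2, 3, 5),
) -> List[float]:
    """Average similarity of the top-N Arg2 words to the Arg1 centroid, per N."""
    # Rank Arg2 words by their similarity to the Arg1 centroid, best first.
    ranked = sorted((cosine(arg1_centroid, v) for v in arg2_vectors), reverse=True)
    features = []
    for n in top_ns:
        top = ranked[:n]  # fewer than n words: average what is available
        features.append(sum(top) / len(top) if top else 0.0)
    return features


# Toy vectors for illustration only.
feats = maximized_similarity_features([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```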
Aligned similarity. For each word in Arg1, we choose the most similar word from the yield of Arg2 and we take the average of all best word pair similarities, as suggested in Tran et al. (2015).
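A minimal sketch of the aligned similarity feature follows. The function name is hypothetical, and returning 0.0 for an empty argument is an assumption made to keep the sketch total.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aligned_similarity(
    arg1_vectors: Sequence[Sequence[float]],
    arg2_vectors: Sequence[Sequence[float]],
) -> float:
    """Mean, over Arg1 words, of the best similarity to any Arg2 word."""
    if not arg1_vectors or not arg2_vectors:
        return 0.0  # assumption: empty arguments contribute a zero feature
    best_per_word = [
        max(cosine(w1, w2) for w2 in arg2_vectors) for w1 in arg1_vectors
    ]
    return sum(best_per_word) / len(best_per_word)


# Toy vectors for illustration only.
score = aligned_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

Note that the measure is asymmetric: swapping the two arguments generally gives a different value.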
Part of speech (POS) based word vector similarities. We used part of speech tags from the parsed input data provided by the organizers, and computed similarities between centroid vectors of words with a specific tag from Arg1 and the centroid vector of Arg2. Extracted features for POS similarities are symmetric: for example we calculate the similarity between Nouns from Arg1 with Pronouns from Arg2 and the opposite. The assumption is that some parts of speech between Arg1 and Arg2 might be closer than other parts of speech depending on the relation sense.
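One possible reading of the symmetric POS features is sketched below: for each tag pair, the centroid of the Arg1 words carrying the first tag is compared with the centroid of the Arg2 words carrying the second tag, and then the roles of the tags are swapped. The function names, the feature keys and the prefix matching of tags (so that "NN" also covers "NNS") are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

TaggedVector = Tuple[Sequence[float], str]  # (word vector, POS tag)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for empty or zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pos_centroid(tagged: Sequence[TaggedVector], tag_prefix: str) -> List[float]:
    """Centroid of the vectors whose tag starts with tag_prefix ([] if none)."""
    vectors = [vec for vec, tag in tagged if tag.startswith(tag_prefix)]
    if not vectors:
        return []
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pos_pair_features(
    arg1: Sequence[TaggedVector],
    arg2: Sequence[TaggedVector],
    tag_pairs: Sequence[Tuple[str, str]],
) -> Dict[str, float]:
    """Similarity for each tag pair in both directions (symmetric features)."""
    features = {}
    for tag_a, tag_b in tag_pairs:
        features[f"arg1_{tag_a}__arg2_{tag_b}"] = cosine(
            pos_centroid(arg1, tag_a), pos_centroid(arg2, tag_b)
        )
        features[f"arg1_{tag_b}__arg2_{tag_a}"] = cosine(
            pos_centroid(arg1, tag_b), pos_centroid(arg2, tag_a)
        )
    return features


# Toy tagged vectors for illustration only.
toy_arg1 = [([1.0, 0.0], "NN"), ([0.0, 1.0], "PRP")]
toy_arg2 = [([1.0, 0.0], "PRP"), ([1.0, 0.0], "NNS")]
pos_feats = pos_pair_features(toy_arg1, toy_arg2, [("NN", "PRP"), ("MD", "VB")])
```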
Explicit discourse connectives similarity. We collected 103 explicit discourse connectives from the Penn Discourse Treebank (Prasad et al., 2008) annotation manual and constructed vector representations for all of them according to (2), where for multi-token connectives we calculate a centroid vector from all tokens in the connective. For every discourse connective vector representation we calculate the similarity with the centroid vector representation of all Arg1 and Arg2 tokens. This adds 103 similarity features for every relation. We use these features for implicit discourse relation sense classification only.
We assume that knowledge about the relation sense can be inferred by calculating the similarity between the semantic information of the relation arguments and specific discourse connectives.
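The connective similarity features might be computed as in the sketch below. Since one feature per connective is reported, the sketch assumes a single centroid over the combined Arg1 and Arg2 tokens; this interpretation, the function names and the three-entry toy connective list are all illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for empty or zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(tokens: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Mean embedding of the in-vocabulary tokens ([] if none are known)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return []
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def connective_features(
    connectives: Sequence[str],
    arg1_tokens: Sequence[str],
    arg2_tokens: Sequence[str],
    embeddings: Dict[str, Vector],
) -> List[float]:
    """One similarity per connective against the joint argument centroid."""
    args_centroid = centroid(list(arg1_tokens) + list(arg2_tokens), embeddings)
    return [
        # Multi-token connectives are represented by the centroid of their tokens.
        cosine(centroid(conn.split(), embeddings), args_centroid)
        for conn in connectives
    ]


# Toy data for illustration only (the paper uses 103 connectives).
toy_embeddings = {
    "but": [1.0, 0.0], "as": [0.0, 1.0], "if": [0.0, 1.0],
    "rain": [1.0, 0.0], "sun": [1.0, 0.0],
}
conn_feats = connective_features(
    ["but", "as if", "whereas"], ["rain"], ["sun"], toy_embeddings
)
```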
Our feature-based approach yields very good results on Explicit relation sense classification, with an F-score of 0.912 on the Dev set. In Mihaylov and Nakov (2016), combining features based on word embeddings with similarity between arguments yielded state-of-the-art performance in a similar task setup in Community Question Answering, where two text arguments (question and answer) are to be ranked.

CNNs for sentence classification
We also experiment with Convolutional Neural Network architectures to detect Implicit relation senses. We have implemented the CNN model proposed in Kim (2014), as it proved successful in tasks like sentence classification and modal sense classification (Marasović and Frank, 2016). This model (Figure 2) defines one convolutional layer that operates on pre-trained word2vec vectors trained on the Google News dataset. As shown in Kim (2014), this architecture yields very good results for various single-sentence classification tasks. For our relation classification task we input the concatenated tokens of Arg1 and Arg2.

Modified ARC-1 CNN for sentence matching
An alternative model we try for Implicit discourse relation sense classification is a modification of the ARC-1 architecture proposed for sentence matching by Hu et al. (2015). We will refer to this model as ARC-1M. The modified architecture is depicted in Figure 3. The inputs to the model are two sentences S_x and S_y, the sequences of token vector representations of Arg1 and Arg2, respectively.
Here, separate convolution and max-pooling layers are constructed for the two input sentences, and the results of the max-pooling layers are concatenated and fed to a single final SoftMax layer. The original ARC-1 architecture uses a Multilayer Perceptron layer instead of SoftMax. For our implementation we use TensorFlow (Abadi et al., 2015).
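The forward pass of such a two-branch architecture can be sketched in plain Python as below. This is not the TensorFlow implementation: the function names, the random toy weights, the ReLU activation, the zero-padding of short inputs and the reuse of one filter bank for both arguments are assumptions, and training and dropout are omitted.

```python
import math
import random


def conv_max_pool(sentence, filters, window):
    """ReLU convolution over token windows followed by max-over-time pooling."""
    dim = len(sentence[0])
    # Assumption: zero-pad short inputs so that at least one window fits.
    padded = sentence + [[0.0] * dim] * max(0, window - len(sentence))
    pooled = []
    for weights, bias in filters:
        best = 0.0  # ReLU outputs are never negative
        for start in range(len(padded) - window + 1):
            flat = [x for token in padded[start:start + window] for x in token]
            best = max(best, sum(w * x for w, x in zip(weights, flat)) + bias)
        pooled.append(best)
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def arc1m_forward(arg1, arg2, filter_bank, out_weights, out_bias):
    """Convolve and pool each argument, concatenate, then apply softmax."""
    features = []
    for sentence in (arg1, arg2):
        for window, filters in filter_bank.items():
            features.extend(conv_max_pool(sentence, filters, window))
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(out_weights, out_bias)
    ]
    return softmax(scores)


# Toy configuration for illustration only: 4-dim vectors, 2 filters per window.
rng = random.Random(0)
dim, n_filters, n_classes = 4, 2, 3
filter_bank = {
    w: [([rng.uniform(-1, 1) for _ in range(w * dim)], 0.0) for _ in range(n_filters)]
    for w in (3, 4, 5)
}
n_features = 2 * len(filter_bank) * n_filters  # two arguments
out_weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_classes)]
out_bias = [0.0] * n_classes
arg1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
arg2 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2)]
probs = arc1m_forward(arg1, arg2, filter_bank, out_weights, out_bias)
```

The output is a probability distribution over the (here three) toy classes; in the actual task there would be one output per relation sense.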

Classifier settings
For our feature-based approach we concatenate the extracted features into a feature vector, scale their values to the 0 to 1 range, and feed the vectors to a classifier. We train and evaluate an L2-regularized Logistic Regression classifier with the LIBLINEAR (Fan et al., 2008) solver as implemented in scikit-learn (Pedregosa et al., 2011). We tuned the classifier with different values of the C (cost) parameter and chose C=0.1, as it yielded the best accuracy in 5-fold cross-validation on the training set. We use these settings for all experiments that use the logistic regression classifier.
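A minimal scikit-learn sketch of this setup is shown below. The helper name and the toy data are hypothetical, and wrapping the model in `OneVsRestClassifier` is an assumption made so that the LIBLINEAR solver handles the multi-class case explicitly across scikit-learn versions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def build_sense_classifier(cost: float = 0.1):
    """Min-max scaling to [0, 1] followed by one-vs-rest logistic regression."""
    # L2 regularization is scikit-learn's default penalty for LogisticRegression.
    base = LogisticRegression(C=cost, solver="liblinear")
    return make_pipeline(MinMaxScaler(), OneVsRestClassifier(base))


# Toy feature vectors and sense labels for illustration only.
X_train = [[0.1, 5.0], [0.2, 4.0], [0.9, 1.0], [0.8, 0.5], [0.5, 9.0], [0.4, 8.0]]
y_train = ["Contrast", "Contrast", "Reason", "Reason", "EntRel", "EntRel"]

model = build_sense_classifier()
model.fit(X_train, y_train)
predictions = model.predict(X_train)
```

Putting the scaler inside the pipeline means the scaling ranges learned on the training data are reused unchanged at prediction time.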

Official submission (LR with E+Sim)
Our official submission uses the feature-based approach described in Section 3.1 for both Explicit and Non-Explicit relations with all features described above, except for the Explicit connective similarities (Conn) and Modal verbs similarities (POS MD), which were added after the submission deadline. Table 1 presents the results by sense for our official submission, run on the TIRA evaluation platform (Potthast et al., 2014). We also compare our official and improved system results to the best performing system in the CoNLL 2015 Shared Task (Wang and Lan, 2015) and the best performing systems in the CoNLL 2016 Discourse Relation Sense Classification task. With our official system we rank first in the Overall ranking. We rank second in the Explicit ranking, only 0.07 behind the best system, and fourth in the Non-Explicit ranking, with a larger difference of 2.75 behind the best system. We can see that, similar to Wang and Lan (2015), our system performs well in classifying both types, while this year's winning systems perform well in their winning relation type and much worse in the other.

Further experiments on Non-Explicit relations
In Table 2 we compare different models for Non-Explicit relation sense classification trained on the Train and evaluated on the Dev set.
Embeddings only experiments. The first three columns show the results obtained with three approaches that use only features based on word embeddings. We use word2vec word embeddings. We also experimented with pre-trained dependency-based word embeddings (Levy and Goldberg, 2014), but this yielded slightly worse results on the Dev set.
Table 1: Evaluation of our official submission system, trained on Train 2016 and evaluated on the Dev, Test and Blind sets. Comparison of our official system and our improved system with the official results of the CoNLL 2015 Shared Task's best system (Wang and Lan, 2015) and the CoNLL 2016 Shared Task's best systems for Explicit (Jain, 2016) and Non-Explicit (Rutherford and Xue, 2016) relations. F-scores are reported.

Logistic Regression (LR
system parameters as proposed in Kim (2014): filter windows of size 3, 4 and 5 with 100 feature maps each, dropout probability 0.5 and mini-batches of size 50. We train the model for 50 epochs.

CNN ARC-1M experiments
The CNN ARC-1M column shows results from our modification of the ARC-1 CNN for sentence matching (see Section 3.3), fed with the vector representations of the Arg1 and Arg2 word tokens from the word2vec word embeddings. We use filter windows of size 3, 4 and 5 with 100 feature maps each, shared between the two argument convolutions, dropout probability 0.5 and mini-batches of size 50, as proposed in Kim (2014). We train the model for 50 epochs.
Comparing LR, CNN and CNN ARC-1M according to their ability to classify different classes, we observe that CNN ARC-1M performs best in detecting Contingency.Cause.Reason and Contingency.Cause.Result, with a substantial margin over the other two models. The CNN model outperforms the LR and CNN ARC-1M models for Comparison.Contrast, EntRel, Expansion.Conjunction and Expansion.Instantiation, but cannot capture any Expansion.Restatement relations, which leads to worse overall results compared to the others. These insights show that the Neural Network models are able to capture some dependencies between the relation arguments. For Contingency.Cause.Result, CNN ARC-1M even clearly outperforms the LR models enhanced with similarity features (discussed below). We also implemented a modified version of the CNN ARC-2 architecture of Hu et al. (2015), which uses a cross-argument convolution layer, but it yielded much worse results.
LR with Embeddings + Features. The last three columns in Table 2 show the results of our feature-based Logistic Regression approach with different feature groups on top of the embedding representations of the arguments. Column E+Sim shows the results from our official submission, and the other two columns show results for additional features that we added after the submission deadline.
Adding the cross-argument similarity features (without the POS modal verbs similarities) improves the overall result of the embeddings-only Logistic Regression (LR) baseline significantly from F-score 35.54 to 40.32. It also improves the result on almost all senses individually. Adding Explicit connective similarities features improves the All result by 0.67 points (E+Sim+Conn). It also improves the performance on Tem-

Conclusion and Future work
In this paper we describe our system for participation in the CoNLL Shared Task on Discourse Relation Sense Classification. We compare different approaches, including Logistic Regression classifiers using features based on word embeddings and cross-argument similarity, and two Convolutional Neural Network architectures. Our official submission uses a logistic regression classifier with several similarity features and performs with overall F-scores of 64.13 for the Dev set, 63.31 for the Test set and 54.69 for the Blind set. After the official submission we improved our system for Non-Explicit relations by adding similarities of explicit connectives with the relation arguments and part of speech similarities based on modal verbs. We could show that dense representations of arguments and connectives jointly with cross-argument similarity features calculated over word embeddings yield competitive results, both for Explicit and Non-Explicit relations. First results in adapting CNN models to the task show that further gains can be obtained, beyond LR models.
In future work we want to explore further deep learning approaches and adapt them for discourse relation sense classification, using among others Recurrent Neural Networks and CNNs for matching sentences, as well as other neural network models that incorporate correlation between the input arguments, such as the MTE-NN system (Guzmán et al., 2016a; Guzmán et al., 2016b). Since we observe that the neural network approaches improve on the LR Embeddings-only models for most of the senses, in future work we could combine these models with our well-performing similarity features. Combining the output of a deep learning system with additional features has been shown to achieve state-of-the-art performance in other tasks (Kreutzer et al., 2015).