The DCU Discourse Parser: A Sense Classification Task

This paper describes the discourse parsing system developed at Dublin City University for participation in the CoNLL 2015 shared task. We participated in two tasks: a connective and argument identification task and a sense classification task. This paper focuses on the latter task and especially the sense classification for implicit connectives.


Introduction
This paper describes the discourse parsing system developed at Dublin City University for participation in the CoNLL 2015 shared task. We participated in two tasks: a connective and argument identification task and a sense classification task. This paper focuses on the latter task.
We divide the whole process into two stages: the first stage identifies triples (Arg1, Conn, Arg2) and pairs (Arg1, Arg2), while the second stage classifies the sense of each identified triple or pair. The first stage, the identification of connectives and arguments, is described in (Wang et al., 2015); it is based on the framework of (Lin et al., 2009) and is presented in this shared task as a separate paper. Hence, we omit a detailed description of the first stage (see (Wang et al., 2015) for the identification of connectives and arguments). This paper focuses on the second stage, sense classification.

Sense Classification
We use off-the-shelf classifiers with four kinds of features: relational phrase embedding, production, word-pair, and heuristic features. Among them, we test a method that incorporates relational phrase embedding features for Arg1 and Arg2; production and word-pair features are reported in (Lin et al., 2014), and heuristic features, which are specific to explicit sense classification, are also described in (Lin et al., 2014). We consider embedding models that lead to two different types of intermediate representations. The relational phrase embedding model treats the dependencies between words uniformly, without considering second-order effects, whereas the word-pair embedding model considers the second-order effect of specific combinations of word-pairs from Arg1 and Arg2. If we plug a paragraph vector model into the relational phrase embedding model, the model treats the effect of uni-grams within a sentence as a sequence; if we plug in an RNN-LSTM model (Le and Zuidema, 2015), it treats the effect of uni-grams within a sentence as a tree.

Relational Phrase Embedding Features
Phrase embeddings (or sentence embeddings) are distributed representations at a higher level than the word level. We used a paragraph vector model to obtain these phrase embeddings (Le and Mikolov, 2014). Having obtained the phrase embeddings for Arg1 and Arg2 (and connectives), we derived the relational phrase embedding for these triples (or pairs) from their phrase embeddings (Bordes et al., 2013).
The first type of embedding we used in this paper is a combination of the paragraph vector (Le and Mikolov, 2014) and translational embeddings (Bordes et al., 2013). First, an abstraction of each variable Arg1 and Arg2 is built independently in a vertical way, and then the relations among (Arg1, Conn, Arg2) and (Arg1, Arg2) are examined collectively. This is shown in Figure 3. The model has two intermediate embeddings: paragraph vector embeddings of Arg1, Arg2, and Conn, and translational embeddings of (Arg1, Conn, Arg2) and (Arg1, Arg2). We use a paragraph vector model to obtain the features for Arg1 and Arg2 (Le and Mikolov, 2014). The paragraph vector model obtains a real-valued vector with a construction similar to the word vector model (or word2vec) (Mikolov et al., 2013b), where a detailed explanation can be found.
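As a rough illustration of the interface involved, the sketch below maps an argument span to a single fixed-size vector. Note that this is a simplified stand-in: the actual system trains a paragraph vector model, whereas here we simply average pretrained word vectors (the toy vectors and the function name `phrase_embedding` are hypothetical, for illustration only).

```python
import numpy as np

# Toy pretrained word vectors (hypothetical values, for illustration only).
word_vecs = {
    "markets": np.array([0.2, 0.5, -0.1]),
    "fell":    np.array([-0.3, 0.1, 0.4]),
    "sharply": np.array([0.0, -0.2, 0.3]),
}

def phrase_embedding(tokens, vecs, dim=3):
    """Embed an argument span (Arg1 or Arg2) as one fixed-size vector.

    The paper uses a trained paragraph vector model; mean pooling over
    word vectors is a crude stand-in that keeps the same interface:
    a span of tokens goes in, one real-valued vector comes out.
    """
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:                      # fully out-of-vocabulary span
        return np.zeros(dim)
    return np.mean(known, axis=0)

arg1 = phrase_embedding(["markets", "fell"], word_vecs)
```

Whatever model produces the vector, the downstream relational embedding only relies on this span-to-vector interface.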
In implicit/explicit sense classification, the items involved are two for implicit relations, a pair (Arg1, Arg2), and three for explicit relations, a triple (Arg1, Conn, Arg2). This is by nature a multiple-instance learning setting (Dietterich et al., 1997), in which sets of instances are labeled collectively rather than individually. Moreover, linguistic characteristics of discourse relations support this: meaning/sense is attached not to a single argument Arg1 or Arg2 but to a pair (Arg1, Arg2) or a triple (Arg1, Conn, Arg2).
Following Bordes et al. (Bordes et al., 2011; Bordes et al., 2013), we minimized a margin-based ranking criterion over the set of embeddings:

L = Σ_{(Arg1,Conn,Arg2)∈S} Σ_{(Arg1′,Conn,Arg2′)∈S′} [γ + d(Arg1 + Conn, Arg2) − d(Arg1′ + Conn, Arg2′)]_+

where [x]_+ denotes the positive part of x, γ > 0 is a margin hyperparameter, and d is a dissimilarity measure. S′ denotes the set of corrupted triples in which Arg1 or Arg2 is replaced by a random entity (but not both at the same time). Readers can find a detailed explanation in (Bordes et al., 2013).
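The ranking criterion above can be sketched for a single (positive, corrupted) pair of triples as follows. This is a minimal sketch, not the system's training code: we assume squared Euclidean distance for d, and the function name `transe_margin_loss` is ours.

```python
import numpy as np

def transe_margin_loss(a1, conn, a2, a1_c, a2_c, gamma=1.0):
    """Margin-based ranking criterion in the style of TransE.

    d(x, y) is squared Euclidean distance; the corrupted triple
    (a1_c, Conn, a2_c) replaces Arg1 OR Arg2 with a random entity.
    Loss for one pair = [gamma + d(a1 + conn, a2) - d(a1_c + conn, a2_c)]_+
    """
    d_pos = np.sum((a1 + conn - a2) ** 2)
    d_neg = np.sum((a1_c + conn - a2_c) ** 2)
    return max(0.0, gamma + d_pos - d_neg)

a1, conn, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
# Corrupting Arg2 with a distant random entity drives the loss to zero:
loss_easy = transe_margin_loss(a1, conn, a2, a1, np.array([5.0, 5.0]))
# A corrupted triple identical to the positive one pays the full margin:
loss_margin = transe_margin_loss(a1, conn, a2, a1, a2)
```

Training would sum this quantity over S × S′ and update the embeddings by gradient descent, as in (Bordes et al., 2013).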
It should be noted that we also tried indicator functions (alternatively called discrete-valued vectors, bucket functions (Bansal et al., 2014), or binarization of embeddings (Guo et al., 2014)), converted from the real-valued vectors. Since we could not test this sufficiently due to time constraints and did not observe any gain, we did not include this method in our experiments.
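A minimal sketch of the binarization idea is given below, under the assumption of fixed symmetric thresholds; Guo et al. (2014) derive thresholds from the per-dimension distribution instead, and the thresholds and function name here are illustrative only.

```python
import numpy as np

def binarize_embedding(vec, pos_thresh=0.5, neg_thresh=-0.5):
    """Convert a real-valued embedding into indicator (bucket) features.

    Each dimension maps to one of three buckets: +1 above pos_thresh,
    -1 below neg_thresh, and 0 otherwise. The discrete vector can then
    be used as categorical features in an off-the-shelf classifier.
    """
    out = np.zeros(len(vec), dtype=int)
    out[vec > pos_thresh] = 1
    out[vec < neg_thresh] = -1
    return out

buckets = binarize_embedding(np.array([0.9, -0.7, 0.1]))
```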

Production Features for Constituent Parsing
(Lin et al., 2014) describes the method using production features based on constituent parsing results. Table 2: Extraction of production features for constituent parsing results.
In this paper, we further process these and treat them as phrase embeddings. The algorithm is as follows. First, the subsets of the (constituent) parsing results that correspond to Arg1 and Arg2 are extracted. Then, all the production rules for these subtrees are derived. Third, we feed these production rules into the relational phrase embedding model described in Section 2.1, replacing all the words in Section 2.1 with production rules.
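The second step, deriving all production rules from an argument subtree, can be sketched as below. The tree encoding (label, children) and the function name `productions` are our own; a parser toolkit such as NLTK provides an equivalent operation on its tree objects.

```python
def productions(tree):
    """Collect production rules from a constituent subtree.

    A tree is a pair (label, children), where each child is either a
    subtree or a plain string (a terminal word). Every internal node
    contributes one rule of the form "LHS -> child labels", mirroring
    the production-feature extraction of (Lin et al., 2014).
    """
    label, children = tree
    rhs = [c if isinstance(c, str) else c[0] for c in children]
    rules = [f"{label} -> {' '.join(rhs)}"]
    for c in children:
        if not isinstance(c, str):
            rules.extend(productions(c))
    return rules

# Hypothetical subtree covering one argument span.
arg_tree = ("S", [("NP", [("NN", ["markets"])]),
                  ("VP", [("VBD", ["fell"])])])
rules = productions(arg_tree)
```

In the paper's setting, the resulting rule strings simply take the place of words as the "tokens" fed to the relational phrase embedding model.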

Word-Pair Features
Word-pair features in discourse parsing are the Cartesian products of all combinations of words in Arg1 and Arg2. This feature is used in (Lin et al., 2014). (Rutherford and Xue, 2014) further developed this method in combination with Brown clustering (Brown et al., 1992). We adapt this as a word-pair embedding.
The second type of embedding we used in this paper is an abstraction of the word-pair embedding over Arg1 and Arg2 (and Conn) in a horizontal way. This is shown in Figure 4. Words are expanded into bi-grams as the Cartesian product of elements of Arg1 and Arg2, ordered from Arg1 to Arg2, and each such bi-gram is embedded as a word embedding. Following Pitler et al. (Pitler et al., 2008), we use the 100 most frequent word-pairs in the training set for each category of relation. We did not delete function words/stop-words.
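The pair extraction and frequency cutoff can be sketched as follows. The data layout and the function name `top_word_pairs` are assumptions for illustration; the important points from the text are preserved: the Cartesian product is ordered from Arg1 to Arg2, the top pairs are kept per relation sense, and stop-words are not removed.

```python
from collections import Counter
from itertools import product

def top_word_pairs(examples, k=100):
    """Return the k most frequent (w1, w2) pairs per relation sense.

    Each example is (arg1_tokens, arg2_tokens, sense). Pairs are drawn
    from the Cartesian product Arg1 x Arg2, keeping the Arg1 -> Arg2
    order. Function words and stop-words are deliberately kept.
    """
    counts = {}
    for arg1, arg2, sense in examples:
        c = counts.setdefault(sense, Counter())
        c.update(product(arg1, arg2))
    return {sense: [pair for pair, _ in c.most_common(k)]
            for sense, c in counts.items()}

examples = [(["it", "rained"], ["we", "stayed"], "Contingency"),
            (["it", "rained"], ["we", "left"], "Contingency")]
pairs = top_word_pairs(examples, k=3)
```

Each surviving pair is then treated as a bi-gram token to be embedded, per the description above.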

Heuristic Features for Explicit Connectives
Heuristic features in this paper are the specific features used in explicit sense classification: (1) the connective, (2) the POS of the connective, and (3) the connective + previous word (Lin et al., 2014). These three features are employed to resolve the ambiguity of discourse connectives, and they work effectively in practice.
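Extracting these three features is straightforward; a sketch is given below. The token/tag list layout and the function name `explicit_features` are assumptions for illustration only.

```python
def explicit_features(tokens, pos_tags, conn_index):
    """The three heuristic features for explicit sense classification:
    (1) the connective itself, (2) its POS tag, and
    (3) connective + previous word (empty at sentence start).
    """
    conn = tokens[conn_index]
    prev = tokens[conn_index - 1] if conn_index > 0 else ""
    return {"conn": conn,
            "conn_pos": pos_tags[conn_index],
            "conn_prev": f"{prev}_{conn}"}

feats = explicit_features(["prices", "rose", "but", "volume", "fell"],
                          ["NNS", "VBD", "CC", "NN", "VBD"], 2)
```

For instance, the previous-word feature helps separate senses of a connective like "but" whose context, rather than surface form, carries the disambiguating signal.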

Experimental Settings
For the dataset, we used the CoNLL 2015 shared task data set, i.e. LDC2015E21, and Skip-gram neural word embeddings (Mikolov et al., 2013a). For the unofficial run, we used the Westbury version of the English Wikipedia dump (as in Figure 2) and the WMT14 data set. We chose Python as the language to develop our discourse parser. We use external tools such as libSVM (Chang and Lin, 2011), liblinear (Fan et al., 2008), wapiti (Lavergne et al., 2010), and a maximum entropy model for the classification task described in Section 2. Among these off-the-shelf classifiers, we used libSVM for the official results. Additionally, we use word2vec (Mikolov et al., 2013b) and Theano (Bastien et al., 2012) in the pipeline.
One bottleneck of our system was the training procedure. Since a paragraph vector is currently not incrementally trainable, we were not able to separate the training and test phases. Hence, we needed to run everything on TIRA, whose computing resources are limited: a run took a considerable time, around 15 to 30 minutes, while most other participants finished their runs in 30 seconds or so. Table 3 shows our results. There are fifteen columns: the nine columns on the left show the overall task, while the six columns on the right show the supplementary task. In terms of the evaluation for explicit connectives, we obtained F scores of 0.138, 0.108, and 0.077 on the dev/test/blind sets for the overall task (the lowest row in the second group), while we obtained an F score of 0.707 for the sense classification task. For the connectives, the F score was 0.863, while Arg1-2 was 0.186, which was fairly low. This may result from the policy of the evaluation script, which checks the classification results together with the correct identification of the triples (Arg1, Conn, Arg2). Hence, even if the classification result was correct, it was not counted as correct unless the triple (Arg1, Conn, Arg2) was also correctly identified. This explains the big difference between the overall task (left nine columns) and the sense classification task (right three columns), as well as the low scores of 0.138, 0.108, and 0.077.
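The all-or-nothing policy of the evaluation script can be sketched as follows. This is our own simplification (the span encoding and the function name `strict_score` are hypothetical), but it shows why correct senses on wrongly identified triples score nothing.

```python
def strict_score(gold, predicted):
    """Sense counts as correct only when the whole triple
    (Arg1, Conn, Arg2) was identified exactly.

    Items are ((arg1_span, conn_span, arg2_span), sense) tuples,
    with spans as (start, end) character offsets.
    """
    gold_map = dict(gold)
    correct = sum(1 for triple, sense in predicted
                  if gold_map.get(triple) == sense)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = [(((0, 5), (6, 9), (10, 20)), "Comparison")]
# Right sense, but Arg1 span off by one token boundary -> scores zero.
pred_wrong_span = [(((0, 4), (6, 9), (10, 20)), "Comparison")]
```

Under this policy, argument identification errors dominate the overall score even when the sense classifier itself is accurate.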

Experimental Results
For the implicit-only evaluation, in contrast, we obtained F scores of 0.019, 0.025, and 0.016 (the lowest row in the third group) for the overall task and 0.105 for the sense classification task. Here, precision was high (0.699, 0.667, and 0.598 for the overall task, and 0.803 and 0.891 for the sense classification task) while recall was very low.
Table 4: Results for the dev set (official and unofficial results). Implicit only includes Implicit, EntRel, and AltLex. This experiment uses the development set. The rightmost column, Implicit(30m), shows the results with additional data of 30M sentence pairs using the same setting as Figure 2.
Table 5 shows the detailed results for sense classification under the setting that the identification of connectives and arguments is correct. The first group (the left three columns) shows the results for explicit classification. In contrast to implicit classification, all the figures are considerably good except Comp.Conc, whose F score was 0.14. The second to fifth groups show four configurations of implicit classification. The third group adds 30 million additional sentence pairs for training, the fourth group uses the production feature, and the fifth group uses the word-pair feature. These three groups each exposed their characteristics quite clearly. Relational phrase embeddings (Implicit and Implicit(30m)) work for the Expansion group (Exp.Conj, Exp.Inst, Exp.Rest), the production feature (marked as Implicit(prod)) worked for the Temporal group (Temp.As.Pr and TempSyn), and the word-pair feature (marked as Implicit(wp)) worked for the Comparison/Contingency groups. The effect of additional data is shown in the third group (marked as Implicit(30m)): the additional 30M sentence pairs improved the performance on Exp.Conj (from F score 0.07 to 0.18) and Exp.Inst (from F score 0.00 to 0.12), while Exp.Rest went down from F score 0.07 to 0.02. The effect was limited to these categories.
It is easily observed that if the surface form of a connective does not carry multiple senses, such as if (67%) in Cont.Cond and instead (87%) in Exp.Alt.C, sense classification performed well: Cont.Cond obtained an F score of 0.94 and Exp.Alt.C an F score of 0.91. If the surface form of a connective carries multiple senses, the classifications tend to be unbalanced, with one sense collecting most of the votes. For example, but has multiple senses, including Comp.Conc, Comp, and Comp.Cont; Comp.Cont collected most of the votes, so the classification results for Comp.Cont were good while those for the other senses were bad.

Discussion
A paragraph vector has proven useful for sentiment-analysis-type tasks (Le and Mikolov, 2014). Our expectation was that the averaged embedding over a sentence would establish meaning in the intermediate representation, capturing the characteristics of Arg1, Arg2, and Conn. First, Comp.Cont or Comp.Conc may involve sentence polarity, with the additional condition that these polarities may be reversed. Against our expectation, only a handful of examples were classified into these categories. However, when they were classified into these categories they were correct, i.e. precision was 1. Second, if Arg1 and Arg2 are required to expose a causal relation, as in Cont.Cau.Rea and Cont.Cau.Res, this may be beyond the framework of a paragraph vector. Third, our implicit classification managed to classify Exp.Conj and Exp.Rest. Both of these relation categories bear some similarity to sentiment analysis/polarities, which makes it reasonable that the model worked for them. Fourth, interestingly, the word-pair feature works for the Comparison/Contingency sense groups, while the production feature works (only slightly, though) for the Temporal sense group.
We used a margin-based ranking criterion to obtain relations over paragraph vectors. First, (Mikolov et al., 2013b) observed a linear relation between two word embeddings. However, it may be too strong an expectation that two paragraph embeddings capture a similar phenomenon. Even if Arg1 consists of many words, a paragraph vector effectively averages their word embeddings. In this sense, this approach may have a crucial limitation, together with the fact that it is unsupervised learning. Second, although we do not know yet, some small trick may improve the relations of Comp.Cont or Comp.Conc, since these relations are quite similar to Exp.Conj, Exp.Instantiation, and Exp.Rest except that their polarities are reversed.

Conclusion
This paper describes the discourse parsing system developed at Dublin City University for participation in the CoNLL 2015 shared task. We take an approach based on a paragraph vector. One shortcoming was that our classifier was effective only for Exp.Conj, Exp.Inst, and Exp.Rest, despite our expectation that this model would also work for Comp.Cont and Comp.Conc, whose relations point in the opposite direction. We provided the word-pair model, which works for these categories, but from a different perspective.
Further work includes a mechanism to make the model work for Comp.Cont and Comp.Conc. Although a paragraph vector did not work efficiently, our current model lacks interaction between relational, paragraph, and word embeddings, such as in (Denil et al., 2015); adding this is one immediate challenge. Another challenge is the replacement of the paragraph vector model with a convolutional sentence vector model (Kalchbrenner et al., 2014) or an RNN-LSTM model (Le and Zuidema, 2015). The former approach moves to supervised learning instead of unsupervised learning; the latter employs a tree structure instead of a sequence.