A Shallow Discourse Parsing System Based On Maximum Entropy Model

This paper describes our system for Shallow Discourse Parsing, the CoNLL 2015 Shared Task. We regard this as a classification task and build a cascaded system based on Maximum Entropy to identify the discourse connective, the spans of its two arguments and the sense of the discourse connective. We trained the cascaded models with a variety of features such as lexical and syntactic features. We also report the results achieved by our team.


Introduction
Discourse parsing is one of the most challenging tasks in the field of natural language processing (NLP). It focuses on parsing the structure of a piece of text into a set of discourse relations between sentences. There is considerable interest in discourse parsing, both as an end in itself and as an intermediate step in a variety of NLP applications such as question answering (Verberne et al., 2007), text summarization (Louis et al., 2010), and sentiment analysis and opinion mining (Somasundaran, 2010).
Many approaches have been proposed for identifying discourse relations, and data-driven approaches dominate. A number of pioneering studies treat discourse relation identification as a classification task (Marcu and Echihabi, 2002; Pitler et al., 2009; Duverle and Prendinger, 2009) by constructing lexical, syntactic and constituent features. Others treat argument segmentation as a semantic role detection task (Wellner and Pustejovsky, 2007) or as a sequence labeling task (Ghosh et al., 2011). However, some of the previous research is based on different corpora and lacks a common evaluation data set. This problem has been addressed by the release of the Penn Discourse Treebank (PDTB) 2.0 corpus (Prasad et al., 2008), which provides detailed annotations of discourse relations and argument spans. Much subsequent research on discourse parsing works on the PDTB (Prasad et al., 2010; Lin et al., 2009) and pays more attention to the "harder" part: labelling the arguments. Lin et al. (2014) designed an end-to-end discourse parser on the PDTB that covers explicit sense classification, implicit sense classification and argument span identification.
Shallow Discourse Parsing (Xue et al., 2015) is this year's CoNLL shared task. It takes a piece of newswire text as input and returns, in JSON format, all the discourse relations, each in the form of a discourse connective (explicit or implicit) taking two arguments (which can be clauses, sentences, or multi-sentence segments). A relation is counted as correct only if the explicit discourse connective (e.g., "because", "however"), when there is one, the spans of text that serve as its two arguments, and the sense (e.g., "Comparison") are all correct. The evaluation metric is the F1 score of the parser.
In this paper, we describe the details of our system in Section 2, and the evaluation results and subsequent experiments in Section 3. Finally, we draw some conclusions in Section 4.
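The all-or-nothing matching criterion can be illustrated with a minimal sketch. It assumes that each relation is represented as a hashable tuple (connective, Arg1 token indices, Arg2 token indices, sense); this representation and the function name f1_score are illustrative assumptions, and the official scorer may differ in its details.

```python
def f1_score(predicted, gold):
    """Exact-match precision, recall and F1 over two collections of relations.

    Each relation is assumed to be a hashable tuple
    (connective, arg1_token_ids, arg2_token_ids, sense); a predicted
    relation only counts as correct if all four parts match a gold one.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Under this criterion, a relation whose Arg2 span is off by a single token contributes nothing to the score, even when its connective and sense are right.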

Resources
The resources used in our system are as follows:
Labeled training and development data: The training and development (dev) data are derived from Sections 2-21 and Section 22 of the PDTB 2.0, respectively, and are provided in JSON format. There are 32535 relations and 1436 relations annotated in the training data and the dev data, respectively. There are 15 valid senses, including the second-level "types" as well as a selected number of third-level "subtypes". Table 2 shows the distribution of the 15 senses in the data.
Table 2: Distribution of the 15 senses from the different data sets. A.P, A.S and A.C are the abbreviations of "Asynchronous.Precedence", "Asynchronous.Succession" and "Alternative.Chosen alternative", respectively.
Test data: There are two test data sets. One is the blind set, which contains 20,000 to 30,000 words of newswire text annotated following the PDTB annotation guidelines. The other test set is Section 23 of the PDTB, which is used for comparison with previous work.
The connectives list: A list containing the 100 discourse connectives in the PDTB and three syntactic categories from (Knott, 1996).
Opennlp-maxent: We used the open source package Opennlp-maxent to construct the classification models.

System overview and Features
Our system mainly follows the work of Lin et al. (2014) and consists of two parts: the explicit relation parser and the non-explicit relation parser. The explicit relation parser is composed of the connective classifier, the argument position classifier, the argument extractor and the explicit sense classifier, while the non-explicit relation parser contains the AltLex classifier and the implicit classifier. The structure of our system is shown in Figure 1.
The set of features used in our system is listed in Table 3. All the features fall into four classes: lexical features, part-of-speech (POS) features, syntactic features and positional features.
Table 3: The features used in our system. "C" denotes the connective. N denotes the current node in the constituent tree used in Section 2.3.2.
• Lexical features: The lexical features (F1-F10) contain the connectives C, their contextual words and word-pair features (i.e., F7 (w_i, w_j), where w_i is a word from Arg1 and w_j is a word from Arg2).
• Position features: F27 is the relative position in the syntactic tree structure (left, middle or right), while F28 is the position of the connective in the sentence (start, middle or end).
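The word-pair feature F7 described in the lexical features above can be sketched as the cross product of the two argument token lists. This is a minimal sketch: the function name word_pair_features, the lower-casing and the "|" separator are illustrative assumptions, not details given in this paper.

```python
from itertools import product


def word_pair_features(arg1_tokens, arg2_tokens):
    """Return the set of word pairs (w_i, w_j) with w_i from Arg1 and w_j from Arg2.

    Lower-casing and the "w_i|w_j" string encoding are assumptions made
    for this sketch only.
    """
    return {
        f"{w_i.lower()}|{w_j.lower()}"
        for w_i, w_j in product(arg1_tokens, arg2_tokens)
    }
```

The number of distinct pairs grows with the product of the argument lengths, so the resulting feature space is large and sparse.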

The Connective Classifier
Every occurrence of the 100 connectives in a discourse was extracted as a candidate, whether or not it functioned as a discourse connective. We converted all upper-case letters in the candidates to lower case. The connective classifier then decides whether each candidate functions as a discourse connective.
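The candidate extraction step can be sketched as a case-insensitive match of the connectives list against the token sequence. This is only an illustration under stated assumptions: the function name find_connective_candidates and the (start, end, text) output format are hypothetical, and discontinuous connectives (e.g., "if ... then") are not handled.

```python
def find_connective_candidates(tokens, connectives):
    """Return (start, end, text) for every token span that matches a listed connective.

    Matching is done on lower-cased tokens; `end` is exclusive.
    Every match is returned, including overlapping ones, because the
    decision whether a candidate is a discourse connective is left to
    the classifier.
    """
    lowered = [t.lower() for t in tokens]
    # Pre-split the list entries so that multi-word connectives can be matched.
    patterns = [tuple(c.lower().split()) for c in connectives]
    candidates = []
    for start in range(len(lowered)):
        for pattern in patterns:
            end = start + len(pattern)
            if tuple(lowered[start:end]) == pattern:
                candidates.append((start, end, " ".join(pattern)))
    return candidates
```

Each returned candidate would then be passed to the connective classifier together with its features.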

The Argument Labeller
Once the connective is identified, the argument labeller identifies the Arg1 and Arg2 spans of this instance. This is accomplished in two steps: (1) classifying the location of Arg1 with the Argument Position Classifier; (2) labelling the spans of Arg1 and Arg2 with the Argument Extractor.
The Argument Position Classifier: Normally Arg2 immediately follows the connective, while the position of Arg1 is uncertain. In this model, we classified the location of Arg1 into two classes: Arg1 is located in the same sentence as the connective (SS) or in the sentence preceding the connective (PS) (Prasad et al., 2008).
We implemented this as a binary classification task. In this step, features F1-F5, F11, F13, F16-F17, F28 in Table 3 were adopted to train the model. After the position label of Arg1 was determined, the result was passed to the argument extractor.
The Argument Extractor: For the PS case, our classifier immediately labelled the previous sentence as Arg1. For the SS case, the argument spans were extracted as described below.
• Label a node as an Arg1-node once its predicted Arg1-node probability is greater than 0.1 (a value tuned on the dev data set).
• Select only one Arg1-node and one Arg2-node per instance, namely the nodes with the maximal probability of the respective label.
• Extract the Arg1 and Arg2 spans by tree subtraction. If the Arg1 node is the ancestor of the Arg2 node, the span of Arg2 should be subtracted from the Arg1 span, and vice versa.
• Remove punctuation tokens and connectives from the argument spans.
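The SS-case steps above can be sketched as follows. This is a simplified illustration under several assumptions: each constituent node is reduced to the set of token indices it covers plus two classifier probabilities, ancestry is approximated by strict containment of token sets, and tree subtraction is read as removing the descendant node's tokens from the ancestor's span. The function name extract_spans and the dictionary keys are hypothetical. The Arg2-node is taken as the node with the highest Arg2 probability, since only the Arg1 threshold is stated in the text.

```python
def extract_spans(nodes, connective_ids, punctuation_ids, arg1_threshold=0.1):
    """Pick one Arg1-node and one Arg2-node and return their token index lists.

    nodes: list of dicts with keys 'tokens' (set of token indices),
           'p_arg1' and 'p_arg2' (classifier probabilities).
    Returns (arg1_ids, arg2_ids), or None if no node passes the Arg1 threshold.
    """
    # Step 1: keep nodes whose Arg1 probability exceeds the threshold.
    arg1_candidates = [n for n in nodes if n["p_arg1"] > arg1_threshold]
    if not arg1_candidates:
        return None
    # Step 2: one node per label, chosen by maximal probability.
    arg1 = max(arg1_candidates, key=lambda n: n["p_arg1"])
    arg2 = max(nodes, key=lambda n: n["p_arg2"])
    span1, span2 = set(arg1["tokens"]), set(arg2["tokens"])
    # Step 3: tree subtraction; the ancestor's span strictly contains the
    # descendant's, so the descendant's tokens are removed from it.
    if span2 < span1:
        span1 -= span2
    elif span1 < span2:
        span2 -= span1
    # Step 4: drop connective and punctuation tokens from both spans.
    drop = set(connective_ids) | set(punctuation_ids)
    return sorted(span1 - drop), sorted(span2 - drop)
```

For the tokens "He left because it rained ." (indices 0 to 5), a sentence node chosen as Arg1 and a clause node covering "because it rained" chosen as Arg2 yield Arg1 = "He left" and Arg2 = "it rained". The case where the same node wins both labels is not handled in this sketch.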

The Explicit Sense Classifier
After recognizing the discourse connective and its two argument spans, the next step is to decide the sense of the connective, using features listed in Table 3. We picked the output whose maximal sense probability is greater than 0.45, a value determined empirically on the dev data set.

The AltLex classifier
We extracted all adjacent sentence pairs within each paragraph and removed the pairs that were identified by the explicit relation parser. Then we trained the AltLex classifier, which decided whether the pairs were AltLex pairs and classified their senses with features F8-F10 in Table 3. The pairs labelled as non-AltLex relations were passed on to the implicit relation classifier.
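The construction of the non-explicit candidates can be sketched as below. This is a minimal sketch that assumes sentences are identified by integer ids grouped by paragraph, and that the explicit relation parser reports the sentence pairs it has already covered; the function name non_explicit_candidates and both input formats are hypothetical.

```python
def non_explicit_candidates(paragraphs, explicit_pairs):
    """Return adjacent sentence pairs that the explicit parser did not cover.

    paragraphs: list of paragraphs, each a list of sentence ids in order.
    explicit_pairs: set of (left_id, right_id) pairs already assigned an
                    explicit relation.
    Pairs are never formed across a paragraph boundary.
    """
    candidates = []
    for sentences in paragraphs:
        for left, right in zip(sentences, sentences[1:]):
            if (left, right) not in explicit_pairs:
                candidates.append((left, right))
    return candidates
```

With paragraphs [[0, 1, 2], [3, 4]] and an explicit relation found for the pair (1, 2), the remaining candidates are (0, 1) and (3, 4); the pair (2, 3) is excluded because it crosses a paragraph boundary.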

The Implicit relation classifier
The implicit relation classifier classified the sense of each pair into one of the 15 valid senses or NoRel with features F7 and F25-F26 in Table 3. After prediction, we kept the implicit discourse relations whose maximal sense probability was greater than a threshold (0.25 in our case), which was determined on the dev data set.
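The probability-threshold filter can be sketched as follows, assuming the classifier output is available as a mapping from sense labels to probabilities. The function name keep_confident is hypothetical, and discarding pairs whose top label is NoRel is an assumption of this sketch rather than a detail stated above.

```python
def keep_confident(probabilities, threshold=0.25, null_label="NoRel"):
    """Return (sense, probability) for a confident prediction, else None.

    probabilities: dict mapping each sense label (and the null label)
                   to its predicted probability.
    A pair is kept only if its most probable label is a real sense and
    the probability of that label is strictly greater than the threshold.
    """
    sense, prob = max(probabilities.items(), key=lambda item: item[1])
    if sense == null_label or prob <= threshold:
        return None
    return sense, prob
```

Raising the threshold trades recall for precision, which is why its value would be tuned on held-out data.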

Experiments and Results
There are two test data sets this year, as described in Section 2.1, and the organizers reported the results on the two test data sets and the dev data set. The results obtained by our system are shown in Table 4. We ranked 10th on every data set. After the evaluation deadline, we made some improvements to the implicit relation classifier, inspired by Lin et al. (2009). We applied feature selection to the word-pair features (F7), whereas experiments on the dev data set showed a slight degradation in F1 score when feature selection was applied to the constituent rules and the dependency rules (F25, F26).
We computed the mutual information between each word-pair feature and the 15 valid senses and then selected the top N word pairs as features. Table 5 shows the improvement for different values of N.
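The selection step can be sketched as below. This is a minimal sketch, assuming that mutual information is computed between the binary presence of a word pair in an instance and the sense label of that instance; the exact formulation used in our experiments is not spelled out above, and the function names and data layout are illustrative.

```python
import math
from collections import Counter


def mutual_information(instances, feature):
    """Mutual information (in nats) between presence of `feature` and the sense.

    instances: list of (feature_set, sense) pairs.
    """
    total = len(instances)
    joint = Counter((feature in feats, sense) for feats, sense in instances)
    presence_counts, sense_counts = Counter(), Counter()
    for (present, sense), count in joint.items():
        presence_counts[present] += count
        sense_counts[sense] += count
    mi = 0.0
    for (present, sense), count in joint.items():
        # p(f, s) * log( p(f, s) / (p(f) * p(s)) ), written with raw counts.
        ratio = count * total / (presence_counts[present] * sense_counts[sense])
        mi += (count / total) * math.log(ratio)
    return mi


def select_top_n(instances, n):
    """Return the n features with the highest mutual information.

    Ties are broken alphabetically so that the result is deterministic.
    """
    vocabulary = set().union(*(feats for feats, _ in instances))
    ranked = sorted(vocabulary, key=lambda f: (-mutual_information(instances, f), f))
    return ranked[:n]
```

A word pair that occurs in every instance carries no information about the sense and scores zero, whereas a pair that occurs only with one sense scores high and is retained.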

Conclusion
We divided the complex task of discourse parsing into a set of classification subtasks and glued them together. A variety of features, including lexical, part-of-speech, syntactic and positional features, were employed to train the baseline with an open-source Maximum Entropy package, and the system was then improved by setting thresholds on the output probabilities. We did not utilize any additional resources and used only the annotations provided by the organizers.
Our system ranked 10th among seventeen teams on the two test data sets.