Finding Arguments as Sequence Labeling in Discourse Parsing

This paper describes our system for the CoNLL-2016 Shared Task on Shallow Discourse Parsing for English. We adopt a cascaded framework consisting of nine components, of which six are cast as sequence labeling tasks and the remaining three are treated as classification problems. All our sequence labeling and classification models are linear models trained with the averaged perceptron. Our feature sets are mostly borrowed from previous work. The main focus of our effort is to recall cases in which Arg1 is located in a sentence far before the connective phrase, with some, albeit limited, success.


General Description
This paper describes our participating system for the CoNLL-2016 shared task on discourse parsing (Xue et al., 2016). We participate in the closed track and, due to time limitations, focus on English. Given a document, which contains several paragraphs, each composed of a few sentences, discourse parsing aims to identify explicit and non-explicit discourse relations, including explicit connective phrases (CPs), explicit/non-explicit arguments, and senses. Figure 1 presents a graphical illustration of the task.
Following the official requirements, we use Sections 2-21 of the PDTB 2.0 (Prasad et al., 2008; Prasad et al., 2014) as the training data, Section 22 as the development data, and Section 23 as the test data. A blind test set is also used for evaluation. Table 1 presents the data statistics.
Due to the complexity of the task, our system follows previous practice and employs a cascaded framework comprising 9 components, as shown in Figure 2. In the following sections, we introduce each component in detail. The code is released at http://hlt.suda.edu.cn/ zhli for future research.

Classification and Sequence Labeling Based on Linear Model
In this work, we implement our classification and sequence labeling models as linear models, due to their simplicity and good performance on a variety of natural language processing tasks (Collins, 2002). Given an input instance x and a label y, a linear model defines the score of labeling x as y:
$$\mathrm{Score}(x, y) = \mathbf{w} \cdot \mathbf{f}(x, y)$$
where f(.) is a feature vector constructed according to a hand-crafted feature template list and w is the corresponding feature weight vector. The decoding task in the linear model is to find the maximum-scoring label:
$$\hat{y} = \arg\max_{y} \mathrm{Score}(x, y)$$
To learn w, we use the standard online training procedure, which uses one instance at a time for a feature weight update:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \mathbf{f}(x, y^{*}) - \mathbf{f}(x, \hat{y})$$
where t is the global time of feature weight updates (i.e., the number of instances used for feature weight updates so far); ŷ is the best label according to the current feature weights w^(t); and y* is the gold-standard label. In this sense, online training is also known as decoding-based training, meaning that decoding is invoked during training.
Following Collins (2002), after training, we use the averaged feature weights $\frac{1}{T}\sum_{t=1}^{T} \mathbf{w}^{(t)}$ for final evaluation, which is known as the averaged perceptron.
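The training procedure above can be illustrated with a minimal sketch of an averaged perceptron classifier. This is not our released implementation: the feature template in `extract_features`, the class name, and the toy data are hypothetical, and the averaging is done naively for readability rather than speed.

```python
from collections import defaultdict


def extract_features(x, y):
    # Hypothetical feature template: conjoin each input token with the label.
    return [f"{tok}|{y}" for tok in x]


class AveragedPerceptron:
    def __init__(self, labels):
        self.labels = list(labels)
        self.w = defaultdict(float)      # current weights w^(t)
        self.total = defaultdict(float)  # running sum of w^(t) over all steps
        self.t = 0                       # number of instances seen so far

    def score(self, x, y, weights=None):
        weights = self.w if weights is None else weights
        return sum(weights.get(f, 0.0) for f in extract_features(x, y))

    def predict(self, x, weights=None):
        # Decoding: return the maximum-scoring label (first one wins on ties).
        return max(self.labels, key=lambda y: self.score(x, y, weights))

    def update(self, x, gold):
        guess = self.predict(x)
        if guess != gold:
            # w <- w + f(x, y*) - f(x, y_hat)
            for f in extract_features(x, gold):
                self.w[f] += 1.0
            for f in extract_features(x, guess):
                self.w[f] -= 1.0
        self.t += 1
        for f, v in self.w.items():
            self.total[f] += v

    def averaged(self):
        # Averaged weights: sum_t w^(t) / T.
        return {f: v / self.t for f, v in self.total.items()}


# Toy usage with made-up instances.
toy_data = [(["so", "that"], "yes"), (["and", "then"], "no")]
model = AveragedPerceptron(["no", "yes"])
for _ in range(3):
    for x, y in toy_data:
        model.update(x, y)
avg_weights = model.averaged()
```

When the prediction is already correct, the two feature vectors cancel, so skipping the update in that case is equivalent to applying it.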
For sequence labeling tasks, y is a sequence of labels instead of a single label. Besides the many unigram features that consider only the label at the current position, as used in multi-class classification tasks, we also use label-transition bigram features in our sequence labeling models. The training procedure is nearly the same as for classification problems, except that a dynamic programming based decoding algorithm is needed for exact search of the optimal label sequence ŷ.
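One way to write the sequence score, assuming a first-order model over a sentence of n positions (the split into unigram and bigram feature functions, and the symbol δ, are notation introduced here for illustration), is:

```latex
\mathrm{Score}(x, y) = \sum_{i=1}^{n} \mathbf{w} \cdot \mathbf{f}_{\mathrm{uni}}(x, i, y_i)
                     + \sum_{i=2}^{n} \mathbf{w} \cdot \mathbf{f}_{\mathrm{bi}}(x, i, y_{i-1}, y_i)

\delta_1(y) = \mathbf{w} \cdot \mathbf{f}_{\mathrm{uni}}(x, 1, y), \qquad
\delta_i(y) = \mathbf{w} \cdot \mathbf{f}_{\mathrm{uni}}(x, i, y)
            + \max_{y'} \left[ \delta_{i-1}(y') + \mathbf{w} \cdot \mathbf{f}_{\mathrm{bi}}(x, i, y', y) \right]
```

Under this decomposition, the best sequence score is the maximum of δ_n(y) over y, and the optimal sequence is recovered by following back-pointers, which is the usual Viterbi recursion.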

CP Identification
Given an input document, the first task is to extract all connective phrases (CPs) (e.g., "so that") in the document, which we refer to as CP identification. We directly adopt the method described in previous work (Wang and Lan, 2015; Kong et al., 2015), and take two steps for this task.
1. Candidate CP extraction. We extract all candidate CPs in the input document by exact matching against a phrase dictionary. If a string in a sentence exactly matches a phrase in the dictionary, it is considered a candidate CP and is verified in the second step. The dictionary is provided by the official organizers and contains 100 phrases.
2. CP classification. In this step, we use a statistical classifier based on the linear model to check whether each candidate CP functions as a CP or not.
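The candidate extraction step can be sketched as follows. The function name, the token-level matching, and the lowercasing are assumptions made for illustration; the three-entry dictionary is a toy stand-in for the official 100-phrase list.

```python
def extract_candidate_cps(tokens, dictionary):
    """Return (start, end) token spans whose text exactly matches a dictionary phrase.

    Assumption: matching is case-insensitive and all matching spans are kept,
    including overlapping ones; the classifier in step 2 decides which to keep.
    """
    phrases = {tuple(p.lower().split()) for p in dictionary}
    if not phrases:
        return []
    longest = max(len(p) for p in phrases)
    lowered = [t.lower() for t in tokens]
    spans = []
    for start in range(len(lowered)):
        for length in range(1, longest + 1):
            end = start + length
            if end > len(lowered):
                break
            if tuple(lowered[start:end]) in phrases:
                spans.append((start, end))
    return spans


toy_dictionary = {"so", "so that", "because"}
toy_tokens = "He left early so that he could rest".split()
candidates = extract_candidate_cps(toy_tokens, toy_dictionary)
```

Here both "so" and "so that" are returned as candidates for the same position, which leaves the final decision to the classification step.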
We directly borrow and merge the features proposed in Lin et al. (2014) and , as listed in Table 2. We spent little time on feature engineering, since we found that our model achieves accuracy similar to last year's best system (Wang and Lan, 2015) using these features. On the dev data, our CP identification method achieves 95.23% precision, 93.96% recall, and a 94.59% F score. Figure 3 gives an example parse tree to better illustrate the features.
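As a quick arithmetic check, assuming the F score is the usual harmonic mean of precision and recall, the reported figures are mutually consistent:

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall (F1).
    return 2 * precision * recall / (precision + recall)


cp_f = f_score(95.23, 93.96)
print(round(cp_f, 2))  # 94.59
```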

Explicit-Arg1 Sentence Locator: Sequence Labeling
As far as we know, most of last year's participating systems assume that Arg1 lies either in the same sentence as the CP or in the previous sentence. However, we find many cases in which Arg1 is located in a sentence farther away from the CP. Table 3 shows data statistics regarding the sentence-level distance between Arg1 and the CP. We also find cases in which Arg1 spans more than one sentence, and these sentences may be discontinuous, as shown in Table 4. However, for simplicity, in this work we discard training instances in which Arg1 spans more than one sentence.
2. None yes: the current sentence does not contain Arg1, but some sentence to its right does contain Arg1.
3. None no: neither the current sentence nor any sentence to its right contains Arg1.
Using such a label set, we can conveniently use constrained decoding to prevent the model from returning a sequence in which Arg1 occurs more than once. The idea is that during decoding we disallow a set of illegal transitions: {Arg1 yes → Arg1 yes, Arg1 yes → None no, None yes → None no, None yes → Arg1 yes}.
Table 6: Result analysis of different Explicit-Arg1 sentence locators on dev data. We report the distribution of the outputs of each model in terms of the distance between the predicted sentence containing Arg1 and the sentence with the CP, where numbers in parentheses count correct predictions according to the gold-standard answers.
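Constrained decoding with the illegal transition set given above can be sketched as a Viterbi search that simply skips forbidden label pairs. This is an illustrative sketch, not our released code: the underscore label names, the score dictionaries, and the toy scores are assumptions, and a pair (p, y) means that y immediately follows p in decoding order, whatever direction that order takes over the sentences.

```python
LABELS = ["Arg1_yes", "None_yes", "None_no"]
ILLEGAL = {
    ("Arg1_yes", "Arg1_yes"),
    ("Arg1_yes", "None_no"),
    ("None_yes", "None_no"),
    ("None_yes", "Arg1_yes"),
}


def constrained_viterbi(unigram, bigram, labels=LABELS, illegal=ILLEGAL):
    """unigram[i][y]: score of label y at step i; bigram[(p, y)]: transition score.

    Missing entries count as 0. Returns the best label sequence that contains
    no illegal transition.
    """
    n = len(unigram)
    if n == 0:
        return []
    neg_inf = float("-inf")
    best = [{y: unigram[0].get(y, 0.0) for y in labels}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            cand_score, cand_prev = neg_inf, None
            for p in labels:
                if (p, y) in illegal or best[i - 1][p] == neg_inf:
                    continue
                s = best[i - 1][p] + bigram.get((p, y), 0.0)
                if s > cand_score:
                    cand_score, cand_prev = s, p
            if cand_prev is None:
                best[i][y] = neg_inf  # label unreachable at this step
            else:
                best[i][y] = cand_score + unigram[i].get(y, 0.0)
            back[i][y] = cand_prev
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy scores: without constraints, the first two steps would both be Arg1_yes.
toy_unigram = [{"Arg1_yes": 2.0}, {"Arg1_yes": 3.0}, {"None_yes": 1.0}]
toy_path = constrained_viterbi(toy_unigram, {})
```

In the toy example the unconstrained optimum would label two steps as Arg1, whereas the constrained search keeps only the higher-scoring one.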
As discussed in Section 2, our model is a linear model and uses online training to learn the feature weights. Moreover, online training is a decoding-based training procedure: the best result under the current feature weights is found by decoding and is then used for the weight update. Therefore, we have three options for applying constrained decoding.
1. None: We do not use any constraints and apply post-processing to handle inconsistent outputs. When the model labels multiple sentences as Arg1, we keep only the nearest sentence tagged as Arg1. If no sentence is tagged as Arg1, we use the sentence containing the CP as Arg1.
2. Test: We add constraints during the test phase. In the training phase, the optimal ŷ is directly used for the feature weight update without post-processing. Alternatively, we could post-process ŷ so that it contains exactly one Arg1 label before the feature weight update during training, which we leave for future work.
3. Train & test: We add constraints during both the training and test phases.
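The post-processing used in the unconstrained option can be sketched as follows. The function name and the label strings are hypothetical, and we assume here that "nearest" means nearest to the CP sentence in terms of sentence distance.

```python
def postprocess_arg1(labels, cp_index):
    """Pick exactly one Arg1 sentence from an unconstrained label sequence.

    labels: one predicted tag per sentence, in document order.
    cp_index: index of the sentence containing the CP.
    Returns the index of the sentence kept as Arg1.
    """
    tagged = [i for i, lab in enumerate(labels) if lab.startswith("Arg1")]
    if not tagged:
        # No sentence tagged as Arg1: fall back to the CP sentence.
        return cp_index
    # Several candidates: keep the one closest to the CP sentence.
    return min(tagged, key=lambda i: abs(cp_index - i))


kept = postprocess_arg1(["Arg1", "None", "Arg1", "None"], cp_index=3)
fallback = postprocess_arg1(["None", "None"], cp_index=1)
```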
For comparison, we also implement a model based on a two-tag label set of {Arg1, None}, for which constrained decoding cannot guarantee that the output label sequence contains only one Arg1. Therefore, we post-process the results in the same way as for the three-tag model with no constraints.
Table 5 reports the results both with and without error propagation. The "PS/SS classification" model is our re-implementation of the method described in Wang and Lan (2015) under our linear model framework with only unigram features; it considers only the CP sentence and the sentence before it, using a binary classifier. The three-tag model performs best with "test" constraints and, surprisingly, worse with "train & test" constraints. Even though the "PS/SS classification" model is very simple, it is very competitive and achieves better results on the dev data than our proposed three-tag sequence labeling model. We will look into this issue in the future.
Table 6 further investigates the ability of the different models to recall cases in which the sentence containing Arg1 is located far before the sentence containing the CP. Although using "train & test" constraints leads to poor performance, we find that this model can actually recall cases in which Arg1 is located in a long-distance sentence, whereas the models with "test" constraints and with "none" constraints almost always return results in which Arg1 is located in the CP sentence or the previous sentence. We will look into this problem in the future.

Explicit-Arg1/2 Word Locator: Sequence Labeling
Data statistics show that, for explicit relations, nearly all Arg2 instances are located in the same sentence as the CP. Therefore, based on the results of the Arg1 sentence locator, we have two cases to handle: Arg1 and Arg2 are both located in the same sentence as the CP (SS), or Arg1 is located in a sentence preceding the CP (PS). We then use three sequence labeling models to locate the exact words of Arg1/2. All three models operate at the word level, assigning an "Arg1/Arg2/None" tag to each word. Many systems in the CoNLL-2015 evaluation (Xue et al., 2015) also treat Arg1/2 word location as a sequence labeling problem, and use conditional random field (CRF) based models (Stepanov et al., 2015; Nguyen et al., 2015; Lalitha Devi et al., 2015) or recurrent neural networks (RNN).

Explicit: SS Arg1/2 Word Locator
For the SS case, the sequence labeling model performs decoding from left to right on the CP sentence, and classifies each word into four categories: "Arg1/Arg2/None/CP". The words inside the CP (given as input) are fixed to "CP" before decoding, and no other word may be tagged as "CP" during decoding. For the features, we directly adopt those described in Lin et al. (2014), Knott (1996), and Kong et al. (2015). On the dev data, the model achieves a word-level accuracy of 53.45% without error propagation.

Explicit: PS Arg1 Word Locator
For the PS case, we first use a sequence labeling model to locate the words of Arg1. The model performs decoding from left to right on the sentence returned by the Explicit-Arg1 sentence locator, and classifies each word into two categories: "Arg1/None". For the features, we directly adopt those described in Lin et al. (2014), , Knott (1996). On the dev data, the model achieves a word-level accuracy of 67.14% without error propagation.
Table 7: Distribution of adjacent sentences having non-explicit relation.
Train  16940  4850
Dev    718    200

Explicit: PS Arg2 Word Locator
To locate the Arg2 words in the PS case, we use a sequence labeling model that performs decoding from left to right on the CP sentence and classifies each word into two categories: "Arg2/None". Please note that the words in the CP always carry a special tag "CP" during decoding. For the features, we directly adopt those described in Lin et al. (2014), , Wang and Lan (2015), Kong et al. (2015), Knott (1996). On the dev data, the model achieves a word-level accuracy of 67.14% without error propagation.

Explicit Sense Classification
After obtaining the CP and the Arg1/2 words, we then use a linear model based classifier to classify the sense of each explicit relation. We directly adopt the features described in Lin et al. (2014). On the dev data, the model achieves an accuracy of 87.65% without error propagation.

Non-explicit Sense Classification
After processing the explicit relations, we turn to non-explicit relation parsing. As suggested by the official organizers, if two adjacent sentences do not have an explicit relation after the previous processing, we consider them a candidate sentence pair with a non-explicit relation. Please note that we only consider sentence pairs within the same paragraph. As far as we know, most previous work directly treats all adjacent sentences without an explicit relation as having a non-explicit relation, and uses a classifier to predict their non-explicit senses. However, our data statistics in Table 7 show that there exist many false non-explicit cases, which we call negative instances. We add a special tag "None" to the non-explicit sense set and use such false non-explicit cases as negative training instances, so that the trained classifier can make a not-a-non-explicit-relation decision. However, our preliminary results show that adding negative instances does not improve parser performance on the dev data. We will look into this problem in the future.
Table 8: Official results of our system on the dev, test, and blind test datasets. "All" means both explicit and non-explicit relations.
For the features, we directly adopt those described in Lin et al. (2014), , Rutherford and Xue (2014), Kong et al. (2015). On the dev data, the model achieves an accuracy of 34.04% without error propagation.
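The selection of candidate sentence pairs described in this section can be sketched as follows. The data layout (sentence ids grouped by paragraph, and a set of adjacent pairs already linked by an explicit relation) and the function name are assumptions made for illustration.

```python
def candidate_nonexplicit_pairs(paragraphs, explicit_pairs):
    """Collect adjacent sentence pairs that are candidates for a non-explicit relation.

    paragraphs: list of paragraphs, each a list of sentence ids in document order.
    explicit_pairs: set of (first_id, second_id) adjacent pairs that already
        hold an explicit relation after the earlier components have run.
    """
    candidates = []
    for sentences in paragraphs:
        # zip over neighbours never crosses a paragraph boundary.
        for left, right in zip(sentences, sentences[1:]):
            if (left, right) not in explicit_pairs:
                candidates.append((left, right))
    return candidates


toy_paragraphs = [[0, 1, 2], [3, 4]]
toy_explicit = {(1, 2)}
pairs = candidate_nonexplicit_pairs(toy_paragraphs, toy_explicit)
```

Each returned pair is then passed to the sense classifier, which may also output "None" when negative instances are used in training.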

Non-explicit Arg1/2 Word Locator: Sequence Labeling
According to data statistics, if two adjacent sentences have a non-explicit relation, Arg1 is located in the first sentence and Arg2 in the second. Therefore, we use two separate sequence labeling models to locate the Arg1/2 words in the two sentences, respectively. If the non-explicit sense is "EntRel", we directly label the whole first sentence as Arg1 and the whole second sentence as Arg2, again according to data statistics. For the features, we directly adopt those described in Lin et al. (2014), , Wang and Lan (2015), Kong et al. (2015). On the dev data, the two models achieve word-level accuracies of 68.14% on Arg1 and 75.82% on Arg2 without error propagation.

Explicit Sense Classification with a Maximum Entropy Model
After obtaining the evaluation results of all systems, we find that our system achieves clearly lower performance on sense classification than other systems. Therefore, we replace the linear classification model with a log-linear maximum entropy model in the explicit sense classification task. We use AdaGrad to determine the feature update step size (Duchi et al., 2011). Table 9 shows the results. We can see that using the maximum entropy model leads to a large improvement. We then try replacing the linear model with the maximum entropy model in the CP classification task, but obtain very little gain, possibly because the accuracy is already very high with the linear model. We plan to use the maximum entropy model for non-explicit sense classification as well.
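For illustration, a log-linear maximum entropy classifier trained with AdaGrad can be sketched as below. This is a minimal sketch rather than our implementation: the function names, the hyperparameter values, the binary indicator features (assumed distinct within an instance), and the two toy instances are all assumptions.

```python
import math
from collections import defaultdict


def softmax_probs(weights, feats, labels):
    # p(y | x) is proportional to exp(sum of weights of (feature, label) pairs).
    scores = [sum(weights.get((f, y), 0.0) for f in feats) for y in labels]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]


def train_maxent_adagrad(data, labels, epochs=20, eta=0.5, eps=1e-8):
    weights = defaultdict(float)
    grad_sq = defaultdict(float)  # per-feature sum of squared gradients
    for _ in range(epochs):
        for feats, gold in data:
            probs = softmax_probs(weights, feats, labels)
            for y, p in zip(labels, probs):
                # Gradient of the negative log-likelihood for indicator features.
                g = p - (1.0 if y == gold else 0.0)
                for f in feats:
                    key = (f, y)
                    grad_sq[key] += g * g
                    # AdaGrad: scale the step by the accumulated gradient norm.
                    weights[key] -= eta * g / (math.sqrt(grad_sq[key]) + eps)
    return weights


toy_labels = ["Comparison", "Contingency"]
toy_data = [(["conn=because"], "Contingency"), (["conn=but"], "Comparison")]
maxent_weights = train_maxent_adagrad(toy_data, toy_labels)
because_probs = softmax_probs(maxent_weights, ["conn=because"], toy_labels)
```

Unlike the perceptron update, every label's weights move on every instance, in proportion to the probability mass the model currently assigns to that label.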

Conclusions and Future Work
So far, our approach is composed of too many components without any interaction. In the future, we would like to pursue two directions. First, we will try to design a more principled and unified framework so that tasks at different levels can influence each other. Second, we plan to try other machine learning techniques such as neural networks for better representing and modeling discourse-level information.