Building Context-aware Clause Representations for Situation Entity Type Classification

The ability to categorize a clause based on the type of situation entity (e.g., events, states and generic statements) the clause introduces to the discourse can benefit many NLP applications. Observing that the situation entity type of a clause depends on the discourse functions the clause plays in a paragraph, and that the interpretation of discourse functions depends heavily on paragraph-wide contexts, we propose to build context-aware clause representations for predicting situation entity types of clauses. Specifically, we propose a hierarchical recurrent neural network model that reads a whole paragraph at a time and jointly learns representations for all the clauses in the paragraph by extensively modeling context influences and inter-dependencies of clauses. Experimental results show that our model achieves state-of-the-art performance for clause-level situation entity classification on the genre-rich MASC+Wiki corpus, approaching human-level performance.


Introduction
Clauses in a paragraph play different discourse and pragmatic roles and accordingly have different aspectual properties (Smith, 1997; Verkuyl, 2013). We aim to categorize a clause based on its aspectual property and, more specifically, based on the type of Situation Entity (SE) (e.g., events, states, generalizing statements and generic statements) the clause introduces to the discourse, following the recent work by Friedrich et al. (2016). Understanding SE types of clauses is beneficial for many NLP tasks, including discourse mode identification (Smith, 2003, 2005), text summarization, information extraction and question answering.
The situation entity type of a clause reflects the discourse roles the clause plays in a paragraph, and discourse role interpretation depends heavily on paragraph-wide contexts. Recently, Friedrich et al. (2016) used insightful syntactic-semantic features extracted from the target clause itself for SE type classification and achieved good performance across several genres when evaluated on the newly created large dataset MASC+Wiki. In addition, Friedrich et al. (2016) implemented a sequence labeling model with conditional random fields (CRF) (Lafferty et al., 2001) for fine-tuning a sequence of predicted SE types. However, other than leveraging common SE label patterns (e.g., GENERIC clauses tend to cluster together), this approach largely ignored the wider contexts a clause appears in when predicting its SE type.
To further improve the performance and robustness of situation entity type classification, we argue that we should consider influences of wider contexts more extensively, not only by fine-tuning a sequence of SE type predictions, but also in deriving clause representations and obtaining precise individual SE type predictions. For example, we distinguish GENERIC statements from GENERALIZING statements depending on whether a clause expresses general information over classes or kinds instead of specific individuals. We recognize the latter two clauses in the following paragraph as GENERALIZING because both clauses describe situations related to the Amazon river:
(1) [Today, the Amazon river is experiencing a crisis of overfishing.] STATE [Both subsistence fishers and their commercial rivals compete in netting large quantities of pacu,] GENERALIZING [which bring good prices at markets in Brazil and abroad.] GENERALIZING
If we ignore the wider context, the second clause can easily be misrecognized as GENERIC, since "fishers" usually refers to a general class rather than specific individuals. However, given the background introduced in the first clause, "fishers" here refers to the fishers who fish on the Amazon river, who are therefore specific individuals.
Therefore, we aim to dynamically build context-aware clause representations that are informed by their paragraph-wide contexts. Specifically, we propose a hierarchical recurrent neural network model that reads a whole paragraph at a time and jointly learns representations for all the clauses in the paragraph. Our paragraph-level model derives clause representations by modeling inter-dependencies between clauses within a paragraph. In order to further improve SE type classification performance, we also add an extra CRF layer at the top of our paragraph-level model to fine-tune a sequence of SE type predictions over clauses (Friedrich et al., 2016), which, however, is not our contribution.
Experimental results show that our paragraph-level neural network model greatly improves the performance of SE type classification on the same MASC+Wiki corpus (Friedrich et al., 2016) and achieves robust performance close to human level. In addition, the CRF layer further improves the SE type classification results, but only by a small margin. We hypothesize that situation entity type patterns across clauses may have been largely captured by allowing the preceding and following clauses to influence semantic representation building for a clause in the paragraph-level neural network model.

Linguistic Categories of SE Types
The situation entity types annotated in the MASC+Wiki corpus (Friedrich et al., 2016) were initially introduced by Smith (2003), which were then extended by Palmer et al. (2007) and Friedrich and Palmer (2014b). The situation entity types can be divided into the following broad categories:
• Eventualities (EVENT, STATE and REPORT): for clauses representing actual happenings and world states. STATE and EVENT are two fundamental aspectual classes of a clause (Siegel and McKeown, 2000) which can be distinguished by the semantic property of dynamism. REPORT is a subtype of EVENT for quoted speech.
• General Statives (GENERIC and GENERALIZING): for clauses that express general information over classes or kinds, or regularities related to specific main referents. The type GENERIC is for utterances describing a general class or kind rather than any specific individuals (e.g., People love dogs.). The type GENERALIZING is for habitual utterances that refer to ongoing actions or properties of specific individuals (e.g., Audubon educates the public.).
• Speech Acts (QUESTION and IMPERATIVE): for clauses expressing two types of speech acts (Searle, 1969).

Situation Entity (SE) Type Classification
Although situation entities have been well studied in linguistics, only a few previous works have focused on data-driven SE type classification using computational methods. Palmer et al. (2007) first implemented a maximum entropy model for SE type classification relying on words, POS tags and some linguistic cues as main features. This work used a relatively small dataset (around 4300 clauses) and did not achieve satisfactory performance (around 50% accuracy).
To bridge the gap, Friedrich et al. (2016) created a much larger dataset, MASC+Wiki (more than 40,000 clauses), and achieved better SE type classification performance (around 75% accuracy) by using rich features extracted from the target clause. The feature sets include POS tags, Brown cluster features, syntactic and semantic features of the main verb and main referent, as well as features indicating the aspectual nature of a clause. Friedrich et al. (2016) further improved the performance by implementing a sequence labeling (CRF) model to fine-tune a sequence of SE type predictions and noted that much of the performance gain came from modeling the label pattern that GENERIC clauses often occur together. In contrast, we focus on deriving dynamic clause representations informed by paragraph-level contexts and model context influences more extensively. Becker et al. (2017) proposed a GRU-based neural network model that predicts the SE type for one clause at a time, by encoding the content of the target clause using a GRU and incorporating several sources of context information, including contents and labels of preceding clauses as well as genre information, using additional separate GRUs (Chung et al., 2014). This model differs from our approach, which processes one paragraph (with a sequence of clauses) at a time and extensively models inter-dependencies of clauses.

Paragraph-level Sequence Labeling
Learning latent representations and predicting a sequence of labels from a long sequence of sentences (clauses), such as a paragraph, is a challenging task. Recently, various neural network models, including Convolutional Neural Network (CNN) models (Wang and Lu, 2017), Recurrent Neural Network (RNN) based models (Wang et al., 2015; Chiu and Nichols, 2016; Huang et al., 2015; Ma and Hovy, 2016; Lample et al., 2016) and Sequence to Sequence models (Vaswani et al., 2016; Zheng et al., 2017), have been applied to the general task of sequence labeling. Among them, the bidirectional LSTM (Bi-LSTM) model (Schuster and Paliwal, 1997) has been widely used to process a paragraph for applications such as language generation (Li et al., 2015), dialogue systems (Serban et al., 2016) and text summarization (Nallapati et al., 2016), because of its capabilities in modeling long-distance dependencies between words. In this work, we use two levels of Bi-LSTMs connected by a max-pooling layer to abstract clause representations by extensively modeling paragraph-wide contexts and inter-dependencies between clauses.

The Hierarchical Recurrent Neural Network for SE Type Classification
We design a unified neural network to extensively model word-level dependencies as well as clause-level dependencies in deriving clause representations for SE type prediction. Figure 1 shows the architecture of the proposed paragraph-level neural network model, which includes two Bi-LSTM layers, one max-pooling layer in between and one final softmax prediction layer. Given the word sequence of one paragraph as input, the word-level Bi-LSTM first generates a sequence of hidden states as word representations; then a max-pooling layer is applied to abstract clause embeddings from the word representations within a clause. Next, a clause-level Bi-LSTM runs over the sequence of clause embeddings and derives final clause representations by further modeling semantic dependencies between clauses within a paragraph. The softmax prediction layer then predicts a sequence of situation entity (SE) types, with one label for each clause, based on the final clause representations.
Word Vectors: To transform the one-hot representation of each word into its distributed word vector, we used the pretrained 300-dimension Google English word2vec embeddings. For words that are not included in the vocabulary of Google word2vec, we randomly initialize their word vectors with each dimension sampled from the range [−0.25, 0.25].
For situation entity type classification, it is important to recognize certain types of words such as punctuation marks (e.g., "?" for QUESTION and "!" for IMPERATIVE) as well as entities such as locations and time values. We therefore created feature-rich word vectors by concatenating word embeddings with part-of-speech (POS) tag and named-entity (NE) tag one-hot embeddings.
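The construction of feature-rich word vectors can be illustrated with a minimal sketch in plain Python. The tag inventories, the toy `pretrained` dictionary standing in for word2vec, and all function names are assumptions made for illustration; only the 300-dimension embedding size, the [−0.25, 0.25] initialization range and the concatenation with POS and NE one-hot vectors come from the text.

```python
import random

EMBED_DIM = 300  # embedding size stated in the text
POS_TAGS = ["NOUN", "VERB", "PUNCT"]   # hypothetical, truncated tag set
NE_TAGS = ["O", "LOCATION", "TIME"]    # hypothetical, truncated tag set

def one_hot(tag, tag_set):
    """Return a one-hot list marking the position of `tag` in `tag_set`."""
    return [1.0 if t == tag else 0.0 for t in tag_set]

def lookup_embedding(word, pretrained, rng):
    """Use the pretrained vector if available; otherwise sample each
    dimension uniformly from [-0.25, 0.25] and cache the result."""
    if word not in pretrained:
        pretrained[word] = [rng.uniform(-0.25, 0.25) for _ in range(EMBED_DIM)]
    return pretrained[word]

def feature_rich_vector(word, pos, ne, pretrained, rng):
    """Concatenate the word embedding with POS and NE one-hot vectors."""
    return (lookup_embedding(word, pretrained, rng)
            + one_hot(pos, POS_TAGS)
            + one_hot(ne, NE_TAGS))

rng = random.Random(0)
pretrained = {"river": [0.1] * EMBED_DIM}  # toy stand-in for word2vec
vec = feature_rich_vector("Amazon", "NOUN", "LOCATION", pretrained, rng)
print(len(vec))  # 300 + 3 + 3 = 306
```

With real tag sets the one-hot parts would simply be longer; the word embedding part stays at 300 dimensions.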
Deriving Clause Representations: In designing the model, we focus on building clause representations that sufficiently leverage cues from paragraph-wide contexts for SE type prediction, including both preceding and following clauses in a paragraph. To process long paragraphs which may contain a number of clauses, we utilize a two-level bottom-up abstraction approach: we first obtain the compositional representation of each word (low-level) and then compute a compositional representation of each clause (high-level), with a max-pooling layer in between.
At both word-level and clause-level, we choose the Bi-LSTM as our basic neural net component for representation learning, mainly considering its ability to capture long-distance dependencies between words (clauses) and to integrate influences of context words (clauses) from both directions.
Given a word sequence $X = (x_1, x_2, ..., x_L)$ in a paragraph as the input, the word-level Bi-LSTM processes the input paragraph using two separate LSTMs: one processes the word sequence from left to right while the other processes it from right to left. Therefore, at each word position $t$, we obtain two hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ and concatenate them to get the word representation $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. Then we apply the max-pooling operation over the sequence of word representations for words within a clause in order to get the initial clause embedding:
$$h_{Clause}[j] = \max_{x_t \in Clause} h_t[j] \quad (1)$$
$$\text{where } 1 \le j \le \text{hidden unit size} \quad (2)$$
Next, the clause-level Bi-LSTM processes the sequence of initial clause embeddings in a paragraph and generates refined hidden states $\overrightarrow{h}^{Clause}_t$ and $\overleftarrow{h}^{Clause}_t$ at each clause position $t$. Then, we concatenate the two hidden states for a clause to get the final clause representation $h^{Clause}_t = [\overrightarrow{h}^{Clause}_t ; \overleftarrow{h}^{Clause}_t]$.
Situation Entity Type Classification: Finally, the prediction layer predicts the situation entity type for each clause by applying the softmax function to its clause representation:
$$y_t = \mathrm{softmax}(W_y h^{Clause}_t + b_y)$$
where $W_y$ and $b_y$ denote the weight matrix and bias of the prediction layer.
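The max-pooling and softmax steps described above can be sketched in plain Python. This is an illustration only: the two Bi-LSTM layers are omitted, the toy `states` stand in for word-level Bi-LSTM outputs, and the clause spans and function names are assumptions rather than part of the original implementation.

```python
import math

def max_pool(word_states):
    """Element-wise max over the word representations of one clause."""
    return [max(dim) for dim in zip(*word_states)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def paragraph_clause_embeddings(word_states, clause_spans):
    """Pool the word states of each clause, given (start, end) spans
    with `end` exclusive, into one initial embedding per clause."""
    return [max_pool(word_states[s:e]) for s, e in clause_spans]

# Toy word-level states (hidden size 3) for a 5-word paragraph with 2 clauses.
states = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1], [-0.3, 0.5, 0.2],
          [0.2, 0.2, 0.2], [0.0, 0.9, -0.5]]
clauses = paragraph_clause_embeddings(states, [(0, 3), (3, 5)])
print(clauses)  # [[0.4, 0.5, 0.3], [0.2, 0.9, 0.2]]

# Toy label scores for one clause; in the model these would come from
# a linear layer applied to the final clause representation.
probs = softmax([1.0, 2.0, 3.0])
```

In the full model, the pooled embeddings would be passed through the clause-level Bi-LSTM before the softmax layer; pooling per clause is what lets a paragraph of any length be reduced to one vector per clause.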

Fine-tune Situation Entity Predictions with a CRF Layer
Previous studies (Friedrich et al., 2016; Becker et al., 2017) show that there exist common SE label patterns between adjacent clauses. For example, Friedrich et al. (2016) reported that GENERIC sentences usually occur together in a paragraph. Following Friedrich et al. (2016), in order to capture SE label patterns in our hierarchical recurrent neural network model, we add a CRF layer at the top of the softmax prediction layer (shown in Figure 2) to fine-tune predicted situation entity types. The CRF layer updates a state-transition matrix, which can effectively adjust the current label depending on its preceding and following labels. Both the training and decoding procedures of the CRF layer can be conducted efficiently using the Viterbi algorithm. With the CRF layer, the model jointly assigns a sequence of SE labels, one label per clause, by considering individual clause representations as well as common SE label patterns.
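To make the role of the state-transition matrix concrete, the following is a minimal sketch of Viterbi decoding over per-clause label scores in plain Python. It covers decoding only, uses two labels instead of the full SE inventory, and all scores are invented for illustration; it is not the authors' implementation.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][k]: score of label k for clause t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for j in range(n_labels):
            # Best previous label for ending in label j at clause t.
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
        score = new_score
        backptr.append(ptrs)
    # Follow back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return path[::-1]

LABELS = ["STATE", "GENERIC"]                      # illustrative subset
emissions = [[0.2, 1.0], [0.6, 0.5], [0.1, 1.2]]   # made-up clause scores
transitions = [[0.0, -0.5], [-0.5, 0.8]]           # rewards GENERIC -> GENERIC
decoded = [LABELS[k] for k in viterbi(emissions, transitions)]
print(decoded)  # ['GENERIC', 'GENERIC', 'GENERIC']
```

Taken in isolation, the second clause slightly prefers STATE (0.6 versus 0.5), but the transition scores favoring runs of GENERIC flip it, which mirrors the label pattern that GENERIC clauses tend to occur together.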

Parameter Settings and Model Training
We finalized hyperparameters based on the best performance with 10-fold cross-validation on the training set. The word vectors were fixed during model training. Both word representations and clause representations in the model are of 300 dimensions, and all the Bi-LSTM layers contain 300 hidden units as well. To avoid overfitting, we applied dropout (Hinton et al., 2012) with a dropout rate of 0.5 to both input and output vectors of Bi-LSTM layers. To deal with the exploding gradient problem in LSTM training, we utilized gradient clipping (Pascanu et al., 2013) with a gradient L2-norm threshold of 5.0 and simultaneously used L2 regularization with λ = 10^{-4}. These parameters remained the same for all our proposed models, including our own baseline models.
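Gradient clipping by L2 norm can be sketched in a few lines of plain Python. The sketch assumes the norm is computed jointly over all gradient values, which is the usual reading of norm-based clipping; the flat list of gradients and the function name are illustrative.

```python
import math

def clip_by_l2_norm(grads, threshold=5.0):
    """Rescale gradients so their joint L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    scale = threshold / norm
    return [g * scale for g in grads]

clipped = clip_by_l2_norm([6.0, 8.0])  # norm 10 is rescaled to norm 5
print(clipped)  # [3.0, 4.0]
```

Rescaling keeps the direction of the update unchanged and only limits its magnitude, which is why it is a common remedy for exploding gradients in recurrent networks.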
We chose the standard cross-entropy loss function for training our neural network models and adopted the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 and a batch size of 128, counted as the number of SEs rather than paragraph instances. All our proposed models were implemented with PyTorch and converged to the best result within 40 epochs. Note that, to diminish the effects of randomness in training neural network models and to report stable experimental results, we ran each of the proposed models as well as our own baseline models ten times and reported the averaged performance across the ten runs.

Dataset and Preprocessing
The MASC+Wiki Corpus: We evaluated our neural network model on the MASC+Wiki corpus (Friedrich et al., 2016), which contains more than 40,000 clauses. Following Friedrich et al. (2016), texts were split into clauses using SPADE (Soricut and Marcu, 2003). There are 4,784 paragraphs in total in the corpus, and on average each paragraph contains 9.6 clauses. In Figure 4, the horizontal axis shows the distribution of paragraphs based on the number of clauses in a paragraph. The annotations of clauses are stored in separate files from the text files. To recover the paragraph context of each clause, we matched its content with the corresponding raw document.

Systems for Comparisons
We compare the performance of our neural network model with two recent SE type classification models on the MASC+Wiki corpus as well as humans' performance (upper bound).
• CRF (Friedrich et al., 2016): a CRF model that relies heavily on features extracted from the target clause itself.
• GRU (Becker et al., 2017): a GRU-based neural network model that incorporates context information by using separate GRU units and predicts the SE type for one clause at a time.
• Humans (Friedrich et al., 2016): one annotator's performance when using two other annotators' annotation as "gold labels". It has been reported that labeling SE types is a nontrivial task even for humans.
In addition, we implemented a clause-level Bi-LSTM model as our own baseline, which takes a single clause as its input. Since there is only one clause, the upper Bi-LSTM layer shown in Figure 1 serves no purpose and is removed in the clause-level Bi-LSTM model.

Model                            Macro   Acc
CRF (Friedrich et al., 2016)     69.3    74.7
GRU (Becker et al., 2017)        68.

Experimental Results
Following the previous work (Friedrich et al., 2016) on the same task and dataset, we report accuracy and macro-average F1-score across SE types on the test set of MASC+Wiki.
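For reference, the two reported metrics can be computed as in the following plain-Python sketch. It assumes macro-average F1 is the unweighted mean of per-class F1 over the classes present in the gold labels; the exact averaging convention used in the reported experiments may differ, and the toy labels are invented.

```python
def accuracy(gold, pred):
    """Fraction of clauses whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes seen in `gold`."""
    f1s = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

gold = ["STATE", "STATE", "EVENT", "GENERIC"]
pred = ["STATE", "EVENT", "EVENT", "GENERIC"]
print(accuracy(gold, pred))  # 0.75
print(round(macro_f1(gold, pred), 4))  # 0.7778
```

Macro-averaging gives each SE type equal weight, so small classes such as QUESTION and IMPERATIVE influence the score as much as frequent ones such as STATE.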
The first section of Table 3 shows the results of previous work. The second section shows the result of our clause-level Bi-LSTM baseline, which already outperforms the previous best model. This result demonstrates the effectiveness of the Bi-LSTM + max-pooling approach in clause representation learning (Conneau et al., 2017). The third section reports the performance of the paragraph-level models that use paragraph-wide contexts as input. Compared with the baseline clause-level Bi-LSTM model, the basic paragraph-level model achieves performance gains of 3.5% and 3.3% in macro-average F1-score and accuracy respectively. Building on top of the basic paragraph-level model, the CRF layer further improves the SE type prediction performance slightly, by 0.4% and 0.7% in macro-average F1-score and accuracy respectively. Therefore, our full model with the CRF layer achieves state-of-the-art performance on the MASC+Wiki corpus.

10-Fold Cross-Validation
We noticed that the previous work (Friedrich et al., 2016) did not publish the class-wise performance of their model on the test set; instead, they reported the detailed performance on the training set using 10-fold cross-validation. For direct comparison, we also report our 10-fold cross-validation results on the training set of MASC+Wiki. Table 2 reports the cross-validation classification results. Consistently, our clause-level baseline model already outperforms the previous best model. By exploiting paragraph-wide contexts, the basic paragraph-level model obtains consistent performance improvements across all the classes compared with the baseline clause-level prediction model, especially for the classes GENERIC and GENERALIZING, where the improvements are significant. After using the CRF layer to fine-tune the predicted SE label sequence, slight performance improvements were observed on the four small classes. Overall, the full paragraph-level neural network model achieves the best macro-average F1-score of 77.8% in predicting SE types, which not only outperforms all previous approaches but also reaches human-like performance on some classes.

Model                            Macro   Acc   STA   EVE   REP   GENI   GENA   QUE   IMP
CRF (Friedrich et al., 2016)     66

Table 4: Cross-genre Classification Results on the Training Set of MASC+Wiki. We report accuracy (Acc), macro-average F1-score (Macro) and class-wise F1 scores.

Impact of Genre
Considering that MASC+Wiki is rich in written genres, we additionally conduct cross-genre classification experiments, where we use one genre of documents for testing and the other genres of documents for training. The purpose of the cross-genre experiment is to see whether the model can work robustly across genres. Table 4 shows cross-genre experimental results of our neural network models on the training set of MASC+Wiki, treating each genre as one cross-validation fold. As we expected, both the macro-average F1-score and class-wise F1 scores are lower compared with the results in Table 2, where in-genre data were used for model training as well. However, the performance drop of the paragraph-level models is small, and they clearly outperform the previous system (Friedrich et al., 2016) and the baseline model by a large margin. As shown in Table 5, benefiting from modeling wider contexts and common SE label patterns, our full paragraph-level model improves performance across almost all the genres. The high performance in the cross-genre setting demonstrates the robustness of our paragraph-level model across genres.
Paragraph-level Model + CRF. We report macro-average F1-score for each genre.
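The cross-genre protocol, with one fold per genre, can be sketched as follows in plain Python. The document records, the `genre` field and the genre names are hypothetical placeholders; only the idea of holding out one genre for testing and training on the rest comes from the text.

```python
def leave_one_genre_out(documents):
    """Yield (held_out_genre, train_docs, test_docs), one fold per genre."""
    genres = sorted({doc["genre"] for doc in documents})
    for genre in genres:
        train = [d for d in documents if d["genre"] != genre]
        test = [d for d in documents if d["genre"] == genre]
        yield genre, train, test

# Hypothetical document records for illustration.
docs = [
    {"id": 1, "genre": "news"},
    {"id": 2, "genre": "fiction"},
    {"id": 3, "genre": "news"},
    {"id": 4, "genre": "wiki"},
]
folds = list(leave_one_genre_out(docs))
for genre, train, test in folds:
    print(genre, [d["id"] for d in train], [d["id"] for d in test])
```

Because no document of the held-out genre is seen during training, the resulting scores measure how well the model transfers to an unseen genre rather than how well it fits in-genre data.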

Impact of Training Data Size
In order to understand how much training data is required to train the paragraph-level model and obtain good performance for SE type classification, we plot the learning curve shown in Figure 3 by training the full model several times using an increasing amount of training data. The classification performance increases quickly until the amount of training data reaches 30% of the full training set; afterwards, the learning curve starts to saturate. We conclude that the paragraph-level model can achieve high performance without requiring a large amount of training data.

Impact of Paragraph Length
To study the influence of paragraph length on the performance of the paragraph-level models, we report the performance of our proposed models on subsets of the test set, with paragraphs divided based on the number of clauses in a paragraph. The histogram in Figure 4 compares the performance of the two paragraph-level models and the baseline model. Note that the last bucket of the histogram (paragraphs containing ten or more clauses) is especially large and contains over 30% of all the paragraphs in the test set. Clearly, the paragraph-level model greatly outperforms the baseline clause-level model on paragraphs containing more than 6 clauses, which account for over 50% of the test set. Adding the CRF layer further improves the performance of the paragraph-level model on long paragraphs (with 10 or more clauses), while its influence on performance is mixed on short paragraphs. Therefore, it is beneficial to model wider paragraph-level contexts and inter-dependencies between clauses for situation entity type classification, especially when processing long paragraphs.
Figure 4: Impact of Paragraph Lengths. We plot the macro-average F1-score for each paragraph length.

Impact of Discourse Connective Phrases
As one aspect of modeling context influences and clause inter-dependencies in SE type identification, we investigated the role of discourse connective phrases in determining the SE type of clauses they connect. Our assumption is that discourse connectives are important to glue clauses together and removing them affects text coherence and information flow between clauses. Intuitively, the connective "and" may occur between two clauses with the same SE type; "for example" may indicate that the following clause is not GENERIC. Therefore, we designed a pilot experiment to see whether discourse connective phrases are indispensable in building clause representations.
In this pilot experiment, we extracted a list of 100 explicit discourse connectives from the PDTB corpus (Prasad et al., 2008) and identified clauses that start with a discourse connective (we found that 20.6% of clauses in the MASC+Wiki corpus contain a discourse connective phrase). Then we ran the full paragraph-level model with one modification, i.e., disregarding words in connective phrases when conducting the max-pooling operation in equation (1), so that discourse connective phrases were not considered directly when building a clause representation.
As shown in Table 6, for clauses containing a discourse connective phrase, both macro-average F1-score and accuracy dropped due to the exclusion of discourse connective phrases. The performance was negatively influenced across all the SE types except QUESTION and IMPERATIVE. A possible explanation is that recognizing QUESTION (IMPERATIVE) clauses mainly relies on seeing certain punctuation marks and key words, such as "?" ("!") and "why" ("please"), which are independent of discourse connectives. The performance decreases on three SE types, REPORT, GENERIC and GENERALIZING, are noticeable. To some extent, this pilot study shows that modeling text coherence and the overall discourse structure of a paragraph is important in situation entity type classification.
Table 7 reports the confusion matrix of the full model on the training set of MASC+Wiki with cross-validation. We can see that the four situation entity types, including two eventualities (STATE and EVENT) and two general sta-