Discourse as a Function of Event: Profiling Discourse Structure in News Articles around the Main Event

Understanding the discourse structure of news articles is vital to effectively contextualizing the occurrence of a news event. To enable computational modeling of news structures, we apply an existing theory of functional discourse structure for news articles that revolves around the main event, and create a human-annotated corpus of 802 documents spanning four domains and three media sources. Next, we propose several document-level neural-network models to automatically construct news content structures. Finally, we demonstrate that incorporating system-predicted news structures yields new state-of-the-art performance for event coreference resolution. The news documents we annotated are openly available and the annotations are publicly released for future research.


Introduction
Detecting and incorporating discourse structures is important for achieving text-level language understanding. Several well-studied discourse analysis tasks, such as RST (Mann and Thompson, 1988) and PDTB-style (Prasad et al., 2008) discourse parsing and text segmentation (Hearst, 1994), generate rhetorical and content structures that have been shown useful for many NLP applications. But these widely applicable discourse structures overlook genre-specific properties. In this paper, we focus on studying content structures specific to news articles, a broadly studied text genre for many NLP tasks and applications. We believe that genre-specific discourse structures can effectively complement genre-independent discourse structures and are essential for achieving deep story-level text understanding.
What is in a news article? Normally, we expect a news article to describe well-verified facts of recently occurred events, a.k.a. the main events. However, almost no news article limits itself to reporting only the main events. Most news articles also report context-informing content, including recent precursor events and current general circumstances, that is meant to directly explain the cause or context of the main events. In addition, they often contain sentences providing further supportive information that is arguably less relevant to the main events, comprising unverifiable or hypothetical anecdotal facts, opinionated statements, future projections and historical backgrounds. Notably, the relevance order of sentences is not always aligned with their textual order: sentences in a news article are ordered by their perceived importance, which is determined by multiple factors, including content relevance as well as the focus of the article, the author's preferences and writing strategies.
While a number of theoretical studies for news discourse exist, little prior effort has been put into computational modeling and automatic construction of news content structures. We introduce a new task and a new annotated text corpus for profiling news discourse structure that categorizes the contents of news articles around the main event. The NewsDiscourse corpus consists of 802 news articles (containing 18,155 sentences), sampled from three news sources (NYT, Xinhua and Reuters) and covering four domains (business, crime, disaster and politics). In this corpus, we label each sentence with one of eight content types reflecting common discourse roles of a sentence in telling a news story, following the news content schemata proposed by Van Dijk (1986; 1988a,b) with several minor modifications.
Next, we present several baselines for automatically identifying the content type of sentences. The experimental results show that a decent performance can be obtained using a basic neural network-based multi-way classification approach. The sentence classification performance can be further improved by modeling interactions between sentences in a document and identifying sentence types in reference to the main event of a document.
We envision that the news discourse profiling dataset as well as the learnt computational systems are useful to many discourse level NLP tasks and applications. As an example, we analyze correlations between content structures and event coreference structures in news articles, and conduct experiments to incorporate system predicted sentence content types into an event coreference resolution system. Specifically, we analyze the lifespan and spread of event coreference chains over different content types, and design constraints to capture several prominent observations for event coreference resolution. Experimental results show that news discourse profiling enables consistent performance gains across all the evaluation metrics on two benchmark datasets, improving the previous best performance for the challenging task of event coreference resolution.

Related Work
Several well-studied discourse analysis tasks have been shown useful for many NLP applications. The RST (Mann and Thompson, 1988;Soricut and Marcu, 2003;Feng and Hirst, 2012;Ji and Eisenstein, 2014;Li et al., 2014a;Liu et al., 2019) and PDTB style (Prasad et al., 2008;Pitler and Nenkova, 2009;Lin et al., 2014;Rutherford and Xue, 2016;Qin et al., 2016;Xu et al., 2018) discourse parsing tasks identify discourse units that are logically connected with a predefined set of rhetorical relations, and have been shown useful for a range of NLP applications such as text quality assessment (Lin et al., 2011), sentiment analysis (Bhatia et al., 2015), text summarization (Louis et al., 2010), machine translation (Li et al., 2014b) and text categorization (Ji and Smith, 2017). Text segmentation (Hearst, 1994;Choi, 2000;Eisenstein and Barzilay, 2008;Koshorek et al., 2018) is another well studied discourse analysis task that aims to divide a text into a sequence of topically coherent segments and has been shown useful for text summarization (Barzilay and Lee, 2004), sentiment analysis (Sauper et al., 2010) and dialogue systems (Shi et al., 2019). The news discourse profiling task is complementary to the well-established discourse analysis tasks and is likely to further benefit many NLP applications. First, it studies genre-specific discourse structures, while the aforementioned discourse analysis tasks study genre independent general discourse structures and thus fail to incorporate domain knowledge. Second, it focuses on understanding global content organization structures with the main event at the center, while the existing tasks focus on either understanding rhetorical aspects of discourse structures (RST and PDTB discourse parsing) or detecting shallow topic transition structures (text segmentation).
Genre-specific functional structures have been studied based on different attributes, but mostly for genres other than news articles. Liddy (1991), Kircz (1991) and Teufel et al. (1999) used rhetorical status and argumentation type to both define functional theories and create corpora for scientific articles. Mizuta et al. (2006), Wilbur et al. (2006), Waard et al. (2009) and Liakata et al. (2012) extensively studied functional structures in the biological domain with multiple new annotation schemata.
Past studies on functional structures of news articles have been mainly theoretical. Apart from Van Dijk's theory of news discourse (Van Dijk, 1986, 1988b), Pan and Kosicki (1993) proposed a framing-based approach along four structural dimensions: syntactic, script, thematic and rhetorical, of which the syntactic structure is similar to van Dijk's theory. Owing to the high specificity of van Dijk's theory, Yarlott et al. (2018) performed a pilot study of its computational feasibility and annotated a small dataset of 50 documents taken from the ACE Phase 2 corpus (Doddington et al., 2004). However, as mentioned in the paper, their annotators were given minimal training prior to annotation; consequently, the kappa inter-annotator agreement (55%) between their two annotators was not satisfactory. In addition, the coverage of their annotated dataset over broad event domains and media sources was unclear. The only studies on functional structures of news articles with sizable datasets include Baiamonte et al. (2016), which coarsely separates narration from descriptive contents, and Friedrich and Palmer (2014), which classifies clauses based on their aspectual properties.

Elements of Discourse Profiling
We consider sentences to be units of discourse and define eight schematic categories to study their roles within the context of the underlying topic. The original Van Dijk's theory was designed for analyzing discourse functions of individual paragraphs w.r.t. the main event, and the pilot study by Yarlott et al. (2018) also used paragraphs as the units of annotation. Observing that some paragraphs contain more than one type of content, we decided to conduct sentence-level annotations instead, to minimize disagreements between annotators and allow consistent annotations². Table 1 contains an example for each content type. Consistent with the theory presented by Van Dijk, the categories are theoretical and some of them may not occur in every news article.

Table 1: An example sentence for each content type.

Main Content
(1) Main Event: U.S. President Donald Trump tried on Tuesday to calm a storm over his failure to hold Russian President Vladimir Putin accountable for meddling in the 2016 U.S. election, saying he misspoke in a joint news conference in Helsinki.
(2) Consequence: The rouble fell 1.2 percent on Tuesday following Trump's statement.

Context-informing Content
(3) Previous Event: Trump praised the Russian leader for his "strong and powerful" denial of the conclusions of U.S. intelligence agencies that the Russian state meddled in the election.
(4) Current Context: Special Counsel Robert Mueller is investigating that allegation and any possible collusion by Trump's campaign.

Additional Supportive Content
(5) Historical Event: Congress passed a sanctions law last year targeting Moscow for election meddling.
(6) Anecdotal Event: "The threat of wider sanctions has grown," a businessman told Reuters, declining to be named because of the subject's sensitivity.
(7) Evaluation: Republicans and Democrats accused him of siding with an adversary rather than his own country.
(8) Expectation: McConnell and House Speaker Paul Ryan, who called Russia's government "menacing," said their chambers could consider additional sanctions on Russia.

Main Contents
Main content describes what the text is about: the most relevant information of the news article. It describes the most prominent event and its consequences, which render the highest-level topic of the news report. Main Event (M1) introduces the most important event and relates to the major subjects in a news report. It follows the strict constraints of being the most recent and relevant event, and directly governs the processing of the remaining document. The categories of all other sentences in the document are interpreted with respect to the main event. Consequence (M2) informs about events that are triggered by the main news event; they either temporally overlap with the main event or happen immediately after it.
2 Our two annotators agreed that the majority of sentences describe one type of content. For the small number of sentences that contain a mixture of contents, we asked our annotators to assign the label that reflects the main discourse role of the sentence in the bigger context.

Context-informing Contents
Context-informing sentences provide information related to the actual situation in which the main event occurred, including the previous events and other contextual facts that directly explain the circumstances that led to the main event. Previous Event (C1) describes real events that preceded the main event and now act as possible causes or preconditions for it. They are restricted to events that occurred very recently, within the last few weeks. Current Context (C2) covers all the information that provides context for the main event. Such sentences mainly activate the situation model of current events and states that help to understand the main event in the current social or political construct; they temporally co-occur with the main event or describe the ongoing situation.

Additional Supportive Contents
Finally, sentences containing the least relevant information, comprising unverifiable or hypothetical facts, opinionated statements, future projections and historical backgrounds, are classified as distantly-related content. Historical Event (D1) temporally precedes the main event by months or years. It constitutes past events that may have led to the current situation, or that indirectly relate to the main event or subjects of the news article. Anecdotal Event (D2) includes events with specific participants that are difficult to verify. It may include fictional situations or the personal account of an incident by an unknown person, especially one aimed at exaggerating the situation. Evaluation (D3) introduces reactions from immediate participants, experts or known personalities that are opinionated, and may also include explicit opinions of the author or those of the news source. They are often meant to describe the social or political implications of the main event or to evaluate the current situation. Typically, it uses statements from influential people to selectively emphasize their viewpoints. Expectation (D4) speculates on the possible consequences of the main or contextual events. Such sentences are essentially opinions, but with far stronger implications, where the author tries to evaluate the current situation by projecting possible future events.
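The eight fine-grained types and their coarse groupings can be summarized as a small lookup table. The sketch below is an illustration of the schema as described above, not part of the released annotation tooling:

```python
# Fine-grained content types from the annotation schema, grouped into
# the three coarse-grained categories used throughout the paper.
CONTENT_TYPES = {
    "M1": ("Main Event", "Main"),
    "M2": ("Consequence", "Main"),
    "C1": ("Previous Event", "Context-informing"),
    "C2": ("Current Context", "Context-informing"),
    "D1": ("Historical Event", "Additional Supportive"),
    "D2": ("Anecdotal Event", "Additional Supportive"),
    "D3": ("Evaluation", "Additional Supportive"),
    "D4": ("Expectation", "Additional Supportive"),
}

def coarse_label(code: str) -> str:
    """Map a fine-grained code (e.g. 'C1') to its coarse-grained category."""
    return CONTENT_TYPES[code][1]
```

This mapping mirrors the coarse-grained evaluation setting (main vs. context-informing vs. supportive contents) used in the experiments.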

Speech vs. Not Speech
In parallel with the discourse profiling annotations, we also identify sentences that contain direct quotes or paraphrased comments stated directly by a human and label them as Speech. We assign a binary label, Speech vs. Not Speech, to each sentence independently of the annotations of the above eight schematic discourse roles. Note that Speech sentences may be annotated with any of the eight news discourse roles based on their contents, although we expect Speech sentences to serve certain discourse roles more often, such as Evaluation and Expectation.

Modifications to the Van Dijk Theory
Van Dijk's theory was originally based on case studies of specific news reports. To accommodate wider settings covering different news domains and sources, we made several minor modifications to the original theory. First, we label both comments made by external sources (labeled "verbal reactions" in the original theory) and comments made by journalistic entities as Speech, and label Speech sentences with content types as well. Second, we added a new category, Anecdotal Event (D2), to distinguish unverifiable anecdotal facts from other contents; anecdotal facts are quite prevalent in the print media. Third, we do not distinguish news lead sentences that summarize the main story from other Main Event (M1) sentences, considering that lead sentences pertain to the main event and major subjects of a news article.

Dataset Creation and Statistics
The NewsDiscourse corpus consists of 802 openly accessible news articles containing 18,155 sentences³, each annotated with one of the eight content types or N/A (sentences that do not contribute to the discourse structure, such as photo captions and text links for images) as well as a Speech label. The documents span the domains of business, crime, disaster and politics from three major news sources that report global news and are widely used: NYT (USA), Reuters (Europe) and Xinhua (China). We include 300 articles each (75 per domain) from Reuters and Xinhua that were collected by crawling the web and cover news events between 2018-'19. NYT documents are taken from existing corpora, including 102 documents from KBP 2015⁴ (Ellis et al., 2015) and 100 documents (25 per domain) from the annotated NYT corpus (Evan, 2008).

We trained two annotators for multiple iterations before we started the official annotations. In the beginning, each annotator completed 100 common documents (eight from each of the domains and sources, and four from the KBP corpus) to measure inter-annotator agreement. The two annotators achieved Cohen's κ scores (Cohen, 1968) of 0.69144, 0.72389 and 0.87525 for the eight fine-grained types, the three coarse-grained types and the Speech labels respectively. Then, the remaining documents from each domain and news source were split evenly between the two annotators.

Detailed distributions of the created corpus, including distributions of different content types across domains and media sources, are reported in Tables 2 and 3 respectively. We find that distributions of content types vary depending on both domain and media source. For instance, disaster documents report more consequences (M2) and anecdotal events (D2), crime documents contain more previous events (C1) and historical events (D1), while politics documents have the most opinionated contents (sentences in categories D3 and D4), immediately followed by business documents. Furthermore, among the sources, NYT articles are the most opinionated and describe historical events most often, followed by Reuters. In contrast, Xinhua articles have relatively more sentences describing the main event.

Speech labels and content type labels are annotated separately, so each sentence has both a content type label and a binary Speech label.

³ Note that only sentences within the body of the news article are considered for annotation; headlines are treated as independent content. We used NLTK (Bird et al., 2009) to identify sentence boundaries in the body text. Occasionally, one sentence is wrongly split into multiple sentences; the annotators were instructed to assign them the same label.
⁴ KBP documents are not filtered for different domains due to the small size of the corpus.


Discourse Profiling Models

As an initial attempt, we use a hierarchical neural network to derive sentence representations and a document encoding, and model associations between each sentence and the main topic of the document when determining content types for sentences. As shown in Figure 1, it first uses a word-level bi-LSTM layer (Hochreiter and Schmidhuber, 1997) with soft-attention over word representations to generate intermediate sentence representations, which are further enriched with context information using another sentence-level bi-LSTM. The enriched sentence representations are then averaged with their soft-attention weights to generate the document encoding. The final prediction layers model associations between the document encoding and each sentence encoding to predict sentence types.
Context-aware sentence encoding: Let a document be a sequence of sentences {s_1, s_2, ..., s_n}, which in turn are sequences of words {(w_11, w_12, ...), ..., (w_n1, w_n2, ...)}. We first transform the sequence of words in each sentence into contextualized word representations using ELMo (Peters et al., 2018), followed by a word-level biLSTM layer, to obtain their hidden state representations H_s. Then, we take weighted sums of the hidden representations using soft-attention scores to obtain intermediate sentence encodings (S_i) that are uninformed of the contextual information. Therefore, we apply another sentence-level biLSTM over the sequence of sentence encodings to model interactions among sentences and smooth the context flow from the headline to the last sentence of a document. The hidden states (H_t) of the sentence-level bi-LSTM are used as sentence encodings. Document Encoding: We generate a reference document encoding as a weighted sum over sentence encodings using their soft-attention weights.
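The soft-attention pooling used for both the sentence and document encodings can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the two-layer scoring network and the toy dimensions are placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w1, b1, w2):
    """Pool a sequence of hidden states H (n x d) into one vector.

    Scores each state with a small two-layer feed-forward network
    (tanh hidden layer), normalizes the scores with softmax, and
    returns the attention-weighted sum of the states.
    """
    scores = np.tanh(H @ w1 + b1) @ w2      # (n,) unnormalized scores
    alpha = softmax(scores)                 # attention weights, sum to 1
    return alpha @ H                        # (d,) pooled encoding

# Toy usage: pool 4 hidden states of dimension 6.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))
w1, b1, w2 = rng.normal(size=(6, 3)), np.zeros(3), rng.normal(size=3)
pooled = attention_pool(H, w1, b1, w2)
```

The same pooling operator is applied twice in the architecture: over word states to form sentence encodings, and over sentence states to form the document encoding.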
Modeling associations with the main topic: Sentence types are interpreted with respect to the main event. However, while the sentence-level biLSTM augments sentence representations with local context, they may still be unaware of the main topic. Therefore, we compute element-wise products and differences between the document encoding and a sentence encoding to measure their correlations, and further concatenate the products and differences with the sentence encoding to obtain the final sentence representation that is used for predicting its sentence type. Predicting Sentence Types: First, we use a two-layer feed-forward neural network as a regular classifier to make local decisions for each sentence based on the final sentence representations. In addition, news articles are known to follow the inverted pyramid (Bell, 1998) or other commonly used styles in which the output labels are not independent. Therefore, we also use a linear-chain CRF (Lafferty et al., 2001) layer on the output scores of the local classifier to model dependencies among discourse labels.

Table 4: Performance of different systems on the fine-grained discourse content type classification task. All results are averages of 10 training runs with random seeds. In addition, we report standard deviations for both macro and micro F1 scores.
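The association features amount to a simple concatenation; with the 1024-dimensional sentence and document encodings used later in the implementation details, this yields the 3072-dimensional input of the final classifier. A sketch with placeholder values, not the paper's code:

```python
import numpy as np

def final_sentence_representation(sent, doc):
    """Concatenate a sentence encoding with its element-wise
    product and difference against the document encoding."""
    return np.concatenate([sent, sent * doc, sent - doc])

sent = np.ones(1024)       # placeholder sentence encoding
doc = np.full(1024, 0.5)   # placeholder document encoding
rep = final_sentence_representation(sent, doc)
# rep has dimension 3 * 1024 = 3072, matching the classifier input size.
```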

Evaluation
We split the corpus into training, development and test sets.

Baselines

Basic Classifier uses only the word-level bi-LSTM with soft-attention to learn sentence representations, followed by the local feed-forward neural network classifier to make content type predictions.

Proposed Document-level Models
Document LSTM adds the sentence-level BiLSTM over sentence representations obtained from the word-level BiLSTM to enrich sentence representations with local contextual information. +Document Encoding uses the document encoding for modeling associations with the main topic and obtains the final sentence representations as described previously. +Headline replaces the document encoding with the headline sentence encoding generated from the word-level biLSTM; the headline is known to be a strong predictor of the main event (Choubey et al., 2018). CRF Fine-grained and CRF Coarse-grained add a CRF layer that models dependencies among fine-grained (eight content types) and coarse-grained (main vs. context-informing vs. supportive contents) content types respectively when predicting sentence content types.
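The CRF variants decode a label sequence jointly rather than per sentence; Viterbi decoding over local scores plus a label-transition matrix can be sketched as below. This is a generic linear-chain illustration, not the paper's implementation, and the scores are toy values:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the best label sequence for a linear-chain CRF.

    emissions: (n_steps, n_labels) local scores from the classifier.
    transitions: (n_labels, n_labels) score of moving label i -> j.
    Returns the highest-scoring label sequence.
    """
    n, k = emissions.shape
    score = emissions[0].copy()            # best score ending in each label
    back = np.zeros((n, k), dtype=int)     # backpointers
    for t in range(1, n):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example with 3 sentences and 2 labels; strong self-transitions
# encourage label continuity, as in inverted-pyramid-style documents.
emissions = np.array([[2.0, 0.0], [0.1, 0.0], [0.0, 1.5]])
transitions = np.array([[1.0, -1.0], [-1.0, 1.0]])
path = viterbi(emissions, transitions)
```

In this toy instance the transition bonus outweighs the final emission, so decoding keeps label 0 throughout, illustrating how a CRF can smooth locally noisy predictions.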

Implementation Details
We set the hidden state dimension to 512 for both the word-level and sentence-level biLSTMs in all our models. Similarly, we use two-layer feed-forward networks with 1024-512-1 units to calculate attention weights for both biLSTMs. The final classifier uses a two-layer feed-forward network with 3072-1024-9 units for predicting sentence types. All models are trained using the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 5e-5. For regularization, we use dropout (Srivastava et al., 2014) of 0.5 on the output activations. To alleviate the influence of randomness in neural model training and obtain stable experimental results, we run each neural model ten times with random seeds and report the average performance.

Results and Analysis
Tables 4 and 5 show the results of our experiments on the content type and speech label classification tasks. We see that a simple word-level biLSTM-based basic classifier outperforms the feature-based SVM classifier (Yarlott et al., 2018) by 10.5% and 11.8% on macro and micro F1 scores respectively for content type classification. Adding a sentence-level BiLSTM helps in modeling the contextual continuum and improves performance by an additional 4.4% macro and 2.7% micro F1. Also, as content types are interpreted with respect to the main event, modeling associations between a sentence representation and the main topic representation, using the headline or document embedding, improves the averaged macro F1 score by 0.6% and 1.2% respectively. Empirically, the model using the document embedding performs better than the one using the headline embedding by 0.6%, suggesting that headlines are often skewed toward recency, which is quite prevalent in news reporting.
We further aim to improve performance by using CRF models to capture interdependencies among different content types; however, CRF models using both fine-grained and coarse-grained label transitions could not exceed the simple classifier model. The inferior performance of the CRF models can be explained by variations in news content organization structures (such as inverted pyramid, narrative, etc.), further implying the need to model those variations separately in future work.
Similarly, for the speech label classification task, the word-level biLSTM model achieves a 12.2% higher F1 score than the feature-based SVM classifier, which is further improved by 1.0% with the sentence-level biLSTM.

Application to Event Coreference Resolution

We envision that news discourse profiling can be useful to many discourse-level NLP tasks and applications. As an example, we investigate uses of news structures for event coreference resolution by analyzing the 102 documents from the KBP 2015 corpus included in our NewsDiscourse corpus. We analyze the lifespan and spread of event coreference chains over different content types. First, Table 7 shows the percentage of events that are singletons out of all the events that appear in sentences of each content type. We can see that, in contrast to main event sentences (M1), other types of sentences are more likely to contain singleton events.
We further analyze characteristics of non-singleton events to identify the positions of their coreferential mentions and the spread of coreference chains in a document. Motivated by van Dijk's theory, we hypothesize that the main events appear in each type of sentence, but that the likelihood of seeing the main events in a sentence may vary depending on the sentence type. We consider events that appear in the news headline to approximate the main events of a news article. As shown in Table 8, around 58%⁵ of main event sentences (M1) contain at least one headline event; in addition, context-informing sentences (C1+C2), especially sentences discussing recent precursor events (C1), are more likely to mention headline events as well.

Table 9: Percentages of intra-type events out of non-singleton events in sentences of each content type.
Other than the main events, we observe that many events have all of their coreferential mentions appear within sentences of the same content type; we call such events intra-type events. In other words, an intra-type event chain that starts from a sentence of any type dies out within sentences of the same content type. Table 9 shows the percentage of intra-type event chains out of all the event chains that begin in a certain type of sentence. We can see that non-main contents (e.g., content types C2-D3) are more likely to be self-contained from introducing an event to finishing describing it. In particular, historical (D1) and anecdotal (D2) contents exhibit an even stronger tendency toward intra-type event repetitions compared to the other non-main content types.
Incorporating Content Structure for Event Coreference Resolution: We incorporate news functional structures for event coreference resolution by following the above analysis and implementing content-structure-informed constraints in an Integer Linear Programming (ILP) inference system to better identify singleton mentions, main event mentions and intra-type event mentions.

⁵ While all main event sentences are expected to mention some main event, we use headline events to approximate main events, and headline events do not cover all the main events of a news article. As shown in our previous work (Choubey et al., 2018), identifying main events is a challenging task in its own right, and main events do not always occur in the headline of a news article. In addition, event annotations in the KBP corpora only consider a limited set of event types (seven types, specifically); therefore, if main events do not belong to those types, they are not annotated as events, which also contributes to the imperfect percentage of main event sentences containing a headline event.
We use the Document LSTM +Document Encoding classifier to predict sentence content types. In addition, we built a discourse-aware event singleton classifier, which resembles the sentence type classifier, to identify singleton event mentions in a document. Specifically, the singleton classifier combines the document and sentence representations provided by the content type classifier with contextualized event word representations obtained from a separate word-level biLSTM layer with 512 hidden units. Then, the singleton classifier applies a two-layer feed-forward neural network with 3072-512-2 units to identify event singletons.
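The singleton classifier's feed-forward head can be sketched as follows. This is a toy NumPy illustration of the 3072-512-2 head over concatenated representations; the 1024-dimensional encodings and random weights are placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def singleton_scores(doc, sent, event, W1, W2):
    """Two-layer feed-forward head over the concatenated document,
    sentence and event-word representations (3072 -> 512 -> 2)."""
    x = np.concatenate([doc, sent, event])   # 3072-dim input
    return relu(x @ W1) @ W2                 # 2 scores: singleton vs. not

doc, sent, event = (rng.normal(size=1024) for _ in range(3))
W1 = rng.normal(size=(3072, 512)) * 0.01
W2 = rng.normal(size=(512, 2)) * 0.01
scores = singleton_scores(doc, sent, event, W1, W2)
```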
We implement ILP constraints based on system-predicted content types of sentences and singleton scores of event mentions. Detailed descriptions of the ILP constraints we implemented, and their equations, are included in the appendix. The ILP formulation was used in our previous work that yields the previous best system for event coreference resolution (Choubey and Huang, 2018), which aims to capture several specific document-level distributional patterns of coreferential event mentions by simply using heuristics. For direct comparisons, we adopt the same experimental settings as in Choubey and Huang (2018), using KBP 2015 documents as the training data and both the KBP 2016 and KBP 2017 corpora for evaluation⁶. We retrained the sentence type classifier using the 102 KBP 2015 documents annotated with content types, using 15 documents as the development set and the rest as the training data. We trained the event singleton classifier using the same train/dev split. In addition, we used the same event mentions and pairwise event coreference scores produced by the same local pairwise classifier as in Choubey and Huang (2018)⁷.
Experimental Results: We compare the content-structure-aware ILP system with a baseline system (the row Local classifier) that performs greedy merging of event mentions using local-classifier-predicted pairwise coreference scores, as well as with the two most recent models for event coreference resolution: the heuristics-based ILP system (Choubey and Huang, 2018) and another recent system (Lu and Ng, 2017). We use the same evaluation method as in Choubey and Huang (2018) and evaluate event coreference resolution results directly, without requiring event mention type match⁸. Table 10 shows the experimental results. Event coreference resolution is a challenging task, as shown by the small margins of performance gains achieved by recent systems. The ILP model constrained by system-predicted content structures (the row +Content Structure) outperforms the pairwise classifier baseline as well as the two most recent systems consistently across all the evaluation metrics on the two benchmark datasets. In particular, our ILP system outperforms the previous state of the art, the heuristics-based ILP system of Choubey and Huang (2018), with average F1 gains of 0.67% and 1.32% on the KBP 2016 and KBP 2017 corpora respectively.

⁶ All the KBP corpora include documents from both discussion forums and news articles. But as the goal of this study is to leverage discourse structures specific to news articles for improving event coreference resolution performance, we only evaluate the ILP system on the news articles in the KBP corpora. This evaluation setting is consistent with our previous work (Choubey and Huang, 2018). For direct comparisons, the results reported for all the systems and baselines are based on news articles in the test datasets as well.
⁷ The classifier can be obtained from https://git.io/JeDw3
The superior performance shows that systematically identified content structures are more effective than heuristics in guiding event linking, and establishes the usefulness of the new discourse profiling task.
To further evaluate the importance of the ILP constraints on singletons, main events and intra-type events, we perform ablation experiments by removing each constraint from the full ILP model. Based on the results in Table 10, all three types of constraints have noticeable impacts on coreference performance, with the singleton and main-event constraints contributing the most.
Intuitively, news content structures can help in identifying other event relations as well, such as temporal and causal relations, and thus in disentangling complete event structures. For instance, events occurring in C1 (Previous Event) sentences are probable causes of the main event, which in turn causes events in M2 (Consequence) sentences (the same rationale applies to temporal order).

Conclusion
We have created the first broad-coverage corpus of news articles annotated with a theoretically grounded functional discourse structure. Our initial experiments using neural models establish the feasibility of this task. We further conducted experiments demonstrating the usefulness of news discourse profiling for event coreference resolution. In the future, we will further improve the performance of news discourse profiling by investigating subgenres of news articles, and extensively explore its use in various other NLP tasks and applications.

A ILP for Event Coreference Resolution
Let λ refer to the set of all event mentions in a document, and let p ij be the score from the local pairwise classifier denoting the likelihood that event mentions i and j are coreferential. We formulate the baseline objective function as minimizing equation 1.
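Equation 1 is not reproduced in this text, so the following is a minimal sketch of one standard pairwise-coreference objective, not necessarily the authors' exact formulation: given classifier scores p_ij, minimize the disagreement cost (1 - p_ij)·x_ij + p_ij·(1 - x_ij) over binary link variables x_ij, subject to transitivity. A real system would call an ILP solver; this toy version simply enumerates all transitive assignments (the function name and variable notation are illustrative, not from the paper).

```python
# Toy pairwise-coreference "ILP": exhaustive search over transitive link
# assignments, minimizing disagreement with classifier scores p_ij.
from itertools import combinations, product

def solve_pairwise_ilp(n, p):
    """n: number of event mentions; p[(i, j)]: coreference score for i < j."""
    pairs = list(combinations(range(n), 2))
    best_x, best_cost = None, float("inf")
    for bits in product([0, 1], repeat=len(pairs)):
        x = dict(zip(pairs, bits))
        # Transitivity: for every triple, linking two pairs forces the third.
        if not all(x[(i, j)] + x[(j, k)] - 1 <= x[(i, k)]
                   and x[(i, j)] + x[(i, k)] - 1 <= x[(j, k)]
                   and x[(i, k)] + x[(j, k)] - 1 <= x[(i, j)]
                   for i, j, k in combinations(range(n), 3)):
            continue
        # Cost: pay (1 - p) for each link made, p for each link skipped.
        cost = sum((1 - p[ij]) * x[ij] + p[ij] * (1 - x[ij]) for ij in pairs)
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost
```

For example, with scores p = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.1}, the optimum links mentions 0 and 1 and leaves mention 2 as a singleton.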
We then add constituent objective functions (equation 2) and new constraints to the baseline objective to incorporate the document-level content structure: repetitions of headline events in the main content (Θ M ) as well as in consequence, previous event and current context sentences (Θ C ); intra-type coreference chains in non-main contents (Θ L ); and exclusion of singletons from event coreference chains (Θ S ), while encouraging non-singletons to have more coreferential mentions (Θ N ).
The weighting parameters for all the constituent objective functions were obtained through grid search. We first preset all the values to 0.5 and then searched each parameter over multiples of 0.5 in the range from 0.5 to 5. We found that the best performance was obtained for K M =3.0, K C =1.0, K S =2.5 and K N =0.5. Also, the best values for K L are 0.5 for content types M2-C1 and 1.0 for content types C2-D8.
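The search described above can be sketched as a coordinate-wise grid search: each weight starts at 0.5 and is tuned over multiples of 0.5 in [0.5, 5.0], one parameter at a time, keeping the value that scores best. This is an illustrative reading of the procedure, not the authors' code; `evaluate` is a hypothetical stand-in for running the ILP on a development set and returning its score.

```python
# Coordinate-wise grid search over ILP weighting parameters.
def coordinate_grid_search(param_names, evaluate, lo=0.5, hi=5.0, step=0.5):
    # All parameters are preset to 0.5.
    params = {name: 0.5 for name in param_names}
    grid = [lo + k * step for k in range(int(round((hi - lo) / step)) + 1)]
    for name in param_names:
        # Score every candidate value for this parameter, others held fixed.
        scored = []
        for v in grid:
            trial = dict(params, **{name: v})
            scored.append((evaluate(trial), v))
        # Keep the best-scoring value before moving to the next parameter.
        params[name] = max(scored)[1]
    return params
```

With a well-behaved (e.g. separable) development-set score, this recovers one best value per weight, analogous to the reported K M =3.0, K C =1.0, K S =2.5, K N =0.5.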

A.1 Infusing Singletons Score in the ILP Formulation
Intuitively, coreferential event mentions and singletons are exclusive to each other. However, enforcing such mutual exclusion would be extremely unstable when both the system-predicted singletons and the event coreference scores are imperfect. Therefore, we simply discourage singletons from being included in any coreference chain and encourage non-singletons to form more coreferential links by adding two constituent objective functions Θ S and Θ N (equation 3), where S and N are the predicted singletons and non-singletons from the content-structure-aware singleton classifier. The relaxed implementation based on Θ S and Θ N allows violations for predicted singletons when their pairwise coreference scores with other event mentions are high.
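Equation 3 is not shown in this text; the following is one plausible form consistent with the description, using x_{ij} as a hypothetical binary variable (not the paper's notation) indicating that mentions i and j are linked:

```latex
% Plausible reconstruction of equation 3 (notation x_{ij} is illustrative):
\Theta_S \;=\; K_S \sum_{i \in S} \sum_{j \in \lambda} x_{ij},
\qquad
\Theta_N \;=\; -K_N \sum_{i \in N} \sum_{j \in \lambda} x_{ij}
```

Since the global objective is minimized, Θ S adds a penalty for every link involving a predicted singleton, while Θ N rewards links involving predicted non-singletons; a predicted singleton can still be linked when its pairwise score outweighs the K S penalty.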

A.2 Incorporating Content Types in the ILP Formulation
As evident from the analysis, the main, consequence, previous event and current context content types favor event mentions that are coreferential with the headline event. Furthermore, if an event chain starts in one of the C1-D4 content types, it tends to have coreferential event mentions within the same content type or sometimes in the main content. We model the above correlations between main and non-main content types and event coreference chains through their respective objective functions and constraints.
Main Events: for event pairs with the first event mention from the headline and the second one from a main content sentence, we define a simple objective function (equation 4) that adds the negative sum of their indicator variables to the main objective function.
Here, ξ H and ξ M denote the sets of event mentions in the headline and in main content sentences respectively. By minimizing Θ M in the global objective function, our model encourages coreferential mentions between the headline and main content sentences. Similarly, we define Θ C , which encourages coreferential mentions between the headline and sentences of the consequence, previous event and current context content types (equation 5).
Here, ξ R denotes the set of event mentions in sentences of the consequence, previous event or current context content types.
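Equations 4 and 5 are not reproduced in this text; a plausible reconstruction from the prose ("the negative sum of their indicator variables"), again with x_{ij} as a hypothetical link indicator, is:

```latex
% Plausible reconstruction of equations 4 and 5 (notation x_{ij} is illustrative):
\Theta_M \;=\; -K_M \sum_{i \in \xi_H} \sum_{j \in \xi_M} x_{ij},
\qquad
\Theta_C \;=\; -K_C \sum_{i \in \xi_H} \sum_{j \in \xi_R} x_{ij}
```

Each headline-to-main (or headline-to-C1/C2/M2) link then lowers the minimized objective, so such links are encouraged in proportion to K M and K C .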
Intra-type Events: for each non-main content type T , we define the objective function Θ L and a corresponding constraint (equation 6) to penalize event chains that start in a sentence of that non-main content type but include event mentions from sentences of other non-main types.
First, we define an ILP variable Y i for each event i in ξ T , where ξ T represents the events in a non-main content type T ∈ C1-D4, and add it to the objective function Θ L . Then, through the constraint in equation 6, we set the value of Y i to Γ i when γ i is 0. Γ i equals the number of subsequent coreferential event mentions of event i in sentences of other non-main types; γ i equals the number of antecedent coreferential event mentions of event i in sentences of main or other non-main types. By minimizing Y i in Θ L , we discourage an event chain that starts in a C1-D4 content-type sentence from forming coreferential links with subsequent event mentions in other non-main types.
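Equation 6 is likewise not shown here; one plausible big-M encoding consistent with the description (illustrative only, with x_{ij} a hypothetical link indicator and M a sufficiently large constant) is:

```latex
% Plausible reconstruction of Theta_L and the constraint in equation 6:
\Theta_L \;=\; \sum_{T \in \text{C1-D4}} K_L^{T} \sum_{i \in \xi_T} Y_i,
\qquad
Y_i \;\ge\; \Gamma_i - M\,\gamma_i, \quad Y_i \;\ge\; 0,
```

where Γ i and γ i are linear expressions in the link variables counting the subsequent and antecedent coreferential mentions of i described above. When γ i = 0 (the chain starts at i), the constraint forces Y i up to Γ i , so each cross-type link from i is penalized; when γ i ≥ 1, the big-M term deactivates the constraint and Y i can stay at 0.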