Query-focused Scenario Construction

The news coverage of events often contains not one but multiple incompatible accounts of what happened. We develop a query-based system that extracts compatible sets of events (scenarios) from such data, formulated as one-class clustering. Our system incrementally evaluates each event’s compatibility with already selected events, taking order into account. We use synthetic data consisting of article mixtures for scalable training and evaluate our model on a new human-curated dataset of scenarios about real-world news topics. Stronger neural network models and harder synthetic training settings are both important to achieve high performance, and our final scenario construction system substantially outperforms baselines based on prior work.


Introduction
While a situation is developing, news reports often contain multiple contradictory stories (scenarios) of what happened, and it is hard to piece together the individual scenarios. For example, surrounding the disappearance of the Saudi journalist Jamal Khashoggi, there were initially multiple conflicting accounts of what happened. One states that he was the victim of a murder scheme; an alternative suggests that he walked out of the consulate alive. The task of identifying these individual scenarios is also being considered in the Active Interpretation of Disparate Alternatives (AIDA) program, 1 and in a recent Text Analysis Conference (TAC). 2 We frame the task as query-based scenario discovery: given a topic (e.g., the disappearance of Jamal Khashoggi) and a query (e.g. Jamal Khashoggi was murdered), we want to retrieve Query: Jamal Khashoggi was murdered. Jamal Khashoggi entered the consulate of Saudi Arabia consulate in Istambul. He exited the Saudi consulate after a few minutes. The team wanted to arrest Khashoggi but botched it. He never exited the Saudi consulate but died there. Khashoggi, according to the reporter, was seen on a flight leaving Turkey for Estonia. A team flew from Saudi Arabia to Turkey prior to Khashoggi's appointment at the consulate specifically to intercept him. The team was sent by the Saudi crown prince with the order to murder Khashoggi. Jamal A. Khashoggi works for The Washington Post, and is the editor-in-chief of Al-Arab News. Figure 1: An example for query-based scenario construction. Given the query, we want to select eventdenoting sentences from a document mixture to build a target scenario with a sequence of compatible events. The mixture also contains sentences which may be irrelevant or part of an alternative scenario. a scenario, a set of compatible events, from the given reports. We formulate query-based scenario discovery as one-class clustering (Bekkerman and Crammer, 2008). We specifically focus on discovering a scenario of compatible events (Barzilay and Lapata, 2008;Jurafsky, 2008, 2009;Mostafazadeh et al., 2017) in a collection of related and unrelated event-denoting sentences, which may contain conflicting and irrelevant information. We start with a query (see Figure 1) and then iteratively insert sentences into the "scenario-in-construction". Sentences are chosen based on overall compatibility as well as the ease with which scenario sentences can be arranged into an order. We additionally use an adapted relation network (Santoro et al., 2017) to assess connections between words.
For our evaluation, we collect a human-curated set of competing scenarios for real-world news topics. As collecting such data is costly, we follow past work in training our model on synthetic data consisting of document mixtures  and compare our models directly to theirs. We show that training on such synthetic data yields a model that can substantially outper-  Table 1) of generating human evaluation data Human100: given the topic Why did MH-17 crash? and the scenario MH-17 had a bomb on board. The annotator searches the web and finds the webpages above. From these pages, she first selects 4 sentences which are relevant, then sorts them so that they make a consistent scenario that could be read from start to finish. form lexical baselines and the strong neural model proposed in Wang et al.'s seminal work.
Our contribution is three-fold: (1) A querybased scenario construction task, for which we introduce a model to iteratively build a scenario with compatible events, exploiting ordering. (2) A human-curated evaluation set consisting of multiple accounts of real-world new events, along with a collection of scalably-built synthetic simulation datasets, which we show serve as an effective source of training data. (3) Comprehensive experiments and analysis that cast light on the properties of the task and data, as well as on the challenges.

Background
Our work traces its roots to research in script (Schank and Abelson, 1977;Mooney and DeJong, 1985) and narrative schema learning (Chambers and Jurafsky, 2008;Pichotta and Mooney, 2016). Early work explored tasks such as script modeling (Mooney and DeJong, 1985). Recent work built on the idea that compatibility of events can be learned from corpus data, evaluated on narrative cloze (Chambers and Jurafsky, 2009) and predicting-next-events (Pichotta and Mooney, 2016;Mostafazadeh et al., 2017).
We introduce a task with a more practical objective in mind: given a query or an information cue, extract the rest of the pieces to build a compatible scenario. The task is related to conversation disentanglement of multiple entangled conversations in a dialogue transcript Charniak, 2008, 2011;Jiang et al., 2018;Kummerfeld et al., 2019), and more closely to narrative clustering , i.e. identifying all the scenarios in an information source by grouping relevant sentences/events. Unlike , we do not attempt to identify all the scenarios in the source, but are guided by one particular user's information need (e.g. the scenario about Khashoggi's murder, as opposed to all the theories regarding his disappearance, like in ). Further, we do not assume the number of scenarios is known a priori (as

do).
We phrase query-based scenario construction as one-class clustering. One-class models assume the availability of data for only one (positive) class, rather than positive and negative data. In particular, one-class clustering assumes that the data consists of few positive cases among a large group of outliers (Bekkerman and Crammer, 2008;Banu  (1) Doomed flight MH17 was downed over eastern Ukraine by on-board explosives planted by the CIA and detonated via a signal sent from a satellite in space, according to a bizarre new Russian claim. (2) Russian security expert Sergei Sokolov head of a shadowy organisation called Analytics and Security which boasts close ties to the secret services in Moscow claimed today that the plane was downed in an operation called 17.17 by the CIA and other Western intelligence agencies. (3) Today audio was released by Life News in an article in which experts claim MH17 was in fact downed not by a jet or Buk missile, but by a bomb on board the plane. (4) The audio in the article is claimed to be a conversation between a Ukrainian SU-27 pilot near MH17 and his controller, and includes a claim by the pilot the explosion that downed MH17 was possibly on board.
(1) The shoot-down occurred in the War in Donbass, during the Battle of Shakhtarsk, in an area controlled by pro-Russian rebels.
(2) The responsibility for investigation was delegated to the Dutch Safety Board (DSB) and the Dutch-led joint investigation team (JIT), who concluded that the airliner was downed by a Buk surface-to-air missile launched from pro-Russian separatist-controlled territory in Ukraine. (3) According to the JIT, the Buk that was used originated from the 53rd Anti-Aircraft Missile Brigade of the Russian Federation, and had been transported from Russia on the day of the crash, fired from a field in a rebel-controlled area, and the launcher returned to Russia after it was used to shoot down MH17. (4) Previously, the investigative website Bellingcat has pointed to involvement of the same brigade using open-source information. Table 1: An example from the crowdsourced Human100 dataset. Phase 1 (in bold): an MTurk worker writes (by web searching and editing) a topic question and two scenarios that answers the question. Phase 2: an annotator elaborates on the two scenarios (also through search-and-edit) with a compatible scenario. and Karthikeyan, 2014). Finally, our task is superficially similar to query-based summarization Baumel et al., 2018) but has a different goal: we want to distinguish potentially conflicting narrative scenarios rather than conduct single-topic information compression. We also distinguish our work from multi-document summarization (McKeown et al., 2002;, as we are explicitly drawing distinctions among conflictive scenarios rather than summarizing the entire (single) topic.

Data
This section first introduces our human-curated, realistic evaluation data for our objective. Then we describe how we synthesized various types of training data for our model.

Human-curated Data
Realistic data on this task is hard to obtain, as after a time, a single scenario tends to dominate in the news. The Linguistics Data Consortium (LDC) has data for the AIDA project, and the 2018 Text Analysis Conference (TAC) had a hypothesis generation task, but both use a single topic only (the Russia-Ukraine conflict of 2014), with no hypotheses available (TAC) or no hypotheses at sentence level (LDC). As a step in the direction of realistic data for the task, we had human annotators collect news items that have multiple scenarios of what happened around the same topic (Table 1).
We collected data in two phases: in phase 1, we asked workers on Amazon Mechanical Turk (MTurk) to provide (1) a topic in the form of an English wh-question; (2) two scenarios that answer the topic question mutually exclusively. See the bold text in Table 1 for an example. In phase 2 ( Figure 2), a group of non-Turk annotators 3 pieced together English sentences from the web to elaborate on the scenarios from phase 1 ( Table  1, non-bold). Annotators were instructed to build scenarios which can be read fluently. Sentences that could be copy-and-pasted directly from web search results were prioritized. When such sentences were unattainable, annotators were allowed to edit the style of the text to fit the scenario in the tone of a news report. We allowed this relaxation because, as mentioned above, often one scenario is dominant in web search results. For instance, in the Jamal Khashoggi story, English media almost unanimously report that he was a victim of murder, whereas the alternative scenario -he "disappeared and is still alive" is harder to find. A set of 100 mixtures were collected this way. We refer to the dataset as Human100 (stats in Table 2). 4  gauge the difficulty of a mixture by measuring the topic similarity between the target and the distractor(s) using the cosine distance between the average word embeddings. A mixture of documents (or scenarios) will typically be more difficult to separate if the scenarios/documents are more topically similar, since lexical cues are less reliable in this case. Specif- Figure 3: Synthetic data generation illustrated with our running example. First randomly sample news paragraphs (one is randomly assigned as the target scenario, the rest are distractors) and random sentences, then mix the target scenario with (a) other complete distractor scenarios (red) and (b) random sentences.
ically, Wang et al. found that their models struggled to separate scenarios even at 0.6 topical similarity, with accuracy on a binary clustering task dropping from 85% (all) to 68% (the hard ones). By this criterion Human100, at 0.8 average topic similarity cosine, is a hard dataset that challenges NLP models for their abilities to perform beyond shallow textual inference. Human-level performance is nonetheless fairly strong: 0.81 with distractor scenarios, and 0.97 with randomly sampled sentences (both are F1 scores, more details in Section 5).

Training with Proxy Synthetic Data
The idea we follow for synthetic data creation is the same as in : we can use different articles as proxies for different scenarios, even though articles in the synthetic setting do not necessarily belong to the same topic. Our hypothesis is that a model trained to predict which sentences came from the same news article will also do better at predicting which sentences come from the same scenario in the human-curated data.
We synthesize data from two source corpora: the New York Times portion of English Giga-  word (Graff et al., 2003) (NYT), which  used to construct their document mixtures, and ROCStories (Mostafazadeh et al., 2017) (ROC). NYT is chosen for having the same domain as the human-constructed news data. 5 With ROC we want to gauge the generality of our approach out-of-domain: on news-only data a model could "cheat" by memorizing common named entities. We want to see to what extent models go beyond that to learn general event compatibility. The synthesis method is summarized in Figure  3, with corpus statistics in Table 2. In the first condition we mix a randomly sampled target scenario with a distractor scenario (also randomly sampled), following . The mixtures are denoted NYT/ROC-w18. For the second condition, we replace the distractor scenario with unconnected randomly sampled sentences (corpuswide), hence NYT/ROC-rand. We also combine both conditions, giving a mixture with both types of distractors. We also vary the number of distractor scenarios in a mixture (2, 3, or, 4, including the target scenario). To equalize the number of sentences in mixtures, we pad them all to a fixed number of sentences. We call these NYT/ROC-2/3/4.

Models
Given a query q and a mixture of sentences, we want to select sentences that form a compatible scenario with the query. Our models select the sentences iteratively: the process begins with a target set T (1) = {q} (i.e. initialized with only the query in the set) and a candidate set C (1) (i.e. the mixture), and terminates at some time step i with a predicted scenario T (i) .
We experiment with two termination conditions: (1) fixed #sentences: a pre-specified number of sentences are extracted; (2) dynamic #sentences: a special end-of-scenario token <end> is predicted as the next candidate. (1) simulates the case where the user desires to specify the amount of information to be extracted (i.e. a consistent yet not necessarily all-inclusive scenario), and (2) the case where the model finds a complete scenario. 6 Notation We describe one step of candidate selection without loss of generality, thus whenever no confusion arises, we drop the timestep superscripts to use T , C for simplification. t ∈ T denotes a sentence in the target scenario, and c ∈ C a candidate. We use bold lower case letters for embeddings. The acronyms for the models are introduced at the beginning of model description (e.g. COMP for compatibility-attention).

Architectures
Compatibility-Attention (COMP) In scenario building, intuitively we select a candidate c j that fits best with T such that the updated target scenario is T ∪ {c j }, the most compatible scenarioin-construction possible. The predictionĉ j is then: For example, take the example in Figure 1, He never exited the consulate but died there is a good candidate, as it clearly relates to the scenario that Khashoggi was murdered, compatible with the current scenario-in-construction T (1) = {Jamal Khashoggi was murdered}. Now note for T (i) , where i > 1 (i.e. multiple sentences in the scenario-in-construction), its member sentences do not contribute equally to the decision on aĉ j . For example, say T (2) = {Jamal Khashoggi was murdered; A team flew from Saudi Arabia ... to intercept him}, the first sentence is more informative for us to pick out He never exited the consulate but died there as a good candidate. We implement this with a bi-linear attention layer with parameters U : i.e. c j attends to the sentences {t k } in the current scenario, which computes a vector that is scored to compute the weight score of c j under COMP.

Insertion-sort based selection (COMP-INS) If
T is an ordered scenario, it is possible to further improve the quality of the candidate selection by selecting a c j that is easy-to-insert into T . For instance, for the example in Figure 1, the most readable update can be made by inserting He never exited the consulate but died there to the right of Jamal Khashoggi was murdered. Essentially, we imagine an insertion-sort based candidate selection technique: we iteratively pick out the easiest candidate to insert and maintain T as ordered. Crucially, note that we want to use ordering to aid clustering, rather than aiming for ordered scenarios: the model performance is only evaluated on clustering the correct set of sentences.
Let m k be the insertion point to the left of the sentence t k . For T we have insertion points {m 1 , · · · , m |T |+1 }. For each insertion point, candidate tuple m k , c j , we want to compute a weight z k←j to indicate the "easy-to-insert-ness" of c j to insertion point m k . First we embed m k and c j : c j is embedded with a BiLSTM, and m k is computed from the embedded sentences in T : Finally, applying a bilinear function: i.e. the easy-to-insert-ness of c j is indicated with its highest z score over all the available insertion points: the larger the largest z k←j is, the clearer the model is about where to insert c j .

Relation Networks (COMP-INS-RN)
Oftentimes word tokens in T and C are also indicative of which candidate is the best. E.g. in Figure  1, the candidate He never exited the consulate but died there has the event died which relates to the murder in the query. Similarly the entity he is likely anaphoric to Jamal Khashoggi. The relations make the sentence an ideal candidate. 7 Adapting the Relation Network as per San-7 Note that  apply a mutual attention mechanism (Hermann et al., 2015) for the similar modeling purpose, but with many more parameters than our relation net. For practicality we believe the latter is a better option. toro et al. (2017), 8 we summarize the relation between the word tokens in c j and T with a summary score r j , i.e. how strongly the candidate is connected to the scenario-in-construction, based on the relations between its tokens and that of the scenario-in-construction. The process breaks down to three steps: first, we pair c j with each of the sentences t k ∈ T and compute a sentence-sentence relational vector v j,k which is the sum over all the word-word bi-linear contraction (the bi-linear contraction of two word embed- . Then, we average over the sentencesentence relation vectors to obtain a summary vector p j = average k (v j,k ). Finally we compute r j = linear(p j ). We incorporate the relation network with COMP-INS by adding r j to all the z k←j over k. Patterning after the model descriptions above, we get a model COMP-INS-RN(T , c j ):

Optimization
Since our models' supervision comes in the form of whole scenarios, supervising our iterative insertion clustering model is not completely straightforward. In particular, there may be multiple correct sentences that can be added to the current scenario. For example, in Figure 1, all the sentences labeled with (B) are possible candidates. We thus optimize the marginal log-likelihood of making any correct decision. 9 Specifically, we minimize: L = −log j p(c j ), which maximizes the sum of the probabilities of the correct candidates. For Eq. (5, 7), we optimize for i.e. maximizing the probability sum over all correct (insertion point, candidate) pairs. 10 At train time, we treat each timestep of scenario construction as a training example. We use a form of teacher forcing where we assume that a correct partial scenario has been built and we want to make some correct decision from there. The partial scenarios are built by adding random correct sentences, so the model learns to make correct decisions in a range of settings.

Experiments
We use scalably synthesized data for training and reserve the realistic data, which is expensive to produce, for evaluation. To better understand which techniques work, we also conduct evaluation on the synthetic data.

Data Preparation
The statistics of our source corpora are summarized in Table 2. For all the synthetic datasets, we make 100k mixtures for training, 5k for validation and 10k for test, mixed from a 0.85/0.05/0.10 split of the total 113k scenarios. For NYT/ROC-w18, to properly compare with , all mixtures have 2 scenarios. For NYT/ROC-2/3/4, we first mix 2/3/4 randomly sampled scenarios, then pad the resulting mixtures to a fixed lengths. For NYT/ROC-rand, we sample equal #sentences as for NYT/ROC-2/3/4.
(2) AVG, an iterative model that always selects the candidate the embedding of which is the closest (in cosine) to the average embedding of the sentences in the scenario-in-construction. A sentence embedding is the average over each of its word embeddings. (3) PAIRWISE, which adapts 's best model. It predicts a probability for a pair of sentences to indicate how likely they are from the same scenario. PAIRWISE replaces the cosine in AVG with the pairwise model. Implementation All the models are constructed with PyTorch 0.4.1 (Paszke et al., 2017). We use pretrained 1024-dim ELMo embeddings (Peters et al., 2018). The hidden size for the BiLSTM and the relation network are 200. We run 10 epochs with an initial learning rate of 1e-4 (with Adam (Kingma and Ba, 2014)).
Evaluation For clustering performance, we use macro-averaged F1, comparing our recovered  Table 3: Intrinsic evaluation: F1 scores (testing) for models trained on different NYT mixtures in fixed and dynamic #sentences conditions. In the dynamic #sentences condition, the baselines no longer apply because they do not model a stopping condition.

Constructing Effective Training Mixtures
Which method for synthetic mixture creation leads to the best results on Human100? We first run our models in the three mixing conditions -scenario distractor only, random sentence distractor only, and both distractors (see Section 3.2), then evaluate both intrinsically and on the human-curated Human100. Here we only use our domain-proxy NYT-* datasets.  Examining the results in Table 3, 12 we observe the hybrid mixtures with both types of distractors are the most difficult, with substantially lower performance. But how does this translate into human evaluation? In Table 4, we evaluate the best-performing model (COMP-INS-RN, trained on NYT-* sets) on Human100. We find that harder training conditions (i.e., the hybrid mixing) give stronger results on Human100. Our initial conclusion is: the more challenging hybrid mixing serves better as a training proxy to the realistic data. We also see in both Tables 3 and 4 that the dynamic #sent setting, where the model needs to decide when to stop adding events, is considerably more difficult throughout.

Do Our Modules All Contribute?
We additionally evaluate our proposed modulesinsertion-sort based selection and relation nets -to see which contributes substantially in the intrinsic evaluation (Table 3). The COMP-INS module achieves a gain of 3 points of F1 on average over COMP, and COMP-INS-RN improves 4.5 F1 on average over COMP-INS. In addition we see a clear and large margin of the models over the baselines. To evaluate the modules on Human100, we use the models trained on the best hybrid mixtures (Section 5.1). The results are summarized in Table 5  Similar to the intrinsic evaluation, both modules improve performance across fixed and dynamic conditions. While in intrinsic evaluation, relation nets are the stronger contributor, insertionsort based selection leads to a higher performance gain on Human100.  While sorting performance in itself is not very high, it has a reasonable correlation with clustering performance (Table 6): following Cui et al. (2018), we use Kendall's τ to compute sorting performance (as correlation of predicted and gold ordering). We then calculate the correlation between sorting performance and model performance, using Spearman's ρ as the most widely used correlation measure in NLP. 13

Do We Learn Compatibility that
Generalizes?
As argued previously (Section 3.1), NYT contains plenty of shallow textual cues, meaning an expressive model can do well at the task doing bag-ofwords clustering of the data rather than more sophisticated event compatibility inference.  The first question is: do the models generalize out-of-domain, particularly when textual cues are much fewer? We train our strongest COMP-INS-RN on NYT-* and evaluate on the corresponding ROC-* datasets (Table 7): in in-domain evaluation (i.e., train and test on ROC) our model produces excellent performance, and in out-of-domain evaluation (i.e., train on NYT test on ROC) it manages to keep up with fairly strong results. This indicates our model captures information beyond surface cues.

Final Model
From the domain-generalization test in Table 7, we see there is likely NYT-* sets do not subsume all the information in ROC-* sets. Exploiting all the data we have available, we combine ROC and NYT in a domain-joint training. The results in Table 8 show that in both fixed and dynamic #sent conditions, the model improves on the performance with single-domain training (Table 4, 5).

Analyzing Human-curated Data
To set up a human-level performance benchmark, we asked two additional annotators (they did not participate in the collection of Human100) to manually perform the same task as the models in fixed #sentences condition on a sample of 30 with randomly chosen query and target scenario. On average the F1 is 0.82 (one worker 0.81, the other 0.83). While even our best model (COMP-INS-RN) is inferior to the human-level performance, it draws quite close.
This however does not tell the whole story: qualitatively comparing the scenarios built by COMP-INS-RN vs. annotators, we observe human annotators tend to construct much more reasonable scenarios even when they include sentences 2720  with modified Human100, i.e. the distractor scenario is now a random sample from NYT rather than one collected by an annotator. Human100 has a topic similarity of over 0.8 but the modified version only 0.45. This demonstrates high topic similarity is a strong contributor to the difficulty of Human100, which is true for both models and humans.
Target scenario: Trump says Russia is the sole party to be blamed.
(1) Trump says that Russia is the sole party to be blamed.
(2) White House issued a statement which says Moscow is violating the Reagan-era agreement.
(3) Secretary of State Mike Pompeo announced the decision to suspend the accord, declaring that countries must be held scenarioable when they break the rules. (4) "We can no longer be restricted by the treaty while Russia shamelessly violates it," Mr. Pompeo said.
Distractor scenario: Russians accused the wrong doing on the part of the Trump administration (1) Russians accuse the wrong doing on the part of the Trump administration.
(2) This is the latest step in the Trump administrations pattern of abandoning the diplomatic tools that have prevented nuclear war for 70 years.
(3) Russia has also complained about the alleged lack of U.S. diplomacy. (4) Russian foreign minister Sergei Lavrov accused the U.S. of being obstinate. U.S. representatives arrived with a prepared position that was based on an ultimatum and centered on a demand for us to destroy this rocket, its launchers and all related equipment under US supervision. Table 9: Humans make more reasonable mistakes (the query is underscored): the annotator selected sentence (2) as a part of the target scenario, which, while not part of the gold, does make a comperent scenario. The model however chose sentence (1), which is in direct contradiction with the target scenario.
are not from the gold scenario (Table 9). This indicates that models for the task could benefit from textual inference capabilities (Cases et al., 2017), or from deeper meaning representations.

Discussion
The results on Human100 are largely in line with those on synthetic datasets, which indicates that results on synthetic data gives a reasonable estimate of results on more realistic data. While the performance on Human100 is lower overall, the findings are encouraging. Further, realistic cases of scenario discovery have the property that different scenarios for the same topic have a high vocabulary overlap. This can be seen in Human100. This property of the data penalizes shallow processing based models while encouraging learning deeper semantics. One caveat about Human100 is that the dataset is still relatively small; a larger dataset would be useful to strengthen quantitative analysis. Also, variable-size scenarios would be more realistic for evaluating the more general case. Finally, we would like to improve the crowdsourcing technique in future work: in phase 1, some collected scenarios are not entirely in conflict with each other, for example two talking points Trump made in his State of the Union Address. An extra step where additional crowdworkers rate the compatibility of scenarios could be useful. In phase 2, we would like to have scenarios which exhibit more nuanced conflicting points that capture a wider range of cues that distinguish different scenarios.

Conclusion
Identifying an individual scenario in a blend of contradictory stories is a task that is targeted by the Active Interpretation of Disparate Alternatives (AIDA) program as well in a recent Text Analysis Conference (TAC). We address this task through query-based scenario construction, and find sizable performance improvements both from taking sentence order into account (INS) and from encoding connections between words (RN). Evaluating on a new human-curated dataset, we find that the synthetic training data serves as a reasonable proxy for the human-curated data.
Our current model sometimes gets misled by superficially similar sentences, and it will be an important future direction to move towards deeper reasoning for the task. In addition, we plan to create larger human-curated datasets with variable size scenarios.