Template Filling with Generative Transformers

Template filling is generally tackled by a pipeline of two separate supervised systems – one for role-filler extraction and another for template/event recognition. Since pipelines consider events in isolation, they can suffer from error propagation. We introduce GTT, a framework based on end-to-end generative transformers for this task. It naturally models the dependence between entities both within a single event and across the multiple events described in a document. Experiments demonstrate that this framework substantially outperforms pipeline-based approaches, as well as other neural end-to-end baselines that do not model between-event dependencies. We further show that our framework specifically improves performance on documents containing multiple events.


Introduction
The classic template-filling task in information extraction involves extracting event-based templates from documents (Grishman and Sundheim, 1996; Jurafsky and Martin, 2009; Grishman, 2019). It is usually tackled by a pipeline of two separate systems: one for role-filler entity extraction, which extracts event-relevant entities (e.g., noun phrases) from the document, and another for template/event recognition, which assigns each candidate role-filler to the event(s)/template(s) it participates in and identifies the type of each event/template.
Simplifications of the task (Patwardhan and Riloff, 2009; Riloff, 2011, 2012) assume that there is one generic template and focus only on role-filler entity extraction. However, real documents often describe multiple events (Figure 1). From the example, we can observe that between-event dependencies are important (e.g., a single organization can participate in multiple events) and can span the entire document (e.g., event-specific targets can be distant from their corresponding event descriptions).

Figure 1: The template-filling task, illustrated on a document beginning "Several attacks were carried out in La Paz last night, one in front of government house ... The self-styled 'Zarate armed forces' sent simultaneous written messages to the media, calling on the people to oppose ...". Role-filler entity extraction is shown on the left, and template recognition is shown on the right. Our system performs both of these document-level tasks with a single end-to-end model.
To naturally model between-event dependencies across a document for template filling, we propose a framework called "GTT" based on generative transformers (Figure 2). To the best of our knowledge, this is the first attempt to build an end-to-end learning framework for this task. We build our framework upon GRIT, which tackles role-filler entity extraction (REE) but not template/event recognition. GRIT performs REE by "generating" a sequence of role-filler entities, one role at a time in a prescribed order. For the template-filling setting, we first extend the GRIT approach to include tokens representing event types (e.g., "attack", "bombing") as part of the input sequence. We further modify the decoder to attend to the event type tokens, allowing it to distinguish among events and associate an event type with each role-filler entity that it generates. We evaluate our model on the MUC-4 (1992) template-filling task. Empirically, our model substantially outperforms both pipeline-based and end-to-end baseline models. In our analysis, we demonstrate that our model is better at capturing between-event dependencies, which are critical for documents that describe multiple events. Code and evaluation scripts for the project are open-sourced at https://github.com/xinyadu/gtt.

Task Definition: Template Filling
Assume we are given a set of m event types (T_1, ..., T_m). Each event template contains a set of k roles (r_1, ..., r_k). For a document consisting of n words x_1, x_2, ..., x_n, the system is required to extract d templates, where d ≥ 0 (d is not given as input). Each template consists of k + 1 slots: the first slot represents the event type (one of T_1, ..., T_m); each of the remaining k slots corresponds to an event role (one of r_1, ..., r_k) and is filled with the entities playing that role, or with null if the role has no filler.
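As a concrete illustration, the task's output can be represented with a small data structure. This is a sketch only; the class, field, and variable names below are ours, not from the paper (the MUC-4 event types are listed later in the paper; the role names are a plausible subset).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The m = 6 MUC-4 event types (named later in the paper).
EVENT_TYPES = ["kidnapping", "attack", "bombing", "robbery", "arson",
               "forced work stoppage"]
# Hypothetical role inventory for illustration (the paper's k roles).
ROLES = ["PerpInd", "PerpOrg", "Target", "Victim", "Weapon"]

@dataclass
class Template:
    """One extracted template: an event type plus k role slots."""
    event_type: str                                    # one of EVENT_TYPES
    role_fillers: Dict[str, List[str]] = field(default_factory=dict)

    def filler(self, role: str) -> Optional[List[str]]:
        """Entities filling `role`, or None if the slot is null."""
        return self.role_fillers.get(role)

# A document maps to d >= 0 templates (d is not known in advance).
doc_templates: List[Template] = [
    Template("attack", {"PerpOrg": ["Zarate armed forces"]}),
    Template("bombing", {"PerpOrg": ["Zarate armed forces"]}),
]
```

Note that the same entity may fill roles in several templates, which is exactly the between-event dependency the model must capture.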

Methodology
Our framework is illustrated in Figure 2. First, we transform the template-filling task into a sequence generation problem. Then, we train the base model on the source-target sequence pairs and apply it to generate output sequences; finally, each generated sequence is transformed back into structured templates.

Template Filling as Sequence Generation
We first transform the task's input and output data into specialized source and target sequence pair encodings. As shown in Figure 2, the source sequence consists of the words of the document (x_1, x_2, ..., x_n) prepended with the set of tokens representing all event/template types (T_1, ..., T_m), as well as a separator token denoting the boundary between event templates ([SEP_T]). We also add a classification token ([CLS]) at the beginning and a separator token ([SEP]) at the end of the source sequence.
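A minimal sketch of the source encoding, assuming the layout [CLS] T_1 ... T_m [SEP_T] x_1 ... x_n [SEP] (the exact ordering is shown in the paper's Figure 2; the function name is ours):

```python
from typing import List

def build_source_sequence(event_types: List[str],
                          doc_tokens: List[str]) -> List[str]:
    """Linearize a document into the model's source sequence:
    [CLS], all event-type tokens, [SEP_T], the document words, [SEP].
    Subword tokenization details are omitted in this sketch."""
    return ["[CLS]"] + list(event_types) + ["[SEP_T]"] \
           + list(doc_tokens) + ["[SEP]"]
```

Prepending the type tokens lets the decoder later point at them, so an event type is predicted the same way as any role-filler span.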
In the target sequence, [CLS] works as the start token and [SEP] denotes the boundary between role-filler entity extractions. For the <Role-filler Entities> of template i, we use the concatenation of the target entity extractions for each role, separated by the separator token ([SEP]). Each entity is represented by the beginning (b) and end (e) tokens of its first mention.
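A hypothetical target-side linearization, assuming roles are visited in a fixed order, [SEP] separates role slots, and [SEP_T] separates templates (the exact layout is an assumption based on the description above; the function name is ours):

```python
from typing import Dict, List, Tuple

def build_target_sequence(
        templates: List[Tuple[str, Dict[str, List[Tuple[str, str]]]]],
        roles: List[str]) -> List[str]:
    """Linearize gold templates into the target sequence (a sketch).
    Each template contributes its event-type token, then, per role in
    order, the (begin, end) tokens of each entity's first mention;
    role slots are separated by [SEP], templates by [SEP_T]."""
    out = ["[CLS]"]                        # start token
    for t_idx, (event_type, role_fillers) in enumerate(templates):
        if t_idx > 0:
            out.append("[SEP_T]")          # boundary between templates
        out.append(event_type)
        for r_idx, role in enumerate(roles):
            if r_idx > 0:
                out.append("[SEP]")        # boundary between role slots
            for begin_tok, end_tok in role_fillers.get(role, []):
                out.extend([begin_tok, end_tok])
    out.append("[SEP]")                    # end of sequence
    return out
```

Empty role slots simply contribute nothing between their separators, which is how null fillers are encoded in this sketch.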

Base Model and Decoding Constraints
Next we describe the base model as well as special decoding constraints for template filling.
BERT as Encoder and Decoder. Our model extends the GRIT model for REE. The base setup uses a single BERT (Devlin et al., 2019) model to process both the source and target token embeddings; to distinguish the encoder and decoder representations, it applies a partial causal attention mask on the decoder side. The concatenated sequence of source token embeddings (a_0, a_1, ..., a_{l_src}) and target token embeddings (b_0, b_1, ..., b_{l_tgt}) is passed through BERT to obtain their contextualized representations (â_0, â_1, ..., â_{l_src}) and (b̂_0, b̂_1, ..., b̂_{l_tgt}).

Pointer Decoding. In the final decoder layer, we replace word prediction with a simple pointer selection mechanism. For target time step t, we first compute the dot product between b̂_t and each contextualized source representation:

c_0, c_1, ..., c_{l_src} = b̂_t · â_0, b̂_t · â_1, ..., b̂_t · â_{l_src}

We then apply softmax to obtain the probability of pointing to each source token (which may be a word or an event type):

p_0, p_1, ..., p_{l_src} = softmax(c_0, c_1, ..., c_{l_src})

At test time, prediction is done with greedy decoding: at each time step, argmax selects the source token with the highest probability, and decoding stops when a stop token is predicted. We also add several special decoding constraints for template filling: (1) a downweighting factor (0.01) on the probability of generating [SEP] and [SEP_T], to calibrate recall; (2) a decoding cutoff that stops after the k-th template (k = the maximum number of events in one document); (3) a constraint ensuring that the pointers to the start and end tokens of an entity are in order.
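One greedy pointer-decoding step can be sketched in plain NumPy. This is a simplified, single-step illustration with our own function and argument names; the real model operates on batched BERT representations, and only constraint (1) from above is shown.

```python
import numpy as np

def pointer_decode_step(b_t: np.ndarray,
                        src_reprs: np.ndarray,
                        sep_ids,
                        downweight: float = 0.01):
    """One greedy pointer-decoding step (sketch).
    b_t:       decoder representation at step t, shape (d,)
    src_reprs: contextualized source representations, shape (l_src, d)
    sep_ids:   source indices of [SEP] / [SEP_T], whose probability is
               downweighted by 0.01 to calibrate recall (constraint 1).
    Returns the argmax source index and the pointer distribution."""
    scores = src_reprs @ b_t                 # c_i = b̂_t · â_i
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # softmax over source tokens
    probs[sep_ids] *= downweight             # downweight separator tokens
    probs /= probs.sum()                     # renormalize after downweighting
    return int(np.argmax(probs)), probs
```

In a full decoder, this step would be repeated, feeding each selected source token back in, until a stop token is chosen or the template cutoff (constraint 2) is reached.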
Experiments

MUC-4 consists of 1,700 documents with associated templates. We follow the data split of prior work: 1,300 documents for training, 200 documents (TST1+TST2) as the development set, and 200 documents (TST3+TST4) as the test set. We use the standard metric for template filling (Chinchor, 1992) and, as in previous work, map predicted templates to gold templates during evaluation so as to optimize scores. We follow content-based mapping restrictions, i.e., the event type of the template is considered essential for the mapping to occur. A missing template's slots are scored as missing, and a spurious template's slots are scored as spurious. Note that since we do not extract set fillers other than the event/template type, they do not affect the performance.
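The content-based mapping can be illustrated with a greedy sketch: a predicted template may only be mapped to a gold template of the same event type, and same-type pairs are matched best-score first. The official MUC scorer's mapping algorithm may differ; `score_fn` and the dictionary layout are our own illustration.

```python
from typing import Callable, Dict, List, Tuple

def map_templates(preds: List[Dict],
                  golds: List[Dict],
                  score_fn: Callable[[Dict, Dict], float]
                  ) -> List[Tuple[int, int]]:
    """Greedily map predicted templates to gold templates, restricted to
    pairs with matching event types, taking the best-scoring pairs first.
    Unmapped predictions count as spurious; unmapped golds as missing."""
    candidates = sorted(
        ((score_fn(p, g), i, j)
         for i, p in enumerate(preds)
         for j, g in enumerate(golds)
         if p["event_type"] == g["event_type"]),
        reverse=True)
    pairs, used_preds, used_golds = [], set(), set()
    for _score, i, j in candidates:
        if i not in used_preds and j not in used_golds:
            pairs.append((i, j))
            used_preds.add(i)
            used_golds.add(j)
    return pairs
```

Slot-level precision/recall/F1 would then be computed over the mapped pairs plus the missing and spurious templates.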
Baselines and Additional Related Work As an ablation baseline, we employ a pipeline, GRIT-PIPELINE, that first uses the GRIT model for role-filler entity extraction and then assigns event types to each of the entities, treating this as a multi-label classification problem. We assign types by transforming the problem into multi-class classification (MCC) (Spolaor et al., 2013). As there are 6 event types (i.e., kidnapping, attack, bombing, robbery, arson, forced work stoppage) in MUC-4, we use 2^6 = 64 labels for the MCC problem.
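The multi-label-to-multi-class (powerset) transformation above can be sketched as follows; the function names are ours, and the encoding simply treats each subset of the 6 event types as one of the 2^6 = 64 class labels via a bitmask.

```python
from typing import List, Set

def typeset_to_label(event_types: Set[str], all_types: List[str]) -> int:
    """Encode a set of event types as one of 2^m powerset class labels
    (bit i is set iff all_types[i] is present)."""
    return sum(1 << i for i, t in enumerate(all_types) if t in event_types)

def label_to_typeset(label: int, all_types: List[str]) -> Set[str]:
    """Decode a powerset class label back to its set of event types."""
    return {t for i, t in enumerate(all_types) if label & (1 << i)}
```

This lets a standard single-label classifier assign an entity to any combination of events, including none (label 0).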
We also compare to end-to-end baselines that do not model between-event dependencies. DYGIE++ (Wadden et al., 2019)[2] is a span-enumeration-based extractive model for information extraction: it enumerates all possible spans in the document and passes each span's representation through a classifier layer to predict whether the span is a role-filler entity and, if so, which role it fills. SEQTAGGING is a BERT-based sequence-tagging model for extracting role-filler entities. A role-filler entity can appear in templates of different event types (e.g., "Zarate armed forces" appears in both the attack and the bombing event). For both baselines, the prediction goal is multi-class classification. More specifically, we adapt the DYGIE++ output-layer implementation to first predict the role-filler entity's role class and then predict its event classes conditioned on the entity's role. Note that Chambers (2013) and Cheung et al. (2013) propose to do event schema induction with unsupervised learning; given their unsupervised nature, their empirical performance is worse than that of supervised models (Patwardhan and Riloff, 2009).

Results and Analysis
Results on the full test set are shown in Table 2. We report micro-average precision, recall, and F1. Our framework substantially outperforms the baseline extraction models on all three metrics, with approximately a 4% F1 increase over the end-to-end baselines; it also outperforms the GRIT-PIPELINE system by around 3% F1 (* denotes p < 0.05).
[2] Our own re-implementation.

Per-slot F1 scores are reported in Table 1. The results demonstrate that our framework more often predicts the correct event type and performs better on PERPIND and PERPORG, but achieves slightly worse performance than GRIT-PIPELINE on roles that appear later in the template (i.e., TARGET and VICTIM). We also found that DYGIE++ performs better on TARGET, mainly due to its high precision in role assignment for spans.

Between-Event Dependencies
We also show results (Table 3) on the subset of documents that contain more than one gold event. The F1 scores of all systems drop substantially compared with the single-/no-event case, confirming the difficulty of the task. Compared with the full-test setting in Table 2, the baselines all increase in precision and drop substantially in recall, while our approach's precision and recall drop only slightly. This change is understandable: the baseline systems are more conservative and tend to predict fewer templates. As the number of gold templates increases, those fewer predicted templates have a better chance of being matched, but recall drops accordingly.
How performance changes as E increases In Figure 3, we see that when the number of gold events E in the document is small (E = 1, 2), our approach performs on par with the pipeline-based and DYGIE++ baselines. However, as E grows larger, the baselines' F1 drops significantly (e.g., by over 10% as E grows from 2 to 3).

Qualitative Case Analysis Consider the input document (doc id TST3-MUC4-0080), which contains an attack and a bombing template. In the gold annotations, "Farabundo Marti National Liberation Front" acts as the PERPORG in both events. Our model correctly extracts the two events and the PERPORG in each, while DYGIE++ only predicts the attack event and its PERPORG role entity correctly. Although GRIT-PIPELINE gets both events correct, it fails to extract this PERPORG entity for the second event.

Conclusion
We revisit the classic NLP problem of template filling and propose an end-to-end learning framework called GTT. By modeling the relations between events, our approach better captures dependencies across the document and performs substantially better on multi-event documents.