Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning

Temporal relation classification is the pair-wise task of identifying the relation of a temporal link (TLINK) between two mentions, i.e. an event, a time expression, or the document creation time (DCT). The pair-wise setting leads to two crucial limitations: 1) two TLINKs involving a common mention do not share information; 2) existing models with independent classifiers for each TLINK category (E2E, E2T and E2D) cannot exploit the whole dataset. This paper presents an event-centric model that manages dynamic event representations across multiple TLINKs. In addition, our model deals with the three TLINK categories via multi-task learning to leverage the full size of the data. The experimental results show that our proposal outperforms state-of-the-art models and two strong transfer learning baselines on both the English and Japanese data.


Introduction
Reasoning over temporal relations relevant to an event mentioned in a document can help us understand when the event begins, how long it lasts, and how frequently it occurs. Starting with the TimeBank corpus (Pustejovsky et al., 2003), a series of temporal evaluation campaigns (TempEval-1, 2, 3) (Verhagen et al., 2009, 2010; UzZaman et al., 2012) have attracted growing research efforts.
Temporal relation classification (TRC) is the task of predicting the temporal relation (after, before, includes, etc.) of a TLINK from a source mention to a target mention. Less effort has been paid to exploring shared information across 'local' pairs and TLINK categories. In recent years, a variety of dense annotation schemas have been proposed to overcome the 'sparse' annotation of the original TimeBank. A typical one is the Timebank-Dense (TD) corpus (Chambers et al., 2014), which enforces dense annotation with the complete graph of TLINKs over the mentions in each pair of neighbouring sentences. Such dense annotation increases the chance of pairs sharing common events and raises the demand for managing 'global' event representations across pairs and TLINK categories.
However, globally managing the event representations of a whole document imposes an extremely heavy load on dense corpora. Timebank-Dense contains around 10,000 TLINKs in only 36 documents and is 7 times denser than the original TimeBank. Thus, we propose a simplified scenario called the Source Event Centric TLINK (SECT) chain. For each event ei in a document, we group all TLINKs containing the common source event ei into the ei-centric TLINK chain and align them with the chronological order in which the target mentions appear in the document. We assume that our system is capable of learning dynamic representations of the centric event ei along the SECT chain via a 'global' recurrent neural network (RNN).
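The chain construction described above can be sketched as follows. The dictionary keys and the convention of giving the DCT position -1 (so it sorts first) are our own illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def build_sect_chains(tlinks):
    """Group TLINKs by their source event and order each group
    chronologically by the target mention's position in the document.

    `tlinks` is a list of dicts with keys:
      'source'     - source event id, e.g. 'e1'
      'target'     - target mention id ('DCT', 'e2', 't1', ...)
      'target_pos' - token offset of the target in the document
                     (DCT is given position -1 so it comes first)
    """
    chains = defaultdict(list)
    for link in tlinks:
        chains[link['source']].append(link)
    for source in chains:
        chains[source].sort(key=lambda link: link['target_pos'])
    return dict(chains)

# The running example: all TLINKs whose source is e1
tlinks = [
    {'source': 'e1', 'target': 'e3',  'target_pos': 40},
    {'source': 'e1', 'target': 'DCT', 'target_pos': -1},
    {'source': 'e1', 'target': 'e2',  'target_pos': 12},
    {'source': 'e1', 'target': 't1',  'target_pos': 30},
]
chain = build_sect_chains(tlinks)['e1']
# chain targets now follow document order: DCT, e2, t1, e3
```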
DCT: 1998-02-27. "An intense manhunt (e1) conducted by the FBI and the Bureau of Alcohol, Tobacco and Firearms continues (e2) for Rudolph in the wilderness of western North Carolina. And this week (t1), FBI director Louie Freeh assigned more agents to the search (e3)."
We assume that dynamically updating the representation of 'manhunt' at the early step (e1, e2) will benefit the prediction at the later step (e1, e3) towards 'search': 'manhunt' should hold the same 'includes' relation to 'search', as the search is included in the continuing manhunt.
Our model further exploits a multi-task learning framework to leverage all three categories of TLINKs within the SECT chain scope. A common BERT (Devlin et al., 2019) encoder layer is applied to retrieve token embeddings. The global RNN layer manages the dynamic event and TLINK representations in the chain. Finally, our system feeds the TLINK representations into their corresponding category-specific (E2D, E2T and E2E) classifiers to calculate a combined loss.
The contributions of this work are as follows: 1) We present a novel source event centric model to dynamically manage event representations across TLINKs. 2) Our model exploits a multi-task learning framework with two common layers trained by a combined category-specific loss to overcome the data isolation among TLINK categories. The experimental results suggest the effectiveness of our proposal on two datasets. All the code of our model and the two baselines is released at https://github.com/racerandom/NeuralTime.

Related Work

Temporal Relation Classification
Most existing temporal relation classification approaches focus on extracting various features from the textual sentence in the local pair-wise setting. Inspired by the success of neural networks in various NLP tasks, Cheng and Miyao (2017) and Han et al. (2019a,b) propose a series of neural networks that achieve high accuracy with less feature engineering. However, these neural models still remain in the pair-wise setting. Meng and Rumshisky (2018) propose a global context layer (GCL) to store/read the resolved TLINK history on top of a pre-trained pair-wise classifier. However, they observe slow convergence when training the GCL and the pair-wise classifier simultaneously, and only minor improvement over their pair-wise classifier. Our model is distinguished from their work in three respects: 1) we constrain the model to a reasonable scope, i.e. the SECT chain; 2) we manage dynamic event representations, while their model stores/reads pair history; 3) our model integrates category-specific classifiers via multi-task learning, while they use the categories as features in one single classifier.

Multi-task Transfer Learning
In the past few years, several successful transfer learning models (ELMo, GPT and BERT) (Peters et al., 2018; Radford et al.; Devlin et al., 2019) have been proposed, significantly improving the state of the art on a wide range of NLP tasks. Liu et al. (2019) propose a single-task batch multi-task learning approach over a common BERT to leverage a large amount of cross-task data in the fine-tuning stage.
In this work, our model deals with various categories of TLINKs (E2E, E2T and E2D) in a batch of SECT chains to calculate the combined loss with the category-specific classifiers.

Non-English Temporal Corpora
Less attention has been paid to non-English temporal corpora. In 2014, Asahara et al. started the first corpus-based study of Japanese temporal information annotation, BCCWJ-Timebank (BT). We explore the feasibility of our model on this Japanese dataset.

BERT Sentence Encoder
We apply a pre-trained BERT to retrieve token embeddings of the input sentences. For a multiple-token mention, we treat the element-wise sum of its token embeddings as the mention embedding.
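A minimal sketch of this mention-embedding step, assuming the token embeddings are already extracted as a (sequence length, hidden size) array and the mention's token span is known:

```python
import numpy as np

def mention_embedding(token_embeddings, span):
    """Element-wise sum of the encoder token embeddings covering a
    multi-token mention.

    token_embeddings: (seq_len, hidden) array of per-token vectors
    span: (start, end) token index range of the mention, end exclusive
    """
    start, end = span
    return token_embeddings[start:end].sum(axis=0)

# toy example: a 2-token mention in a 5-token sentence, hidden size 4
embs = np.arange(20, dtype=float).reshape(5, 4)
vec = mention_embedding(embs, (1, 3))  # sums rows 1 and 2
```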

Source Event Centric RNN
After the BERT layer processing, the system collects all the mention embeddings appearing in the chain: {R_e1, R_DCT, R_e2, R_t1, R_e3}.
Our model assigns a 'global' two-layer gated recurrent unit (GRU) with the left-to-right direction to simulate the chronological order of the SECT chain and update the centric e1 embeddings. The original e1 embedding R_e1 is fed into the GRU as the initial hidden state. At the i-th TLINK step, the system inputs the target mention embedding to update the i-th e1 embedding R^i_e1, which is used to generate the (i+1)-th step TLINK embedding T^{i+1}. As shown in Figure 1, the 3rd TLINK embedding T^3_(e1,t1) is the concatenation of the 2nd-step R^2_e1 and the target embedding R_t1, where:

R^2_e1 = max(R_e1, GRU(R_e2, h^1))    (1)

The element-wise max is designed to keep the initial R_e1 as an anchor, avoiding a drop in the quality of the new hidden states after long sequential updating.
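The update scheme of Equation (1) could be sketched as follows. The way the initial hidden state is copied across the two GRU layers is our own assumption, not a detail given in the paper:

```python
import torch
import torch.nn as nn

class SourceEventCentricRNN(nn.Module):
    """Sketch of the global SEC recurrent layer: a left-to-right GRU
    updates the centric-event representation step by step, and an
    element-wise max with the original embedding anchors it against
    quality drift over long chains."""

    def __init__(self, hidden_size):
        super().__init__()
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=2,
                          batch_first=True)

    def forward(self, r_src, target_embs):
        # r_src: (batch, hidden) original centric-event embedding
        # target_embs: (batch, steps, hidden) targets in chain order
        h = r_src.unsqueeze(0).repeat(2, 1, 1)  # init hidden, both layers
        tlink_reprs, r_cur = [], r_src
        for i in range(target_embs.size(1)):
            tgt = target_embs[:, i:i + 1, :]
            # TLINK embedding: current centric state || target embedding
            tlink_reprs.append(torch.cat([r_cur, tgt.squeeze(1)], dim=-1))
            out, h = self.gru(tgt, h)
            # anchor: element-wise max with the original embedding
            r_cur = torch.max(r_src, out.squeeze(1))
        return torch.stack(tlink_reprs, dim=1)  # (batch, steps, 2*hidden)
```

Note that the first TLINK embedding is built from the untouched R_e1, and every later step concatenates the max-anchored update with the next target.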

Multi-category Learning
After obtaining all the TLINK embeddings {T^1_(e1,DCT), T^2_(e1,e2), T^3_(e1,t1), T^4_(e1,e3)} in the SECT chain via the two common layers, the system feeds them into the corresponding category-specific classifiers. Each classifier is built with one fully-connected linear layer and a Softmax layer. The system sums the category-specific losses into a combined loss to perform multi-category learning.
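A minimal sketch of the category-specific heads and the combined loss. Iterating link by link is for clarity only, and the 6-relation label set size is assumed from the TD schema:

```python
import torch
import torch.nn as nn

class MultiCategoryHead(nn.Module):
    """Category-specific classifiers (E2D, E2T, E2E) on top of the
    shared TLINK embeddings; their losses are summed into one combined
    multi-category objective."""

    def __init__(self, tlink_dim, num_rels=6):
        super().__init__()
        self.classifiers = nn.ModuleDict({
            cat: nn.Linear(tlink_dim, num_rels)
            for cat in ('E2D', 'E2T', 'E2E')
        })
        # CrossEntropyLoss applies log-softmax internally
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, tlink_embs, categories, labels):
        # tlink_embs: (n_links, tlink_dim); categories: 'E2D'/'E2T'/'E2E'
        total = 0.0
        for emb, cat, label in zip(tlink_embs, categories, labels):
            logits = self.classifiers[cat](emb)
            total = total + self.loss_fn(logits.unsqueeze(0),
                                         label.view(1))
        return total
```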

Experiments and Results
We conduct experiments applying the SEC model on both the English TD and Japanese BT corpora. Juman++ (Tolmachev et al., 2018) is adopted for morphological analysis of the Japanese text. The TD annotation adopts a 6-relation set (after, before, simultaneous, includes, is included, and vague). We follow the 'train/dev/test' data split of previous work. For BT, we follow the merged 6-relation set of Yoshikawa et al. (2014) and perform document-level 5-fold cross-validation.
In each split, we randomly select 15% of the training documents as the dev set. The TLINK statistics of the two corpora are listed in Table 1. We adopt the English and Japanese pre-trained 'base' BERT and empirically set the RNN hidden size equal to the BERT hidden size, 4 SECT chains per batch, 20 epochs, and AdamW (lr=5e-5). The other hyperparameters are selected based on dev micro-F1. All reported results are 5-run averages.
For lack of comparable transfer learning approaches, we build two BERT baselines as follows (fine-tuning 5 epochs, batch size 16):
• Local-BERT: the concatenation of the two mention embeddings serves as the TLINK embedding and is fed into an independent category-specific classifier.
• Multi-BERT: the multi-category setting of Liu et al. (2019) applied to Local-BERT. Each time, the system pops a single-category batch, encodes it via the common BERT, and feeds it to the category-specific classifier.
'Local-BERT' and 'Multi-BERT' serve as the baselines in the ablation test for the proposed 'SEC' model: 'Local-BERT' is the 'SEC' model with both the global RNN and multi-category learning removed; 'Multi-BERT' is the 'SEC' model with the global RNN removed.
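The single-category batching behind 'Multi-BERT' might look like the following sketch; the dictionary keys are illustrative assumptions:

```python
import random
from collections import defaultdict

def single_category_batches(tlinks, batch_size, seed=0):
    """Sketch of the multi-category training regime in the style of
    Liu et al. (2019): TLINKs are bucketed by category, split into
    single-category batches, and the batch order is shuffled across
    categories so every step updates the shared encoder."""
    buckets = defaultdict(list)
    for link in tlinks:
        buckets[link['category']].append(link)
    batches = []
    for cat, links in buckets.items():
        for i in range(0, len(links), batch_size):
            batches.append((cat, links[i:i + batch_size]))
    random.Random(seed).shuffle(batches)
    return batches
```

Each popped batch is homogeneous in category, so it can be routed to exactly one category-specific classifier after the shared BERT encoding.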

Asynchronous Training Strategy
Fine-tuning BERT while simultaneously training the SEC RNN is difficult. Standard fine-tuning requires only 3 to 5 epochs, which indicates that the pre-trained model tends to overfit quickly. However, the SEC RNN is randomly initialized and requires more training epochs. We compare three asynchronous training strategies:
• no freeze of the BERT sentence encoder
• freeze of the BERT sentence encoder
• freeze after k epochs

Figure 2 shows the validation micro-F1 of all TLINKs against the training epochs of the above asynchronous training strategies. 'no freeze' confirms our concern: the curve undulates after the initial 3 epochs. 'freeze' yields a stable learning phase but the lowest starting point. 'freeze after k epochs' achieves a balance of stability and high F1. Therefore, we adopt the third strategy for all the following experiments. The number k is selected from {3, 4, 5} based on the validation scores.

Table 2 shows the experimental results on the English TD corpus. 'CATENA' (Mirza and Tonelli, 2016) is a feature-based model combined with dense word embeddings. 'SDP-RNN' (Cheng and Miyao, 2017) is a dependency-tree enhanced RNN model. 'GCL' (Meng and Rumshisky, 2018) is the global context layer model introduced in § 2.1. 'Fine-grained TRC' (Vashishtha et al., 2019) is an ELMo-based fine-grained TRC model with only the E2E results reported.
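The 'freeze after k epochs' strategy can be realized by toggling gradient tracking on the encoder at the start of each epoch. This is a sketch of the idea, not the authors' exact training loop:

```python
import torch.nn as nn

def apply_freeze_schedule(encoder: nn.Module, epoch: int, k: int):
    """'Freeze after k epochs': the sentence encoder's parameters are
    trainable for the first k epochs (epoch indices 0..k-1), then
    frozen so that only the randomly initialized SEC RNN and the
    classifiers continue to update."""
    freeze = epoch >= k
    for param in encoder.parameters():
        param.requires_grad = not freeze
```

Called once per epoch before building the optimizer step, this keeps the pre-trained encoder from overfitting while the SEC RNN keeps training.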

Main Timebank-Dense Results
It is not surprising that the proposed model substantially outperforms state-of-the-art systems, since the existing SOTA systems do not exploit BERT. Therefore, we offer an ablation test with 'Local-BERT' (w/o multi-category learning and the global SEC RNN) and 'Multi-BERT' (w/o the global SEC RNN) to investigate the benefits of our two contributions. The 'SEC' model obtains +3.2, +6.8 and +5.2 F1 improvements over 'Local-BERT', which suggests the effectiveness of the two main proposals. The 'SEC' model further outperforms 'Multi-BERT' by a 3.6 gain on the majority category E2E, a 1.0 gain on E2T and a 0.7 gain on E2D, which indicates the impact of the global SEC RNN. A main finding is that E2E obtains higher gains from 'global' contexts compared to E2T and E2D. This matches the intuition that events are more globally contextualized, while time expressions are usually more self-represented (e.g. normalized time values). E2D mainly requires contextual information from single sentences via the BERT encoder. E2T takes less advantage of BERT, while multi-category training with E2E and E2D significantly improves its performance.

Table 3 shows the results on the Japanese corpus. Different from the TD annotation schema, BT specifies two E2E categories to fit the Japanese language: 1) E2E: between two consecutive events; 2) MAT: between two consecutive matrix-verb events.

Results on Non-English Data
The state-of-the-art system on BT is the feature-based approach of Yoshikawa et al. (2014). The comparisons are similar to those on the English data: our 'SEC' model obtains substantial improvements over their work and the two BERT baselines. An interesting observation is that MAT TLINKs are usually inter-sentential and located at the end of SECT chains, as Japanese is an 'SOV' language. The results indicate that long-distance MAT suffers from low-quality representations in the 'local' setting and benefits more from 'global' representations.

Conclusion
This paper presents a novel transfer learning based model to boost the performance of temporal relation classification, especially on densely annotated datasets. Our model dynamically updates event representations across multiple TLINKs within a Source Event Centric chain scope, and exploits a multi-category learning framework to leverage the full data of the three TLINK categories.
The empirical results show that our proposal outperforms state-of-the-art systems, and the ablation tests suggest the effectiveness of the two main proposals. The non-English experiments support the feasibility of our system on the Japanese data.