TEST_POSITIVE at W-NUT 2020 Shared Task-3: Cross-task modeling

The competition of extracting COVID-19 events from Twitter is to develop systems that can automatically extract related events from tweets. The built system should identify different pre-defined slots for each event, in order to answer important questions (e.g., Who is tested positive? What is the age of the person? Where is he/she?). To tackle these challenges, we propose the Joint Event Multi-task Learning (JOELIN) model. Through a unified global learning framework, we make use of all the training data across different events to learn and fine-tune the language model. Moreover, we implement a type-aware post-processing procedure using named entity recognition (NER) to further filter the predictions. JOELIN outperforms the BERT baseline by 17.2% in micro F1.


Introduction
In this work, we report the system architecture and results of the team TEST POSITIVE in the W-NUT 2020 Shared Task-3: extracting COVID-19 events from Twitter.
Since February 2020, the COVID-19 pandemic has been spreading all over the world, posing a significant threat to mankind in every aspect. Sharing information about a pandemic is critical in stopping the spread of the virus. With the recent advances in social networks and machine learning, we are able to automatically detect potential events of COVID cases and identify key information to prepare ahead.
We are interested in COVID-19 related event extraction from tweets. With the prevalence of coronavirus, Twitter has been a valuable source of news and information. Twitter users share COVID-19 related topics about personal narratives and news on social media (Müller et al., 2020). The information could be helpful for doctors, epidemiologists, and policymakers in controlling the pandemic. However, manually extracting useful information from a tremendous amount of tweets is impossible. Hence, we aim to develop a system to automatically extract structured knowledge from Twitter.
1 https://github.com/Chacha-Chen/JOELIN
Extracting COVID-19 related events from Twitter is non-trivial due to the following challenges: (1) How to deal with limited annotations in heterogeneous events and subtasks? The creation of the annotated data relies completely on human labor, and thus only a limited amount of data can be obtained for each event category. There are also a variety of event types and subtasks. Many existing works address this low-resource problem with different approaches, including crowdsourcing (Müller et al., 2020;Finin et al., 2010;Potthast et al., 2018), unsupervised training (Xie et al., 2019;Hsu et al., 2017), or multi-task learning (Zhang and Yang, 2017;Pentyala et al., 2019). Here we adopt a multi-task training paradigm to benefit from inter-event and intra-event (subtask) information sharing. In this way, JOELIN learns a shared embedding network globally from the data of all events, and we implicitly augment the dataset by global training and fine-tuning of the language model.
(2) How to make type-aware predictions? Existing work (Zong et al., 2020) did not encode the information of different subtask types into the model, although it could be useful in suggesting the candidate slot entity type. In order to make type-aware predictions, we propose a NER-based post-processing procedure at the end of the JOELIN pipeline. We use NER to automatically tag the candidate slots and remove the candidates whose entity type does not match the corresponding subtask type. For example, as shown in Figure 1, in subtask "Who", "my wife's grandmother" is a valid candidate slot, while "old persons home", tagged as a location entity, would be replaced with "Not Specified" during the post-processing.
In summary, JOELIN is enabled by the following technical contributions:
• A joint event multi-task learning framework for different events and subtasks. With the unified global training framework, we train and fine-tune the language model across all events and make predictions based on multi-task learning to learn from limited data.
• A NER-based type-aware post-processing approach. We leverage NER tagging on the model predictions and filter out wrong predictions based on subtask types. In this way, JOELIN benefits from subtask type prior knowledge and further boosts the performance.

Related Work
Event Extraction from Twitter Impressive efforts have been made to detect events from Twitter. Existing works include domain specific event extraction and open domain event extraction. For domain specific extraction, approaches mainly focus on extracting a particular type of event, including natural disasters (Sakaki et al., 2010), traffic events (Dabiri and Heaslip, 2019), user mobility behaviors (Yuan et al., 2013), etc. The open domain scenario is more challenging and usually relies on unsupervised approaches. Existing works usually create clusters with event-related keywords (Parikh and Karlapalem, 2013) or named entities (McMinn and Jose, 2015;Edouard et al., 2017). Additionally, Ritter et al. (2012) and Zhou et al. (2015) design general pipelines to extract and categorize events in a supervised and an unsupervised manner, respectively. Different from previous works, we deal with COVID-19 related event extraction in particular. Zong et al. (2020) provide a BERT baseline for the same task. In contrast, we create a unified framework that learns simultaneously from different categories of events and subtasks.
Type-aware Slot Filling Yang et al. (2016) formulate entity type constraints and use integer linear programming to combine them with relation classification. Adel and Schütze (2019) propose to integrate entity and relation classes in convolutional neural networks and learn the correlation from data. We propose a NER-based post-processing technique for type-aware slot filling. By filtering out entity mismatched predictions, JOELIN can efficiently boost the performance with minimum hand-crafted rules.

COVID-19 Twitter Analysis
With the quarantine situation, people share thoughts and make comments about COVID-19 on Twitter, which has become a data source for researchers to explore and study. Singh et al. (2020) show that Twitter conversations indicate a spatio-temporal relationship between information flow and new cases of COVID-19. There is some work on COVID-19 datasets. Banda et al. (2020) provide a large-scale curated dataset of over 152 million tweets. Chen et al. (2020) collect tweets and form a multilingual COVID-19 Twitter dataset. Based on the collected data, Jahanbin and Rahmanian (2020) propose a model to predict COVID-19 breakout by monitoring and tracking information on Twitter. Though there are some works on COVID-19 tweet analysis (Müller et al., 2020;Jimenez-Sotomayor et al., 2020;Lopez et al., 2020), the work on automatically extracting structured knowledge of COVID-19 events from tweets is still limited.

Method
In this section, we introduce our approach JOELIN and its data pre-processing and post-processing steps in detail. First, we pre-process the noisy Twitter data following the data cleaning procedures in Müller et al. (2020). Second, we train JOELIN and fine-tune the pre-trained language model end-to-end. Specifically, we design the JOELIN classifier in a joint event multi-task learning framework. Moreover, we provide four options of embedding types and ensemble the outputs with the highest validation score. Finally, we further utilize NER techniques to post-process our results with minimum hand-crafted rules.

Data Pre-processing
Figure 2: Our approach comprises two main components: (1) a global language model across events and subtasks; (2) a multi-task learning classifier.
Prior to training, the original tweets are cleaned following Müller et al. (2020) [...] <URL>, and COVID-19 related tags, such as #COVID19, #coronavirus, #COVID etc., with <COVID_TAG>. Note that the data cleaning step is treated as a hyper-parameter and can be switched on or off during the experiments.
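The token replacement described above could be sketched roughly as follows. This is only an illustration: the regular expressions and the function name clean_tweet are hypothetical, and the assumption that web links are what gets mapped to <URL> is inferred from the token name, since the source only names the <URL> and <COVID_TAG> tokens.

```python
import re

# Hypothetical patterns: the paper names the placeholder tokens but not the rules.
URL_PATTERN = re.compile(r"https?://\S+")
COVID_TAG_PATTERN = re.compile(r"#(?:covid19|coronavirus|covid)\b", re.IGNORECASE)


def clean_tweet(text: str, enabled: bool = True) -> str:
    """Replace links and COVID-19 hashtags with placeholder tokens."""
    if not enabled:
        # Cleaning is a switchable hyper-parameter, so it can be skipped entirely.
        return text
    text = URL_PATTERN.sub("<URL>", text)
    return COVID_TAG_PATTERN.sub("<COVID_TAG>", text)


cleaned = clean_tweet("Tested positive #COVID19 https://t.co/abc")
```

The enabled flag mirrors the on/off switch mentioned in the text, so the same pipeline can be run with and without cleaning.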
We construct the training instance as follows. The annotated data is a collection of tweets. Each tweet is accompanied by hand-labeled candidate chunks. Each candidate chunk is extracted and sandwiched by a pair of tokens <E> and </E>. The masked text, together with the annotated label, will then serve as one instance of the input.
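A minimal sketch of this instance construction is given below. The function name build_instance, the use of character offsets to locate the chunk, and the spaces placed around the markers are assumptions made for illustration; the source only states that the chunk is enclosed by <E> and </E>.

```python
from typing import Dict, Union


def build_instance(tweet: str, start: int, end: int, label: int) -> Dict[str, Union[str, int]]:
    """Wrap the candidate chunk tweet[start:end] in <E> ... </E> markers."""
    if not 0 <= start < end <= len(tweet):
        raise ValueError("candidate span lies outside the tweet")
    marked = f"{tweet[:start]}<E> {tweet[start:end]} </E>{tweet[end:]}"
    # One training instance: the marked text plus its annotated label.
    return {"text": marked, "label": label}


tweet = "My wife's grandmother tested positive"
chunk = "My wife's grandmother"
begin = tweet.find(chunk)
instance = build_instance(tweet, begin, begin + len(chunk), label=1)
```

Each candidate chunk of a tweet yields its own instance, so one tweet with several candidates produces several inputs.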

The JOELIN Model
JOELIN consists of four modules as shown in Figure 2: the pre-trained COVID Twitter BERT (CT-BERT) (Müller et al., 2020), four different embedding layers, joint event multi-task learning framework with global parameter sharing, and the output ensemble module.
COVID Twitter BERT It has been a common practice to use pre-trained language models, e.g., BERT (Devlin et al., 2018) and RoBERTa, for supervised fine-tuning on specific downstream tasks. In this work, we use CT-BERT as the pre-trained language model of JOELIN. CT-BERT is trained on a corpus of 160M tweets related to COVID-19 and shows great improvement compared to BERT-LARGE and RoBERTa. We further fine-tune CT-BERT with the provided dataset.

Feature Extraction
With the hidden representation of token <E> given by CT-BERT, we further apply different feature extraction methods to select the most useful features. Inspired by Devlin et al. (2018), we implemented the following four feature extraction methods:
1. Last hidden layer: we directly use the last hidden layer of CT-BERT as our classifier input.
2. Summation of last four: we sum the last four hidden layer outputs as the classifier input.
3. Concatenation of last four (type-1): we directly concatenate the last four layers, and flatten the vector before feeding it to the classifier.
4. Concatenation of last four (type-2): each of the last four layers is passed through a fully-connected layer and reduced to a quarter of its original hidden size. We flatten the vectors before passing them through the classifier.
Joint Event Multi-task Learning To tackle the challenge of limited annotated data, we apply a global parameter sharing model across all events. Specifically, we jointly learn and fine-tune the language embedding across different events and apply a multi-task classifier for prediction. As shown in Figure 2, the language embedding as well as the feature extraction mechanism are jointly learned and fine-tuned globally. We then apply a fully-connected layer as our classifier for all the subtasks in different categories of events. In this way, JOELIN benefits from using the data of all the events and their subtasks. Compared with training separate models for each event, joint training across different tasks significantly boosts the performance.
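The four feature extraction options listed above can be illustrated with a framework-free sketch. It assumes that layers holds one hidden vector per encoder layer, taken at the <E> token position. All function names are hypothetical, and the bias-free linear helper with fixed weights only stands in for the learned fully-connected layers of the type-2 option.

```python
from typing import List

Vector = List[float]


def last_layer(layers: List[Vector]) -> Vector:
    """Option 1: the last hidden layer at the <E> position."""
    return list(layers[-1])


def sum_last_four(layers: List[Vector]) -> Vector:
    """Option 2: element-wise sum of the last four layers."""
    return [sum(values) for values in zip(*layers[-4:])]


def concat_last_four(layers: List[Vector]) -> Vector:
    """Option 3 (type-1): concatenate the last four layers into one flat vector."""
    return [value for layer in layers[-4:] for value in layer]


def linear(weights: List[Vector], vector: Vector) -> Vector:
    """Bias-free fully-connected layer: one output per weight row (a simplification)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def concat_reduced_last_four(layers: List[Vector], projections: List[List[Vector]]) -> Vector:
    """Option 4 (type-2): project each layer to a quarter of its size, then concatenate."""
    reduced = [linear(w, layer) for w, layer in zip(projections, layers[-4:])]
    return [value for vector in reduced for value in vector]


# Toy input: 6 layers with hidden size 4; layer i is filled with the value i.
layers = [[float(i)] * 4 for i in range(6)]
# One (hidden/4 x hidden) = (1 x 4) projection per layer; learned in a real model.
projections = [[[1.0] * 4] for _ in range(4)]
features = concat_reduced_last_four(layers, projections)
```

Options 1, 2, and 4 keep the classifier input at the hidden size, while option 3 makes it four times wider.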
Model Ensemble It has long been observed that ensembles of models boost overall performance. Hence, in this work, we train multiple models with different feature extraction approaches, select the top 5 models with the best performance, and ensemble them by majority voting. Note that we choose the best model for each subtask in validation. In testing, we use the average of the top 5.
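As a rough illustration of the selection and voting steps, the sketch below picks the best models by a validation score and combines their binary predictions by majority vote. The function names and the dictionary of validation scores are hypothetical; the source does not specify how scores or predictions are stored.

```python
from typing import Dict, List, Sequence


def select_top_k(validation_scores: Dict[str, float], k: int = 5) -> List[str]:
    """Return the names of the k models with the highest validation score."""
    ranked = sorted(validation_scores, key=validation_scores.get, reverse=True)
    return ranked[:k]


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary predictions (one row per model) candidate by candidate."""
    if not predictions:
        raise ValueError("need predictions from at least one model")
    width = len(predictions[0])
    if any(len(row) != width for row in predictions):
        raise ValueError("all models must predict the same candidates")
    # A candidate is positive when strictly more than half of the models say so.
    return [int(2 * sum(column) > len(column)) for column in zip(*predictions)]


votes = majority_vote([[1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]])
```

With an odd number of models, such as the five used here, a binary vote cannot tie.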

NER-based Post-processing
We further filter our predictions with NER-based post-processing. Specifically, we use spaCy's NER model to tag the predicted candidate slots. We collect all the tags of each word in the candidate slots. Then we compare the entity tags with the subtask type. If one of the candidate tags does not match the subtask type, we invalidate the prediction by replacing it with "NOT SPECIFIED". For example, if the subtask is "who", we nullify those candidate slots whose tags are not related to persons, as shown in Figure 1.
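The filtering rule can be sketched as below. To keep the example runnable without a model download, the tagger is passed in as a function and a small lookup table stands in for it. The mapping ALLOWED_LABELS, the function names, and the choice to keep candidates with no tagged entity are all assumptions, not details given in the source.

```python
from typing import Callable, Dict, List, Set

NOT_SPECIFIED = "NOT SPECIFIED"

# Hypothetical subtask-to-label mapping, using spaCy-style entity label names.
ALLOWED_LABELS: Dict[str, Set[str]] = {
    "who": {"PERSON"},
    "where": {"GPE", "LOC", "FAC"},
}


def filter_candidate(candidate: str, subtask: str,
                     tag_entities: Callable[[str], List[str]]) -> str:
    """Replace a candidate whose entity labels conflict with the subtask type."""
    allowed = ALLOWED_LABELS.get(subtask)
    if allowed is None:
        return candidate  # no type constraint assumed for this subtask
    if any(label not in allowed for label in tag_entities(candidate)):
        return NOT_SPECIFIED
    return candidate


# Stand-in tagger. With spaCy installed, one could instead use something like
# lambda text: [ent.label_ for ent in nlp(text).ents], with nlp a loaded pipeline.
FAKE_TAGS = {"old persons home": ["FAC"], "my wife's grandmother": []}


def fake_tagger(text: str) -> List[str]:
    return FAKE_TAGS.get(text, [])


kept = filter_candidate("my wife's grandmother", "who", fake_tagger)
dropped = filter_candidate("old persons home", "who", fake_tagger)
```

Injecting the tagger keeps the rule itself independent of any particular NER model.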

Implementation Details
We randomly split the dataset into training and validation sets in an 80:20 ratio. The model is trained with the AdamW optimizer (Loshchilov and Hutter, 2017) to minimize the binary cross entropy loss, with a batch size of 32 and a learning rate of 2e-5.
To deal with the class imbalance issue, we apply class weighting on the loss function. With grid search, the best weights are 10 and 1 for positive and negative samples, respectively.
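One way to read the class weighting is as a per-sample weight on the binary cross entropy terms, sketched below in plain Python. The function name, the clipping constant, and the averaging over the batch are assumptions; only the weights 10 and 1 come from the text.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: float = 10.0, neg_weight: float = 1.0) -> float:
    """Mean binary cross entropy with separate weights for the two classes."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and aligned")
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * y * math.log(p) + neg_weight * (1 - y) * math.log(1.0 - p)
    return total / len(probs)


loss = weighted_bce([0.9, 0.2, 0.4], [1, 0, 1])
```

With these weights, a missed positive costs ten times as much as an equally confident mistake on a negative, which pushes the classifier toward recall on the rare positive class.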

Results and Discussion
We evaluate JOELIN against BERT and CT-BERT baselines. We measure the performance of different models with the F1 score and the micro F1 score, in consideration of imbalanced sample sizes. The overall results are shown in Table 1. Compared with BERT (Zong et al., 2020) and CT-BERT (Müller et al., 2020), JOELIN significantly outperforms the best baseline, CT-BERT, by 7.6% in micro F1. In terms of performance on subtasks, JOELIN outperforms CT-BERT by up to 44.9% on the recent travel subtask of the event TESTED POSITIVE. The performance gains of JOELIN are attributed to the well-designed joint event multi-task learning framework and the type-aware NER-based post-processing.
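For readers less familiar with the two metrics, the sketch below contrasts the per-subtask F1 score with the micro F1 score, which pools the counts of all subtasks before computing a single score. The counts are invented toy numbers, not results from this work, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(per_subtask: Dict[str, Counts]) -> float:
    """Pool the counts over all subtasks, then compute one F1 score."""
    tp = sum(counts[0] for counts in per_subtask.values())
    fp = sum(counts[1] for counts in per_subtask.values())
    fn = sum(counts[2] for counts in per_subtask.values())
    return f1(tp, fp, fn)


# Toy counts for illustration only.
toy_counts = {"who": (8, 2, 2), "where": (1, 3, 1)}
per_subtask_f1 = {name: f1(*counts) for name, counts in toy_counts.items()}
overall = micro_f1(toy_counts)
```

Because the counts are pooled, subtasks with more samples carry more weight in the micro score than in a plain average of per-subtask scores.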

Ablation Study
We conduct an ablation study to understand the contribution of type-aware post-processing in JOELIN. We remove the post-processing step to obtain a reduced model (JOELIN-P) and compare the micro F1 scores. As shown in Table 2, JOELIN achieves a higher micro F1 score than the reduced model JOELIN-P. This supports the claim that our proposed type-aware post-processing with NER can significantly boost the performance.

Conclusion
In this work, we build JOELIN upon a joint event multi-task learning framework. We use NER-based post-processing to generate type-aware predictions.
The results show JOELIN significantly boosts the performance of extracting COVID-19 events from noisy tweets over BERT and CT-BERT baselines.
In the future, we would like to extend JOELIN to open domain event extraction tasks, which are more challenging and require a more general pipeline.