Winners at W-NUT 2020 Shared Task-3: Leveraging Event Specific and Chunk Span information for Extracting COVID Entities from Tweets

Twitter has acted as an important source of information during disasters and pandemics, especially during COVID-19. In this paper, we describe our system entry for WNUT 2020 Shared Task-3. The task was aimed at automating the extraction of a variety of COVID-19 related events from Twitter, such as individuals who recently contracted the virus, individuals with symptoms who were denied testing, and believed remedies against the infection. The system consists of separate multi-task models for slot-filling subtasks and sentence-classification subtasks, while leveraging the useful sentence-level information for the corresponding event. The system uses COVID-Twitter-BERT with attention-weighted pooling of candidate slot-chunk features to capture the useful information chunks. The system ranks 1st on the leaderboard with an F1 of 0.6598, without using any ensembles or additional datasets.


Introduction
The World Health Organization declared COVID-19 a global pandemic on March 11, 2020. As of 2020/09/21, there are over 30 million cases and 900,000 deaths due to the infection. With the imposed lockdowns, work from home and physical distancing, social media platforms like Twitter saw increased usage. A large part of this use was posting and consuming information on the novel infection. This information includes potential reasons for contracting the disease, such as exposure to a family member who tested positive, and reports of someone who is showing COVID symptoms but was denied testing. Accompanying the pandemic was an infodemic of misinformation about COVID-19, including fake remedies, treatments and prevention suggestions on social media (Alam et al., 2020). Zong et al. (2020) show that it is possible to automatically extract structured knowledge on COVID-19 events from Twitter and released a dataset of COVID-related tweets across 5 event types. We used this dataset in our experiments for the shared task. These tweets are annotated for whether they belong to an event (we refer to this as the event-prediction task in this paper) and for their event-specific questions (factual or opinion). We divide these event-specific questions into two types of subtasks: slot-filling and sentence classification.
Our system consists of separate multi-task models for slot-filling subtasks and sentence-classification subtasks. Our contribution comprises improvement upon the baseline (mentioned in section 2) in three ways:
• We incorporate the event-prediction task as an auxiliary subtask and fuse its features for all the event-specific subtasks.
• We perform an attention-weighted pooling over the candidate chunk span, enabling the model to attend to subtask-specific cues.

Related Works
Sentence classification tasks (such as opinion or sentiment mining) as well as slot-filling tasks have greatly progressed with deep learning advancements such as LSTM (Hochreiter and Schmidhuber, 1997), Tree-LSTM (Tai et al., 2015) and transfer learning over pre-trained models (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). Among these, CT-Bert outperforms others on COVID-related Twitter tasks (Müller et al., 2020). Taking inspiration from the same, we use CT-Bert as part of our architecture. A variety of slot-filling approaches have been built on top of these deep learning advancements (Kurata et al., 2016; Qin et al., 2019). The proposed baseline for our task (Zong et al., 2020) modifies the Bert model for the slot-filling problem, inspired by Baldini Soares et al. (2019). Due to the excellent performance offered by Bert (Devlin et al., 2019) and Baldini Soares et al. (2019), we build upon this baseline approach.
arXiv:2012.10052v1 [cs.CL] 18 Dec 2020
Due to the fast-spreading nature of the infection, it is also difficult to manually trace the spread of the pandemic. However, with event-specific entity extraction from Twitter and geolocation, one could potentially build a real-time pandemic surveillance system (Lwowski and Najafirad, 2020; Al-Garadi et al., 2020). Bal et al. (2020) show that health-related misinformation is prevalent in social media, while Alam et al. (2020) discuss COVID-specific misinformation. Systems that extract structured knowledge from tweets about potential cures for COVID will help study how users perceive COVID misinformation.
In §3, we describe the dataset and the problem statement. Then in §4, we discuss the details of our two multi-task models, followed by experiments, results and conclusion.

Dataset and Problem statement
We now briefly go over the dataset. The reader may refer to Zong et al. (2020) for full details. Each of the 7500 tweets in the dataset belongs to one of the 5 event types: tested-positive, tested-negative, can-not-test, death, and cure. The first four events are aimed at extracting structured reports of coronavirus-related events, such as self-reported cases or news stories about public figures who were exposed to the virus. Each tweet was first annotated for whether it belongs to its respective event (e.g., is the tweet belonging to the tested-positive event talking about someone who tested positive?). Throughout this paper, we refer to this as the Event-Prediction task. The tweets that correspond to their event were then annotated for event-specific questions or subtasks about factual information and the user's opinions. All annotations were done by multiple Amazon Mechanical Turk workers with inter-annotator agreement.
The event-specific questions or subtasks (e.g., name, age, gender of the person who tested positive) vary depending on the event. These subtasks are of two categories: slot-filling (e.g., Who tested positive/negative?, Where are they located?, Who is in close contact with the person contracting the disease?) and sentence classification (e.g., Is the author related to the infected person?, Does the author experience any symptoms?, Does the author believe a cure method is effective?).
The dataset release contains tweet IDs and their annotations. We obtain the text corresponding to the tweets using the official Twitter API. Table 1 shows the statistics for the dataset we scraped in early July. Figure 1 shows an annotated example from the dataset. We divide the event-specific subtasks into two categories, shown in Table 2.
We now formally describe the two types of event-specific subtasks.
Slot-filling subtasks: Assume $n$ slot-filling subtasks $\{S_1, S_2, \ldots, S_n\}$. We set up each slot-filling subtask $S_i$ as a supervised binary classification problem. Given the tweet $t$ and the candidate slot $s$, the model $f(t, s) \rightarrow \{0, 1\}$ predicts whether $s$ answers its designated question. We extract a list of candidate slots consisting of all noun chunks and named entities in each of the tweets by using a Twitter tagging tool (Ritter et al., 2011), same as the baseline.
Sentence classification subtasks: Assume $m$ sentence classification subtasks $\{C_1, C_2, \ldots, C_m\}$. Each sentence classification subtask $C_i$ aims to learn a model $g(t) \rightarrow \{l_1, l_2, \ldots, l_k\}$, where $t$ is a tweet and $l_j$ is a label. Here the number of labels can vary depending on the subtask; for example, gender is labelled with {Male, Female, Others/Not Specified}, Relation with {Yes, No}, Opinion with {effective, no cure, not effective, no opinion} and so on. All these subtasks are supervised classification problems.
The dataset is also annotated with whether a tweet corresponds to its respective event or not. We treat this as an additional Event-Prediction task. This is a binary classification task that aims to learn a model $h(t) \rightarrow \{0, 1\}$, where $t$ is a tweet.
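To make the slot-filling set-up concrete, the following minimal Python sketch turns one tweet and its candidate slots into binary instances, one label per subtask. The tweet, the candidate list, the subtask names and the exact-string matching against gold answers are illustrative assumptions, not the dataset's actual annotation format.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class SlotInstance:
    tweet: str
    candidate: str
    labels: Dict[str, int]  # subtask name -> 1 if the candidate answers it, else 0


def build_slot_instances(
    tweet: str, candidates: List[str], gold: Dict[str, Set[str]]
) -> List[SlotInstance]:
    """Create one instance per candidate slot, with a binary label per subtask.

    `gold` maps a subtask name to the set of annotated answer strings
    (hypothetical format; exact string match is an assumption).
    """
    instances = []
    for cand in candidates:
        labels = {task: int(cand in answers) for task, answers in gold.items()}
        instances.append(SlotInstance(tweet, cand, labels))
    return instances


# Invented example tweet and annotations, for illustration only.
tweet = "My neighbor in Ohio tested positive yesterday"
candidates = ["My neighbor", "Ohio", "yesterday"]
gold = {"who": {"My neighbor"}, "where": {"Ohio"}, "when": {"yesterday"}}
instances = build_slot_instances(tweet, candidates, gold)
```

Each instance corresponds to one evaluation of $f(t, s)$ per subtask, so a tweet with three candidates yields three instances.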

Approach
In the following subsections §4.1 and §4.2, we describe our multi-task models for slot-filling and sentence classification respectively.

Slot-filling
We improve upon the baseline (Zong et al., 2020) by using a domain-specific Bert, using attention-weighted pooling over the candidate chunk feature sequence, incorporating an auxiliary Event-Prediction task, and utilizing its logits for all the slot-filling subtasks. Before describing the approach, we first describe the Bert baseline. Our slot-filling model can be seen in Figure 2.
The baseline consists of a Bert-based classifier. It takes a tweet $t$ as input and encloses the candidate slot $s$, within the tweet, inside special entity start <E> and end </E> markers. The Bert hidden representation of the token <E> is then processed through a fully connected layer with softmax activation to make the binary prediction for a task (Baldini Soares et al., 2019). Since many slot-filling tasks within an event are semantically related to each other, they jointly trained the final softmax layers of all the subtasks $S_i$ in an event by sharing their Bert model parameters.
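The entity-marker input format can be sketched in a few lines of Python. This is only an illustration of enclosing a candidate slot in <E> and </E>; the helper names, the spacing around the markers and the example tweet are assumptions, and real inputs would additionally go through the Bert tokenizer.

```python
from typing import List


def mark_candidate(tweet: str, start: int, end: int) -> str:
    """Wrap the character span tweet[start:end] in <E> ... </E> markers."""
    return f"{tweet[:start]}<E> {tweet[start:end]} </E>{tweet[end:]}"


def marked_inputs(tweet: str, candidates: List[str]) -> List[str]:
    """Build one marked copy of the tweet per candidate slot."""
    inputs = []
    for cand in candidates:
        start = tweet.find(cand)
        if start == -1:
            continue  # skip candidates that do not occur verbatim in the tweet
        inputs.append(mark_candidate(tweet, start, start + len(cand)))
    return inputs


# Invented example tweet, for illustration only.
tweet = "My uncle tested positive in Ohio"
inputs = marked_inputs(tweet, ["My uncle", "Ohio"])
```

Because every candidate produces its own marked copy of the tweet, a tweet with k candidates requires k encoder forward passes under this scheme.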
COVID Twitter Bert (CT-Bert) is a Bert-Large model pretrained on a Twitter corpus on COVID-19 topics, leading to marginal improvements over Bert on tasks based on Twitter datasets (Müller et al., 2020). This motivates us to use CT-Bert instead of the Bert used in the baseline model.
The baseline uses the Bert hidden representation of the token <E> for classification. Here, however, we use an attention-weighted pool of the CT-Bert hidden representations of the tokens between <E> and </E> (both inclusive). Formally, let $\{x_0, \ldots, x_p, \ldots, x_q, \ldots, x_n\}$ be the output vectors from the hidden representation of CT-Bert, where $p$ and $q$ are the indices of <E> and </E> respectively. Then for any slot-filling subtask $S_j$, we get its pooled vector $x^{S_j}$ as follows:

$$x^{S_j} = \sum_{i=p}^{q} \alpha_i^{S_j} x_i \qquad (1)$$

$$\alpha_i^{S_j} = \mathrm{Softmax}_{p \text{ to } q}\left(x_i^T a^{S_j}\right)$$

where $x_i^T$ denotes the transpose of $x_i$ and $a^{S_j}$ is a trainable vector. The motivation for attention-weighted pooling is that, depending on the task, the model can attend to different portions of the candidate slot chunk. Next we obtain the binary classification score vector:

$$h^{S_j} = W^{S_j} x^{S_j} + b^{S_j} \qquad (2)$$

Here $W^{S_j}$ and $b^{S_j}$ are trainable parameters.
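The attention-weighted pooling over the span between the two markers can be sketched in plain Python as follows. The function name and the toy two-dimensional vectors are illustrative assumptions; in the actual system the vectors would be CT-Bert hidden states and the attention vector would be learned.

```python
import math
from typing import List, Tuple


def attention_pool(
    x: List[List[float]], a: List[float], p: int, q: int
) -> Tuple[List[float], List[float]]:
    """Softmax-weighted pooling of the vectors x[p..q] (inclusive).

    Scores are dot products x_i . a; weights are a softmax over the span only.
    Returns (attention weights, pooled vector).
    """
    scores = [sum(xd * ad for xd, ad in zip(x[i], a)) for i in range(p, q + 1)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    pooled = [
        sum(alpha * x[p + k][d] for k, alpha in enumerate(alphas))
        for d in range(len(a))
    ]
    return alphas, pooled


# Toy hidden states; with a zero attention vector the weights are uniform.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
alphas, pooled = attention_pool(x, a=[0.0, 0.0], p=1, q=2)
```

With a non-zero attention vector, tokens whose representation aligns with it receive larger weights, which is how each subtask can focus on a different part of the candidate chunk.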
We treat the Event-Prediction task as an auxiliary task and then fuse its logits into each of the other slot-filling subtasks. The motivation is that a task-specific entity shall be present in a tweet only if the tweet belongs to its respective event.
To predict the label for the Event-Prediction task, we take the CT-Bert features of the [CLS] token and pass them through a MultiLayer Perceptron (MLP) to get the logits $h^{ces}$.
We fuse the $h^{ces}$ prediction into each subtask $S_j$ by adding it to $h^{S_j}$ (from (2)) to get the logits $h_f^{S_j}$. In practice, we share the parameters of $\mathrm{MLP}^{S_j}$ across all the slot-filling subtasks $S_j$. Given a tweet $t$ and slot $s$, our loss for the slot-filling model over the $n$ slot-filling subtasks $\{S_1, S_2, \ldots, S_n\}$ and the Event-Prediction task combines the softmax cross entropy (CE) losses of all these tasks, weighted by $\lambda_1$, where $y^{ces}$ is the ground truth label for the Event-Prediction task and $(y_1, y_2, \ldots, y_n)$ are the labels for the candidate slot $s$ of tweet $t$ for the subtasks $\{S_1, S_2, \ldots, S_n\}$. We keep $\lambda_1 = 1$.
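A rough sketch of the logit fusion and the multi-task loss is given below in plain Python. It makes two simplifying assumptions that are not confirmed by the text: the Event-Prediction logits are added directly to each subtask's logits (the intermediate MLP is omitted), and the weight `lam` is applied to the Event-Prediction term. Since the weight is kept at 1, the second assumption does not change the value of the loss.

```python
import math
from typing import List


def softmax_cross_entropy(logits: List[float], label: int) -> float:
    """Cross entropy of softmax(logits) against an integer class label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def fuse(h_task: List[float], h_ces: List[float]) -> List[float]:
    """Add the Event-Prediction logits to a subtask's logits (assumed direct sum)."""
    return [a + b for a, b in zip(h_task, h_ces)]


def slot_filling_loss(
    h_ces: List[float],
    y_ces: int,
    task_logits: List[List[float]],
    task_labels: List[int],
    lam: float = 1.0,
) -> float:
    """Sum of subtask CE losses on fused logits plus a weighted auxiliary CE loss."""
    loss = lam * softmax_cross_entropy(h_ces, y_ces)  # weight placement is an assumption
    for h_task, y in zip(task_logits, task_labels):
        loss += softmax_cross_entropy(fuse(h_task, h_ces), y)
    return loss


# Toy logits for one candidate slot and two subtasks.
loss = slot_filling_loss([0.0, 0.0], 1, [[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

With all-zero logits each binary cross-entropy term equals log 2, so the toy loss is three times log 2.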
Our preprocessing for this model is the same as that of the baseline.

Sentence classification
Our sentence classification model is shown in Figure 3. An attention-weighted pooling is done over the feature sequences from CT-Bert to extract the most relevant information. Formally, let $\{x_0, x_1, \ldots, x_n\}$ be the output vectors from CT-Bert (here $0$ and $n$ are the indices of [CLS] and [SEP] respectively). Then for any sentence classification subtask $C_j$, we get its pooled vector $x^{C_j}$ as follows:

$$x^{C_j} = \sum_{i=0}^{n} \alpha_i^{C_j} x_i \qquad (5)$$

where the attention weights $\alpha_i^{C_j}$ are computed from $x_i$ using $a^{C_j}$ and $c^{C_j}$, a trainable vector and scalar respectively.
For the Event-Prediction task, we take the CT-Bert vector representation of the [CLS] token and pass it through an MLP. Let the MLP's final and hidden states be $v^{ces}$ and $h^{ces}$ respectively.
Next, we incorporate information from the Event-Prediction task into each sentence classification subtask $C_j$. Since the sentence classification subtasks are not binary classification problems, we cannot merely add the Event-Prediction logits to all tasks, unlike in the slot-filling model. Additionally, we desire sentence-level event-specific features for each of the sentence-level predictions. Hence, we concatenate the hidden state features $h^{ces}$ from the MLP of the Event-Prediction task to the pooled vector $x^{C_j}$ from (5) to get the logits $h_f^{C_j}$ for each subtask $C_j$, as follows:

$$h_f^{C_j} = \left(W^{C_j}\right)^T \left[x^{C_j}; h^{ces}\right] + b^{C_j} \qquad (6)$$

Here $T$ denotes transpose and $[\,;\,]$ denotes vector concatenation. $W^{C_j}$ and $b^{C_j}$ are trainable.
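The concatenate-then-project step for a sentence classification subtask can be illustrated with the small Python sketch below. The function name, the weight layout (one row per input feature, one column per label) and the toy numbers are assumptions made for illustration.

```python
from typing import List


def concat_affine(
    x_pooled: List[float],
    h_ces: List[float],
    W: List[List[float]],
    b: List[float],
) -> List[float]:
    """Compute W^T [x_pooled; h_ces] + b.

    W has one row per input feature and one column per label, so the output
    has one logit per label of the subtask.
    """
    z = list(x_pooled) + list(h_ces)  # vector concatenation [x; h]
    return [
        sum(W[i][k] * z[i] for i in range(len(z))) + b[k] for k in range(len(b))
    ]


# Toy example: 2 pooled features, 1 event feature, 2 output labels.
logits = concat_affine(
    x_pooled=[1.0, 2.0],
    h_ces=[3.0],
    W=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    b=[0.5, -1.0],
)
```

Unlike the addition used for the binary slot-filling logits, this form works for any number of labels, because the output size is set by the number of columns of the weight matrix.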
Given a tweet $t$, our loss for the sentence classification model over the $m$ sentence classification subtasks $\{C_1, C_2, \ldots, C_m\}$ and the Event-Prediction task combines the softmax cross entropy (CE) losses of all these tasks, weighted by $\lambda_2$, where $y^{ces}$ is the ground truth label for the Event-Prediction task and $(y_1, y_2, \ldots, y_m)$ are the labels for tweet $t$ for the subtasks $\{C_1, C_2, \ldots, C_m\}$. We keep $\lambda_2 = 1$.
Preprocessing for sentence classification is done using the ekphrasis library (Baziotis et al., 2017). We remove emojis, URLs, emails and punctuation, and normalize the text by word segmentation, lower-casing and word decontraction.
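For readers who want a feel for these normalization steps, here is a standard-library-only Python stand-in. It does not use or reproduce the ekphrasis API, it omits word segmentation, and the regular expressions and the tiny contraction table are illustrative assumptions rather than the rules used in our system.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Rough emoji coverage: common pictograph blocks only (an approximation).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Tiny illustrative contraction table; specific forms come before generic ones.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'m": " am",
}


def preprocess(text: str) -> str:
    """Strip URLs, emails, emojis and punctuation; lower-case; expand contractions."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", " ", text)  # drop remaining punctuation
    return " ".join(text.split())  # collapse whitespace


# Invented example tweet, for illustration only.
cleaned = preprocess("I can't get tested!! \U0001F637 https://t.co/abc")
```

The order matters in this sketch: URLs and emails are removed before punctuation stripping so that their fragments do not survive as stray tokens.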

Experiments
All the experiments were performed using PyTorch (Paszke et al., 2019) and Hugging Face's transformers (Wolf et al., 2019). We use git and wandb (Biewald, 2020) for experiment tracking. Optimization is done using Adam (Kingma and Ba, 2014) with a learning rate of 2e-5. Slot-filling models are trained for 8 epochs and the sentence classification model for 10 epochs. Average training time per epoch on a Tesla P100 is ≈ 4 minutes for slot-filling and ≈ 30 seconds for sentence classification.
We use a 70-30 split for the train-valid sets. The valid set is used to obtain the best threshold for each of the slot classification tasks over the grid {0.1, 0.2, ..., 0.9}. We exclude labels with "No consensus" from our data. All the MLPs have 1 hidden layer and 0.1 dropout. $\mathrm{MLP}^{S_j}$ has a hidden size of 4 and LeakyReLU activation (Maas et al., 2013) with 0.1 negative slope; the rest of the MLPs have a hidden size of 50 and Tanh activation.
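The per-task threshold selection over the grid {0.1, 0.2, ..., 0.9} can be sketched as a simple search on validation predictions. The sketch below assumes that F1 on the valid set is the selection criterion and that ties are resolved in favour of the smallest threshold; both are assumptions, and the probabilities are toy values.

```python
from typing import List, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(probs: List[float], y_true: List[int]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing validation F1 over 0.1, 0.2, ..., 0.9."""
    grid = [round(0.1 * k, 1) for k in range(1, 10)]
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [int(p >= t) for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:  # strict '>' keeps the smallest threshold on ties
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation probabilities and labels for one slot-filling subtask.
threshold, f1 = best_threshold([0.95, 0.65, 0.35, 0.15], [1, 1, 0, 0])
```

Running the search separately for each subtask lets rare slots use a lower threshold than frequent ones.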

Results
Our performance on the held-out test set is shown in Table 3. Our system ranks 1st in the W-NUT 2020 Shared Task-3 (Zong et al., 2020). We also independently rank 1st for 3 of the 5 events: 'Can Not Test', 'Death', and 'Cure'. We now discuss our various experiments.

Slot-filling:
We experimented with a variety of architectures for the slot-filling model. Our (SF) is our Slot-Filling Model from §4.1. Our (SF) w/o pool is our slot-filling model that uses the CT-Bert hidden representation of the token <E> to classify instead of doing an attention-weighted pooling. Our (SF) w/o CES is our slot-filling model without the Event-Prediction task. CT-Bert and Bert-large are baseline models using CT-Bert and Bert-large instead of Bert-base. Table 4 shows the performance of these models. There is a considerable performance difference from using CT-Bert instead of Bert, demonstrating the benefits of domain-specific pre-training. Our (SF) w/o pool and Our (SF) w/o CES outperform CT-Bert, demonstrating the importance of the Event-Prediction task and of attention-weighted pooling over the slot chunk respectively. Our (SF), using CT-Bert with Event-Prediction and attention-weighted pooling, performs the best among these models.
Sentence level tasks: We experimented with various architectures for sentence level tasks. Our (SC) is our Sentence Classification architecture from §4.2. Our (SC) w/o CES is our Sentence Classification model without the Event-Prediction task. The Bert multitask model predicts using the [CLS] representation from Bert (Devlin et al., 2019). We also build an LSTM model (Hochreiter and Schmidhuber, 1997) with GloVe embeddings (Pennington et al., 2014) and Twitter tokenization using the WordTokenizers package (Kaushal et al., 2020).
Table 5 shows the performance of these architectures. Our (SC) outperforms the others on macro F1 and micro F1, followed by Our (SC) w/o CES. The performance difference between these two shows the benefits of including the Event-Prediction task, while the performance difference between CT-Bert multitask and Our (SC) w/o CES shows the gains from attention-weighted pooling. CT-Bert also outperforms Bert multitask, showing its usefulness in our proposed system over using Bert. Lastly, Bert multitask and all the models using Bert/CT-Bert outperform LSTM by a very large margin, demonstrating the superiority of these pretrained language models.
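Since the comparisons are reported in both micro and macro F1, the difference between the two averages can be illustrated with a short Python sketch. The per-subtask counts below are invented for illustration and are not results from our experiments.

```python
from typing import Dict, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) over subtasks given (tp, fp, fn) per subtask."""
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)  # pool the counts before computing F1
    return micro, macro


# Invented counts: one frequent subtask and one rare, harder subtask.
micro, macro = micro_macro_f1({"who": (90, 10, 10), "employer": (1, 3, 3)})
```

Micro F1 is dominated by the frequent subtask, whereas macro F1 weights every subtask equally, so subtasks with few positive labels pull the macro score down more strongly.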
Separate sentence classification and slot-filling models: Consider Bert separate, a simple system treating the two categories of tasks separately. It has the Bert baseline as its slot-filling model and a simple Bert sentence classifier using features from [CLS] for sentence prediction. Bert separate does not have the event-prediction auxiliary task or any attention-weighted pooling. Table 6 shows the performance of Bert separate against the baseline. Bert separate outperforms the Bert baseline by a considerable margin, thus showing the importance of treating the two subtasks differently.

Model          Micro F1   Macro F1
Bert Separate  .631       .545
Bert Baseline  .608       .512

Table 6: Results comparing the systems treating the sentence classification and slot-filling subtasks separately vs. those treating them similarly. We report results on the valid set across all the subtasks of both categories across the 5 events.

Conclusion and Future Work
In this paper, we presented our system that secured 1st position in the WNUT-2020 Shared Task-3 on Extracting COVID Entities from Twitter. We divided the event-specific subtasks into slot-filling and sentence classification subtasks, building separate architectures for the two. For both architectures, we used COVID-Twitter Bert, attention-weighted pooling over chunk spans or the sentence, and fused logits and features from the auxiliary Event-Prediction task. Our ablation studies demonstrated the usefulness of each component in our system.
There is considerable scope for improvement on subtasks with few positive labels. Pretraining on relevant data (such as COVID-misinformation datasets for the cure event) is a promising direction.
Another direction would be to reduce the training and inference time of the slot-filling model by not enclosing the candidate chunk within the special start <E> and special end </E> tokens. We can instead use the attention-weighted pooling directly over the candidate slot chunks. This will reduce the number of Bert forward passes per tweet from O(k) to O(1), where k is the number of candidate chunks in a tweet.
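The idea of encoding a tweet once and pooling over every candidate span can be sketched as below. The `CountingEncoder` class is a toy stand-in that only counts forward passes and returns made-up two-dimensional features; it is not a Bert model, and the helper names and spans are hypothetical.

```python
import math
from typing import List, Tuple


class CountingEncoder:
    """Toy stand-in for a Bert-style encoder that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, tokens: List[str]) -> List[List[float]]:
        self.calls += 1
        # Made-up features (token length, position); not real Bert outputs.
        return [[float(len(tok)), float(i)] for i, tok in enumerate(tokens)]


def pool_span(hidden: List[List[float]], a: List[float], p: int, q: int) -> List[float]:
    """Attention-weighted pooling of hidden[p..q] with attention vector a."""
    scores = [sum(h * w for h, w in zip(hidden[i], a)) for i in range(p, q + 1)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [
        sum(e / z * hidden[p + k][d] for k, e in enumerate(exps))
        for d in range(len(a))
    ]


def pool_all_candidates(
    encoder: CountingEncoder,
    tokens: List[str],
    spans: List[Tuple[int, int]],
    a: List[float],
) -> List[List[float]]:
    """Encode the tweet once, then pool over every candidate span."""
    hidden = encoder(tokens)  # single forward pass, independent of len(spans)
    return [pool_span(hidden, a, p, q) for p, q in spans]


encoder = CountingEncoder()
tokens = "my uncle tested positive in ohio".split()
pooled = pool_all_candidates(encoder, tokens, [(0, 1), (5, 5), (2, 3)], a=[0.1, 0.0])
```

In this sketch the number of encoder calls stays at one regardless of how many candidate spans the tweet has, which is the saving the proposed direction aims for.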

Figure 1: An example tweet from the tested-negative event.

Figure 2: Slot-Filling Model, described in §4.1. Here n is the number of slot-filling subtasks.

Figure 3: Sentence Classification model, described in §4.2. Here m is the number of Sentence Classification subtasks.

Table 1: Dataset statistics, scraped during early July.

Event            Sentence Classification   Slot-Filling
Tested positive  gender, relation          who, age, recent-visit, when, where, employer, c.-contact
Tested negative  gender, relation          who, age, when, where, duration, close-contact

Table 2: Event-specific subtasks split into two subtask types: slot-filling and sentence classification.

Table 3: Micro-averaged scores on the held-out test set for our final submission.

Table 4: Results of slot-filling models on our 70-30 split. We report results on the valid set across all slot-filling subtasks across the 5 events.

Table 5: Results of sentence classification models on our 70-30 split. We report results on the valid set across all sentence classification subtasks across the 5 events.