Emory at WNUT-2020 Task 2: Combining Pretrained Deep Learning Models and Feature Enrichment for Informative Tweet Identification

This paper describes the system developed by the Emory team for the WNUT-2020 Task 2: "Identification of Informative COVID-19 English Tweets". Our system explores three recent Transformer-based deep learning models pretrained on large-scale data to encode documents. Moreover, we developed two feature enrichment methods to enhance document embeddings by integrating emoji embeddings and syntactic features into deep learning models. Our system achieved an F1-score of 0.897 and an accuracy of 90.1% on the test set, and ranked in the top third of all 55 teams.


Introduction
As of August 31, 2020, the COVID-19 outbreak had caused nearly 25 million confirmed cases worldwide, including about 800K deaths (WHO, 2020). Recently, much attention has been paid to building monitoring systems that track the development of COVID-19 by aggregating related data from different sources (e.g., the Johns Hopkins Coronavirus Dashboard and the WHO Coronavirus Disease Dashboard). One potentially important source of information is social media, such as Twitter and Reddit, which provides real-time updates on the COVID-19 outbreak. To deal with the massive amount of social media data, several systems have been developed to detect and extract COVID-19 related information from Twitter (Banda et al., 2020; Sarker et al., 2020). However, only a minority of the collected data contains relevant information (e.g., recovered, suspected, confirmed, and death cases, as well as the location or travel history of the cases) that is useful for monitoring systems, and it is costly to identify this informative data manually. To help address the problem, WNUT-2020 Task 2 (Nguyen et al., 2020) focuses on attracting research efforts to create text classification systems that can identify whether a COVID-19 English Tweet is informative or not. This could lead to automated information extraction systems for COVID-19 and benefit the development of relevant monitoring systems.
This paper describes the system developed by the Emory team for this task. Our solution explores three recent Transformer-based models, which are pretrained on large-scale data and have achieved great success on different NLP tasks. The pretrained models convert the input document into an embedding matrix, and the embedding of the first token (i.e., [CLS]) is regarded as the document embedding. We also propose two methods that integrate pretrained deep learning models with empirical features by enriching document embeddings with emoji embeddings and syntactic features.
The document embedding is fed into a normalization layer and an output layer, which are fine-tuned together with the encoder during the training phase. The output layer produces a probability between 0 and 1 for each class, and the class with the highest probability is chosen. The highest F1-score of our system is 0.897, ranking 14th of 55 teams on the leaderboard.

Related Work
Recently, deep learning models pretrained on large-scale data, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and SpanBERT (Joshi et al., 2020), have improved the performance of many downstream NLP tasks such as named entity recognition, semantic role labeling, emotion detection, and question answering. Although these models are trained on open-domain data such as news articles and English Wikipedia, several approaches have been investigated to apply pretrained deep learning models to medical domain tasks. Matero et al. (2019) proposed a dual-context neural model that combines lexical features and personality factors with BERT embeddings for suicide risk assessment. Other work combined BERT embeddings with a BiLSTM (Bidirectional Long Short-Term Memory) network to improve the performance of medical text inference. Roitero et al. (2020) applied machine learning models and BERT to identify medical domain tweets and reported that BERT significantly outperforms traditional machine learning models such as logistic regression and support vector machines on their task. Encouraged by this progress, our work uses recent pretrained deep learning models as document encoders and fine-tunes the classification model on the dataset provided by the task organizers. Moreover, we develop two feature enrichment methods to enhance the document embeddings generated by pretrained deep learning models.

System Description
Figure 1 shows the overall framework of our model for the classification task. In this framework, we use the three Transformer-based models described in Section 3.1 as encoders to generate the document embedding e_d, which is the first token (i.e., [CLS]) embedding in our model. The document embedding is normalized by layer normalization and fed into an output layer with Softmax activation. We also implement two feature enrichment methods that enhance document embeddings by integrating emoji embeddings and syntactic features (described in Section 3.2).
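The normalization and output steps described above can be illustrated with a small pure-Python sketch. It is not the system's PyTorch implementation: the toy embedding, the weights, the class index order, the epsilon value, and the absence of learned gain and bias in the normalization are all assumptions made for illustration.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(doc_embedding, weights, biases):
    """Apply layer normalization, a linear output layer, and softmax."""
    normed = layer_norm(doc_embedding)
    logits = [sum(w * x for w, x in zip(row, normed)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    # The predicted class is the one with the highest probability.
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return probs, predicted

# Toy 4-dimensional "document embedding" and a 2-class output layer (made-up values).
toy_embedding = [0.5, -1.0, 2.0, 0.1]
toy_weights = [[0.2, -0.1, 0.4, 0.0],   # class 0: uninformative (assumed order)
               [-0.3, 0.2, 0.1, 0.5]]   # class 1: informative (assumed order)
toy_biases = [0.0, 0.0]
probs, predicted = classify(toy_embedding, toy_weights, toy_biases)
```

In the real system the embedding would come from the encoder's [CLS] position and have the encoder's hidden size rather than four dimensions.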

Document Encoder
RoBERTa: Devlin et al. (2019) developed a Transformer-based model named BERT that has achieved great success in several NLP tasks. Liu et al. (2019) later released a pretrained model named RoBERTa, which uses the same model architecture as BERT but is trained on different datasets and pretraining tasks. We use RoBERTa-large as the encoder, which outperforms BERT-large on different NLP tasks.
XLNet: Yang et al. (2019) implemented XLNet, a generalized autoregressive method that incorporates the autoregressive model into pretraining and optimizes the language model objective using a permutation method, which can overcome the limitation of the masked language model in BERT. We use XLNet-base as the encoder, which has been reported to outperform BERT and RoBERTa on some NLP tasks.
ALBERT: Lan et al. (2020) proposed a variant of BERT named ALBERT that applies parameter reduction techniques to reduce memory consumption and to accelerate the training process. ALBERT improves the training efficiency of BERT by factorizing embedding parameters and sharing cross-layer parameters. We use ALBERT-xxlarge as the encoder, which has also achieved state-of-the-art results on several NLP tasks.

Figure 1: Overall framework of our model (feature enrichment, layer normalization, and output layer).

Feature Enrichment
Emoji Embedding: Emojis can succinctly represent emotional expressions and have become popular in social media. Several recent studies have shown that incorporating emoji information into deep learning models can benefit tweet classification tasks (Singh et al., 2019; Rangwani et al., 2018). Inspired by this, we employ a pretrained emoji encoder named emoji2vec (Eisner et al., 2016) to convert emojis into emoji embeddings. The emoji embeddings are fed into a fully connected layer with Tanh activation, and the output is concatenated to the [CLS] token embedding to form the document embedding. If multiple emojis appear in one tweet, their embeddings are concatenated into one fixed-length vector.
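One way to obtain the fixed-length emoji vector mentioned above is sketched below. The limit of 3 emojis per tweet and the embedding dimension of 300 are the values reported in the Training section; dropping extra emojis and zero-padding missing slots are assumptions, and the function name `build_emoji_vector` is hypothetical.

```python
def build_emoji_vector(emoji_vectors, max_emojis=3, dim=300):
    """Concatenate up to max_emojis embeddings into one vector of length max_emojis * dim.

    Extra emojis are dropped and missing slots are zero-padded (both assumptions).
    """
    fixed = []
    for vec in emoji_vectors[:max_emojis]:
        if len(vec) != dim:
            raise ValueError(f"expected dimension {dim}, got {len(vec)}")
        fixed.extend(vec)
    # Pad with zeros so every tweet yields a vector of the same length.
    fixed.extend([0.0] * (max_emojis * dim - len(fixed)))
    return fixed

# Toy example with 2-dimensional vectors instead of 300-dimensional ones.
one_emoji = build_emoji_vector([[0.1, 0.2]], max_emojis=3, dim=2)
# one_emoji == [0.1, 0.2, 0.0, 0.0, 0.0, 0.0]
```

A tweet without emojis then maps to an all-zero vector, which keeps the input size of the fully connected layer constant.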
Syntactic Feature: Syntactic features have been used in many NLP tasks (van der Lee and van den Bosch, 2017; Kurtz et al., 2019; Jabreel and Moreno, 2017), and syntactic dependencies often carry key information relevant to sentence topics. To utilize syntactic features, we collected a set of COVID-19 related keywords and used the Stanford Parser (Chen and Manning, 2014) to extract their governors (also known as heads), i.e., the words that hold grammatical relations with the keywords. The governors can indicate which aspects of the keywords are discussed in tweets. The governor embedding and the [CLS] token embedding are then fed into a self-attention layer (Vaswani et al., 2017), and the new [CLS] token embedding at the output is regarded as the document embedding.
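The governor lookup can be sketched as follows, assuming the dependency parse is already available as a list of head indices. This input format, the keyword set, and the function name `first_keyword_governor` are hypothetical and do not reflect the Stanford Parser's actual output format; only the first keyword is considered, as stated in the Training section.

```python
def first_keyword_governor(tokens, heads, keywords):
    """Return the governor (head word) of the first keyword found in the tweet.

    tokens:   list of words
    heads:    heads[i] is the index of the governor of tokens[i], or -1 for the root
    keywords: set of lowercase keywords
    Returns None if no keyword is present or the keyword is the root.
    """
    for i, token in enumerate(tokens):
        if token.lower() in keywords:
            head = heads[i]
            return tokens[head] if head >= 0 else None
    return None

# Hand-written toy parse (not actual parser output).
tokens = ["Officials", "confirmed", "three", "coronavirus", "cases"]
heads = [1, -1, 4, 4, 1]
keywords = {"coronavirus", "covid-19"}
governor = first_keyword_governor(tokens, heads, keywords)  # "cases"
```

Here the governor "cases" suggests that the tweet discusses case counts, which is the kind of aspect information the feature is meant to capture.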

Ensemble Model
To improve the robustness of our final system, we apply ensemble techniques to combine the results of different models. For each individual model, we take the class with highest output probability as the inference result during the testing phase. The inference results of all models are combined using a majority vote strategy, which is to return the class predicted by most models.
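A minimal sketch of the majority vote strategy is given below. The model outputs are made up, and the encoding of labels as 1 (informative) and 0 (uninformative) is an assumption. With five models and two classes, ties cannot occur.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models for one tweet."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label

def ensemble_predict(per_model_predictions):
    """Combine per-model prediction lists (one list per model) tweet by tweet."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]

# Five hypothetical models, three tweets; 1 = informative, 0 = uninformative.
model_outputs = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]
combined = ensemble_predict(model_outputs)  # [1, 0, 0]
```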

Data Preprocessing
The dataset contains 10K COVID-19 English tweets, of which 4719 are labeled as informative and 5218 are labeled as uninformative. To clean the tweet data, we used the open source tool preprocess-twitter, which includes lowercasing and normalizing numbers, hashtags, capitalized words, and repeated letters. The dataset had been split into training, validation, and test sets with pre-specified sizes. Because the test set was not available when we developed our system, we re-split the training set into a new training set and a new validation set with a 90/10 ratio, and used the released validation set as the test set. The statistics for the data split are shown in Table 1.
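The 90/10 re-split can be sketched as below. Whether the original split was shuffled, stratified, or seeded is not stated, so the shuffling and the seed value here are assumptions, and `resplit` is a hypothetical helper name.

```python
import random

def resplit(examples, train_fraction=0.9, seed=42):
    """Shuffle a copy of the examples and split it into train and validation parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data: (text, label) pairs standing in for the released training set.
toy_examples = [(f"tweet {i}", i % 2) for i in range(20)]
new_train, new_valid = resplit(toy_examples)
# 18 examples end up in new_train and 2 in new_valid
```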

Training
Our system is implemented in PyTorch and Python 3. We develop five classification models, of which three use different encoders to generate document embeddings, and two apply emoji embeddings and syntactic features, respectively, to enrich document embeddings encoded by RoBERTa. Among these models, the batch size of ALBERT is 16, and that of the other models is 32. For the model applying emoji embeddings, the dimension of emoji embeddings is 300, and the maximum number of emojis in one tweet is 3. For the model applying syntactic features, only the governor of the first keyword is considered in order to control model complexity.
The other hyperparameters, listed in Table 2, are the same for all models. Each model is trained separately on the training set and evaluated on the validation set during the training phase. Each experiment is run 3 times with different random initializations, and the model that achieves the highest accuracy on the validation set is selected for the testing phase.

Task Results
We use F1-score as the primary metric and also report the accuracy of each model. As we can observe, the F1-score and accuracy of the ensemble model are marginally higher than those of any individual model. This indicates that the document embeddings generated by the individual models can encode information from different perspectives and complement each other.

Table 4: Sample tweets.
Albert Uderzo died in his sleep at his home in Neuilly, after a heart attack that was not linked to the coronavirus, his son-in-law Bernard de Choisy told the AFP news agency' #asterix The Indian Express: New York Zoo tiger tests positive for coronavirus: Are cats at particular risk?. HTTPURL A group of humans knows everything about this virus. Others don't know anything about it. Facsism kills in many subtle ways. Trump got away with 3,000 deaths in Puerto Rico. He's going to try it again with coronavirus. Time for all members of the media to ovary up and start reporting on the dangerous lies of this regime.

Error Analysis
To investigate how our system can be improved, we conducted an error analysis on the released validation set. We found that 56 out of 84 error instances were false positives (i.e., uninformative tweets misclassified as informative), and the remaining 28 instances were false negatives (i.e., informative tweets misclassified as uninformative). We observe that most of the false positives include numbers, locations, and personal names, which can be indicators of informative tweets, as in the samples presented in Table 4. These indicators can act as noise that confuses the model and leads to misclassification of negative instances. This suggests that we still need to improve the ability of our system to understand the context of indicative components in tweets.
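With the counts reported above, false positives account for 56/84, i.e., two thirds of the errors. The sketch below shows how such a breakdown, together with precision, recall, F1-score, and accuracy, can be computed from gold and predicted labels. The label lists are made up, and treating the informative class as the positive class with label 1 is an assumption.

```python
def confusion_counts(gold, predicted, positive=1):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1  # uninformative tweet predicted as informative
        elif g == positive:
            fn += 1  # informative tweet predicted as uninformative
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels for six tweets.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 1]
tp, fp, fn, tn = confusion_counts(gold, pred)        # 2, 2, 1, 1
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(gold)                     # 0.5
```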

Ablation Study
Given the multiple models developed here, we conduct an ablation study to see how each individual model affects the ensemble model. The performance of the ensemble models obtained by removing one of the five individual models at a time is shown in Table 5. As we can see, removing any one of the individual models increases the precision and decreases the recall of the ensemble model. This indicates that ensemble modeling can better identify negative instances than positive instances. Another notable result is that recall drops more when RoBERTa is removed than when any other model is removed. Combined with the classification results in Table 3, this suggests that RoBERTa might contribute most to the high recall of the ensemble model. It is interesting to note that RoBERTa+Emo is the best individual model but contributes the least to the ensemble model. A possible reason is that only 9% of the training data contains emojis, so the emoji features are insufficiently learned during training, which may leave the document embedding of RoBERTa+Emo not much different from that of RoBERTa.
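The leave-one-out procedure can be sketched as follows. Removing one of five models leaves four voters, so ties become possible. How ties were resolved is not stated, and sending them to the uninformative class (label 0) is purely an assumption of this sketch. The predictions are made up, and "RoBERTa+Syn" is a hypothetical name for the syntactic-feature model.

```python
def vote(votes, tie_label=0):
    """Majority vote over binary labels; ties go to tie_label (an assumption)."""
    positives = sum(votes)
    negatives = len(votes) - positives
    if positives == negatives:
        return tie_label
    return 1 if positives > negatives else 0

def leave_one_out_ensembles(model_predictions):
    """Map each model name to the ensemble predictions obtained without that model."""
    results = {}
    for left_out in model_predictions:
        kept = [preds for name, preds in model_predictions.items() if name != left_out]
        results[left_out] = [vote(list(column)) for column in zip(*kept)]
    return results

# Made-up predictions for three tweets; 1 = informative, 0 = uninformative.
toy_predictions = {
    "RoBERTa": [1, 0, 1],
    "XLNet": [1, 0, 0],
    "ALBERT": [0, 0, 1],
    "RoBERTa+Emo": [1, 1, 0],
    "RoBERTa+Syn": [1, 0, 0],
}
ablation = leave_one_out_ensembles(toy_predictions)
# ablation["RoBERTa"] == [1, 0, 0]
```

Each entry of `ablation` can then be scored against the gold labels to fill one row of an ablation table.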


Conclusion
This paper describes the system developed by the Emory team for the WNUT-2020 Task 2: "Identification of Informative COVID-19 English Tweets", including system design, implementation, evaluation, and analysis. We explored three pretrained deep learning models to encode documents, and developed two feature enrichment methods to enhance document embeddings by integrating emoji embeddings and syntactic features. For future work, we will continue to develop our system by exploiting other feature enrichment methods, utilizing external knowledge bases, and investigating other pretrained deep learning models.