IIITBH at WNUT-2020 Task 2: Exploiting the best of both worlds

In this paper, we present the IIITBH team's effort to solve the second shared task of the 6th Workshop on Noisy User-generated Text (W-NUT), i.e., Identification of Informative COVID-19 English Tweets. The central theme of the task is to develop a system that automatically identifies whether an English Tweet related to the novel coronavirus (COVID-19) is Informative or not. Our approach is based on exploiting the semantic information from both max pooling and average pooling; to this end, we propose two models.


Introduction
The COVID-19 pandemic started in Wuhan, China, in December 2019, caused by infection with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2); this dangerous virus has been spreading around the world since then. The COVID-19 pandemic continues to have a devastating effect on the health and well-being of the global population. It is creating fear and panic for people all around the world, while a vaccine will hopefully bring the situation under control soon. To track the development of the outbreak and to provide users with information related to the virus, e.g., any new cases in the user's region, there is a strong need for real-time monitoring systems built on social network data such as Twitter. However, manual approaches to identify informative Tweets require significant human effort and are thus costly. To help handle this problem, WNUT shared task 2 (Nguyen et al., 2020) asks participants to build systems that automatically identify whether a COVID-19 English Tweet is informative or not. Such informative Tweets provide information about recovered, suspected, confirmed, and death cases as well as the location or travel history of the cases.
Pooling-based recurrent neural architectures consistently outperform their counterparts without pooling (Maini et al., 2020). However, the reasons for their enhanced performance are largely unexamined. In this work, we examine how the two most commonly used pooling techniques (mean pooling, or average pooling, and max pooling) perform on WNUT-2020 shared task 2, and develop two novel systems exploiting the semantic features of both techniques.

Data
The dataset consists of a total of 10,000 tweets split into training, validation, and test sets in a 70/10/20 ratio, respectively. A detailed breakdown of the data is shown in Table 1. The maximum and minimum tweet lengths in the test data are 64 and 8, respectively. The distribution of tweet lengths in the test dataset is illustrated in Figure 1.

Proposed Methodology
In our proposed architecture, we aim to leverage the semantic information from both pooling layers to identify whether a given tweet is informative or not. In this section, we describe our method (the base model is illustrated in Figure 2) and elaborate on each part in detail.

Bidirectional LSTM
A recurrent neural network (RNN) is a form of neural network which maintains a memory based on history information. RNNs are good at sequential prediction, but the problem of exploding or vanishing gradients makes learning long-distance dependencies very difficult for them (Hochreiter, 1998). The LSTM architecture was proposed to address this problem (Hochreiter and Schmidhuber, 1997). A bidirectional LSTM uses features coming from both the previous hidden states and the future hidden states. This structure allows the network to have both forward and backward information about the sequence at every time step, which helps the language model understand the context better (Schuster and Paliwal, 1997). Formally, at time $t$, the memory $c_t$ and the hidden state $h_t$ are updated with the following equations.
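The update equations referenced above are the standard LSTM formulation of Hochreiter and Schmidhuber (1997), which can be written as:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.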
where $x_t$ is the input at time step $t$. A bidirectional LSTM contains two separate LSTMs to capture both past and future inputs: one LSTM network encodes the sentence from left to right, and the other from right to left.
Thus, for each time step $t$, we obtain two representations, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$; finally, these two representations are concatenated to form the final output, $h_T$.
To simplify the information in the output of the Bi-LSTM layer (passed through the activation function), pooling layers are used. A pooling layer is a down-sampling method which reduces the number of parameters of the feature map while retaining the important information. Different pooling types exist, such as average, max, and sum; the most common types are max pooling and average pooling.
Let $S$ be an input tweet, where $x_t$ is the representation of the input word at position $t$. A recurrent neural network such as a Bi-LSTM produces a hidden state $h_T$ (equation 7).

Average Pooling
Average pooling weighs down the activations by combining the non-maximal activations (Passricha and Aggarwal, 2019). One-dimensional average pooling can be expressed as $\xi_{ap,j} = \frac{1}{w}\sum_{i=(j-1)w+1}^{jw} h_i$, where $w$ is the width of the pooling window. The use of a global average pooling ($\xi_{ap}$) layer as the last layer was proposed by Lin et al. (2013), and had its breakthrough in the well-known image recognition system, the residual network (ResNet) (He et al., 2015).

Max Pooling
Max pooling extracts only the maximum activations (Passricha and Aggarwal, 2019), independent of their distribution. One-dimensional max pooling can be expressed as $\xi_{mp,j} = \max_{(j-1)w < i \le jw} h_i$, where $w$ is the width of the pooling window. Global max pooling ($\xi_{mp}$) was proposed for weakly-supervised learning (Oquab et al., 2014) and is also used in PHOCNet for the task of word spotting (Sudholt and Fink, 2016).

From Sections 3.2 and 3.3 we know that max pooling identifies only the maximum activations, irrespective of their distribution and frequency, whereas average pooling focuses on distribution and frequency, irrespective of the maximum values. To leverage both types of information, we propose the following two models (Sections 3.4 and 3.5).
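The contrast between the two global pooling operators can be sketched in plain NumPy over a toy matrix of Bi-LSTM hidden states (the matrix values and variable names here are illustrative, not from our system):

```python
import numpy as np

# Toy matrix of Bi-LSTM hidden states: T = 4 time steps, d = 3 features.
H = np.array([[0.1, 0.9, 0.2],
              [0.4, 0.1, 0.8],
              [0.7, 0.3, 0.1],
              [0.2, 0.5, 0.6]])

# Global max pooling: keeps only the strongest activation per feature,
# irrespective of how the remaining activations are distributed.
xi_mp = H.max(axis=0)   # -> [0.7, 0.9, 0.8]

# Global average pooling: reflects the distribution and frequency of
# activations, irrespective of where the maximum occurs.
xi_ap = H.mean(axis=0)  # -> [0.35, 0.45, 0.425]

print(xi_mp, xi_ap)
```

Note that the two vectors carry different information about the same sequence, which is precisely what our two models try to combine.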

Model I
In Model I, we simply concatenate the global max pooling and global average pooling layers (equation 15). Though this can be considered a naive model, previous works (Nguyen et al., 2018; Sun et al., 2018; Tu et al., 2017) suggest that feature concatenation improves the performance of systems, and our results support this intuition.
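Model I's feature construction can be sketched as follows; this is a minimal NumPy sketch rather than our Keras implementation, and the weight names and random inputs are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model_one_features(H):
    """Concatenate global max pooling and global average pooling
    of the Bi-LSTM hidden states H (shape: T x d)."""
    xi_mp = H.max(axis=0)                  # global max pooling
    xi_ap = H.mean(axis=0)                 # global average pooling
    return np.concatenate([xi_mp, xi_ap])  # shape: 2d

# Illustrative classifier head: a single dense layer with sigmoid output.
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 128))         # 50 time steps, 128-dim states
features = model_one_features(H)           # 256-dim concatenated vector
W, b = rng.standard_normal(256), 0.0       # hypothetical dense weights
y_hat = sigmoid(W @ features + b)          # probability of "Informative"
```

The concatenation doubles the feature dimension, so the dense layer sees both the peak activations and the distributional summary at once.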

Model II
In Model II, we intend to use information such as distribution and frequency from average pooling to understand the context better, while the max-pooling layer attempts to find the most important latent semantic factors in the tweet (Lai et al., 2015). First, we compute the dot product of the global average pooling and global max pooling outputs (equation 17), and then multiply the result with the global average pooling output (equation 18), where $\xi_{mp}$ and $\xi_{ap}$ are the global max pooling and global average pooling outputs from equations 14 and 11, respectively. The prediction $\hat{y}$ is obtained through a sigmoid activation, where $\sigma(z) = \frac{1}{1+e^{-z}}$.
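Since equations 17 and 18 are not reproduced here, the following NumPy sketch shows one literal reading of the description above (a scalar dot product of the two pooled vectors, scaled back onto the average-pooled vector); the names and random inputs are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model_two_features(H):
    """One reading of Model II (eqs. 17-18): dot product of the two
    pooled vectors, then multiplication with the average-pooled vector."""
    xi_mp = H.max(axis=0)         # global max pooling
    xi_ap = H.mean(axis=0)        # global average pooling
    score = np.dot(xi_ap, xi_mp)  # eq. 17: scalar dot product
    return score * xi_ap          # eq. 18: scale the average-pooled vector

rng = np.random.default_rng(1)
H = rng.standard_normal((50, 128))   # 50 time steps, 128-dim states
features = model_two_features(H)     # 128-dim vector
W = rng.standard_normal(128)         # hypothetical dense weights
y_hat = sigmoid(W @ features)        # probability of "Informative"
```

Unlike Model I, this composition keeps the feature dimension unchanged: the max-pooled information only modulates the magnitude of the average-pooled representation.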
Note: For both models, we used binary cross entropy as our loss function. We submitted results of both systems (Model I & Model II).
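The binary cross-entropy loss used for both models is $L = -\frac{1}{N}\sum_{n}\left[y_n \log \hat{y}_n + (1-y_n)\log(1-\hat{y}_n)\right]$; a minimal NumPy sketch (with clipping for numerical stability, an implementation detail not stated in the paper) is:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from
    0 and 1 to avoid taking log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

# Toy labels (1 = Informative) and predicted probabilities.
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])
loss = binary_cross_entropy(y_true, y_pred)  # ~0.1446
```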

Experimental Setup
Our models are implemented in TensorFlow 2 and Keras. We use a batch size of B = 500 and train our neural networks for 25 epochs with the Adam optimizer. A dropout and recurrent dropout of 0.25 is used. The complete code is made available on GitHub. During the pre-processing stage we removed all unwanted symbols and user mentions. The large uncased BERT model is employed for obtaining tweet embeddings. We also analysed how the accuracy and loss of max pooling and average pooling change with the number of epochs under different contextual embeddings (Devlin et al., 2019; Peters et al., 2018; Yang et al., 2019); the complete code and plots are uploaded in our repository.

Results
In order to illustrate the efficacy of our proposed methods, we compare the results with simple average pooling and max pooling on the validation set in Table 2. The results in Table 2 are the average of 5 runs of each model. From this table we can infer that our proposed models (Sections 3.4 and 3.5) work better than simple average or max pooling. In Figures 3 and 4 we illustrate how the loss varies with the number of epochs on the validation data. The results of our proposed models on the test data are shown in Table 3; we achieved an F1 score of 0.7979 using Model II and 0.7932 using Model I.

Conclusion
In this paper we presented our system for the WNUT-2020 shared task on "Identification of Informative COVID-19 English Tweets". Traditional text classification models mainly focus on three topics: feature engineering, feature selection, and using different types of machine learning algorithms. Our main goal in this paper is to show how we can leverage different pooling methods on top of a Bi-LSTM, without using any human-engineered features, to improve the efficacy of a system. We believe the performance of our system can be further improved by tweaking hyper-parameters. In future work we would like to explore how our models perform with different attention mechanisms (Vaswani et al., 2017) on tasks such as relation classification (Zhou et al., 2016), image captioning, and machine translation (Bahdanau et al., 2014).