Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup

Distinguishing informative and actionable messages from a social media platform like Twitter is critical for facilitating disaster management. For this purpose, we compile a multilingual dataset of over 130K samples for multi-label classification of disaster-related tweets. We present a masking-based loss function for partially labelled samples and demonstrate the effectiveness of Manifold Mixup in the text domain. Our main model is based on Multilingual BERT, which we further improve with Manifold Mixup. We show that our model generalizes to unseen disasters in the test set. Furthermore, we analyze the capability of our model for zero-shot generalization to new languages. Our code, dataset, and other resources are available on GitHub.


Introduction
In times of disaster, affected individuals often turn to social media platforms, such as Twitter or Facebook, to express their feelings about the disaster, update friends and relatives on their status, request help or supplies, or report useful information to the disaster response teams. Response organizations can use social media to increase situational awareness by providing information about disaster status, ongoing rescue operations, and disaster warnings (Palen and Hughes, 2018). However, the low entry barrier of social media platforms, where everybody can post their own "news" in real time, leads to information overload, making it hard for users to find relevant and useful information (Reuter et al., 2018). Thus, it is crucial to filter out the non-informative messages, and to distinguish among different categories of informative messages to ensure that a message reaches its target users. In turn, this can help facilitate disaster response and increase situational awareness.
Towards this goal, in recent years, many works have focused on disaster-related tweet classification (Alam et al., 2018b; Mazloom et al., 2018; Nguyen et al., 2017; Li et al., 2017; Neppalli et al., 2018; Caragea et al., 2016, 2011). However, most of these works have focused on the classification of English tweets only, with a few notable exceptions (Musaev and Pu, 2017; Khare et al., 2018; Lorini et al., 2019; Torres et al., 2019). We stress that there are many disaster-prone non-English-speaking countries, which could benefit from a multilingual classifier that can be used in real time to identify useful information on social media. Furthermore, there is a lack of a large-scale standard multilingual disaster-related dataset for multi-label classification with diverse disaster types. Against this background and these needs, we make the following contributions: 1. We aggregate existing datasets into a large disaster dataset using a new annotation scheme. Furthermore, by utilizing a class mask (elaborated in Section 4.1), we make use of both binary-classification data and multi-class classification data in the same training phase.
2. We explore Manifold Mixup (Verma et al., 2019) in the natural language-based disaster domain. Manifold Mixup is a regularization technique originally introduced in computer vision tasks.
3. We employ Multilingual BERT (Devlin et al., 2019) to train multilingual classifiers. We demonstrate its generalization to unseen disasters and its zero-shot transferability to languages not present in the training data.

Related Work
There are numerous prior works on disaster-related tweet classification. For example, Imran et al. (2013) focus on classifying and extracting actionable information from disaster-related tweets, assuming that sufficient labeled tweets from the ongoing disaster are available for model training. Later, Imran et al. (2016b) explore real-time classification of tweets from a target disaster using models trained on past disasters. Nguyen et al. (2017) introduce a Convolutional Neural Network that performs robustly even on out-of-event data during inference. Other works explore domain adaptation that uses labeled tweets from past disasters and unlabeled tweets from an ongoing disaster (Alam et al., 2018a). Kruspe (2019) takes a few-shot learning approach, in which a disaster-specific model is trained using only a few (around 10) examples for disaster-related tweet detection. In contrast, we train a universal model on diverse disaster types for fine-grained classification and show that it performs remarkably well on unseen disaster types without further training (specifically, it achieves zero-shot generalization to unseen events). Wang and Lillis (2019) classify actionable tweets using ELMo contextual word embeddings, whereas Ma (2019) uses a monolingual BERT-based model for disaster-related tweet classification. In contrast, we work with a multilingual model, which we compare with multiple baselines and augment with Manifold Mixup. Regarding cross-lingual approaches, Dittrich and Lucas (2014) present a real-time application tool for multilingual tweet classification and disaster detection. However, this tool requires a long training phase with tweets from specific areas for robust detection, and its multilingual classifier filters messages based on shallow matching of pre-selected keywords (and their translations). Musaev and Pu (2017) construct a multilingual model for tweet classification using multilingual Wikipedia articles as a knowledge repository. Khare et al. (2018) also consider cross-lingual capabilities; however, their approach is limited to the few languages present in their annotated training data and does not generalize to new languages without further training. M-BERT overcomes these shortcomings. Similar to us, Lorini et al. (2019) use multilingual word embeddings for cross-lingual classification, but they use non-contextual embeddings. Torres et al. (2019) use contextualized word embeddings for cross-lingual analysis, but only on limited samples (8K) and only for two languages (English and Spanish).
A few recent works (Pires et al., 2019;K et al., 2020) also demonstrate the strong cross-lingual and zero-shot transfer capabilities of M-BERT, but not in the disaster domain.

Aggregated Dataset
To prepare our large multilingual dataset, we aggregated several resources from CrisisNLP, together with two resources from CrisisLex. Specifically, we used Resource #1 (Imran et al., 2016a), Resource #4 (Nguyen et al., 2017), Resource #5 (Alam et al., 2018c), and Resource #7 (Alam et al., 2018a) from CrisisNLP, and CrisisLexT6 (Olteanu et al., 2014) and CrisisLexT26 (Olteanu et al., 2015) from CrisisLex. The original classes in each resource, together with the mapping to the new classes included in our dataset, can be seen in Table 1. Some examples from the dataset are shown in Table 2. For the dataset construction, the following classes were included:

Casualties and Damage (C & D):
This class consists of tweets related to affected individuals, displaced people, building collapse, rescue operations, infrastructure and utilities damage, needs of affected people, missing or trapped people, and other topics related to situational awareness and disaster response.

Donation and Volunteering (D & V):
This class consists of tweets related to donations, volunteering requests, and other needs and requests targeted to individuals following the disaster and/or supporting the victims.

Caution and Advice (C & A):
This class consists of tweets recommending caution, expressing warnings, or providing advice regarding the crisis situation. Such tweets are useful for the affected individuals.

Informative (I):
This is a general class, which includes: tweets belonging to any of the above three classes; tweets with niche categories that do not fit into the above classes; tweets with vaguer classes (e.g., "other useful information"); and tweets originally labeled with only binary classes such as relevant or informative.
Some of the above classes (for example, Casualties and Damage) are very broad and could be broken down into more specific classes. However, keeping them broad simplifies the aggregation of different annotation schemes and prevents the formation of multiple fine-grained but sparse classes. During aggregation, we treat the first four classes as mutually exclusive (they are also mutually exclusive with the Non-Informative class). We filter out duplicate tweets. For duplicates from different resources that were originally associated with more than one mutually exclusive class, we keep only the first class, based on the order in which classes are listed above.
Statistics about the final dataset with respect to the number of tweets per class and per language are shown in Table 3.

Classification Approach
In general, all of our models use a sentence encoder to map a tweet to a single vector sentence representation. The vector is then fed to multiple binary classifiers. Specifically, we train four classifiers. One classifier distinguishes between the Informative and Non-Informative classes, while the other three correspond to the remaining classes: Casualties and Damage, Caution and Advice, and Donation and Volunteering, respectively (each classifier predicts whether a tweet belongs to a particular class or not). We should note that many tweets belonging to the Informative class originally only had binary classes (informative/non-informative or relevant/non-relevant). While those tweets may also belong to one of the more fine-grained classes, their class could not be determined if it was not available in the original resources. In other words, many of the samples in the dataset are partially labeled (the binary Informative or Non-Informative class is present, but the fine-grained class information is absent). However, ignoring all partially labeled tweets would mean removing nearly half of the data. In order to benefit from the binary-classification-only data while also enabling the same model to perform multi-label classification, we devise a label masking strategy. Precisely, the mask ensures that the loss signal is only propagated from classes which are annotated. The strategy is discussed in further detail below.
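The multi-head setup described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the encoder itself (M-BERT in our case) is omitted and the sentence vector is assumed to be precomputed, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MultiHeadClassifier:
    """K independent binary heads on top of a sentence vector, one per
    class: Informative, Casualties & Damage, Caution & Advice,
    Donation & Volunteering."""

    def __init__(self, dim, num_classes=4):
        # one weight column and one bias per binary head
        self.W = rng.normal(scale=0.02, size=(dim, num_classes))
        self.b = np.zeros(num_classes)

    def predict(self, sent_vec):
        # independent sigmoids rather than a softmax: in principle a
        # tweet may belong to several fine-grained classes at once
        return sigmoid(sent_vec @ self.W + self.b)
```

Using independent sigmoids (rather than a single softmax) is what makes the multi-label treatment possible.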
By default, we use the negative class for the three fine-grained categories as dummy ground truth for such cases. We then mask out (i.e., zero out) the loss from the dummy ground-truth cases during training. For masking this loss, we use a class mask m_ij (i.e., a mask for the j-th class and the i-th sample), where m_ij is 0 if the actual j-th class ground truth is not present for the i-th sample, and 1 otherwise. Overall, we use binary cross-entropy for each of the classifiers, together with the class masks and class weights. The loss function can be formalized as:

L(θ) = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} m_ij [ c_j y_ij log P_θ(y_ij | x_i) + (1 − y_ij) log(1 − P_θ(y_ij | x_i)) ]

where K is the number of classes, N is the number of samples, c_j is the class weight for the j-th class, y_ij is the binary ground truth, x_i is the i-th tweet string, θ represents the model parameters, and P_θ(y_ij | x_i) is the model prediction for the i-th tweet string and the j-th class. We use class weights to handle class imbalance. We consider the cost of filtering out an important and urgent tweet to be higher than the cost of including a non-informative tweet. This is why we bias our model towards recall, by using class weights of value ≥ 1 for the positive classes. We use a class weight of 1 for the Informative versus Non-Informative classifier (as these classes are fairly balanced, with a small bias towards the positive class already). For the fine-grained classes, we compute the class weights from the class distribution, so that underrepresented positive classes receive higher weight. We should note here that the loss function does not treat the positive classes as mutually exclusive since, in principle, a single tweet could have multiple classes (for example, a tweet could express both 'Caution and Advice' and 'Casualties and Damage').
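A minimal sketch of this masked, class-weighted loss, assuming predicted probabilities are already available (a NumPy sketch with illustrative names; the actual implementation operates on the classifier logits):

```python
import numpy as np

def masked_weighted_bce(probs, labels, mask, class_weights):
    """Masked, class-weighted binary cross-entropy.

    probs         : (N, K) predicted P(y_ij = 1 | x_i)
    labels        : (N, K) binary ground truth (dummy zeros where unlabeled)
    mask          : (N, K) m_ij = 1 if class j is annotated for sample i, else 0
    class_weights : (K,)   c_j, applied to the positive term to bias recall
    """
    eps = 1e-7
    probs = np.clip(probs, eps, 1.0 - eps)
    pos = class_weights * labels * np.log(probs)
    neg = (1.0 - labels) * np.log(1.0 - probs)
    # the mask zeroes out the loss contribution of dummy (unannotated) labels
    return -(mask * (pos + neg)).sum() / probs.shape[0]
```

A fully masked entry contributes exactly zero gradient, so partially labeled samples still train the heads for which annotations exist.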

Manifold Mixup
We use Manifold Mixup as a regularization technique in our main model (M-BERT).
Mixup (Zhang et al., 2018) was originally introduced in the image classification domain as a data-augmentation-based regularization technique. The original technique augments data by linearly interpolating two different input samples and their associated classes. In effect, this helps make a model more robust by inducing linear behavior in between training samples. Guo et al. (2019) show that Mixup, both at the level of word embeddings and at the level of sentence embeddings (the output of the sentence encoder), is effective for text classification. Manifold Mixup is a more recent variant of the original input Mixup, where the hidden states of two different data samples, along with their associated classes, are linearly interpolated. To do this, a mixup ratio λ is first sampled from a Beta distribution: λ ∼ Beta(α, α).
Next, a hidden layer l is randomly chosen for mixup. Let h^l_i be the output of the chosen l-th hidden layer for the i-th tweet sample, and let h^l_j be the corresponding output for the j-th sample. The two outputs are mixed up as follows:

h̃^l_i = λ h^l_i + (1 − λ) h^l_j

where h̃^l_i is the augmented (mixed-up) hidden state. We use the same λ to mix the hidden states of the tweet samples i and j, and also the corresponding ground-truth classes and class masks for each class k included in our dataset:

ỹ_ik = λ y_ik + (1 − λ) y_jk,    m̃_ik = λ m_ik + (1 − λ) m_jk

where ỹ_ik and m̃_ik are the corresponding mixed-up class and class mask, respectively.
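The mixing step can be sketched as follows (a NumPy sketch with illustrative names; in practice h_i and h_j are the hidden states of a randomly chosen encoder layer for two samples in the mini-batch):

```python
import numpy as np

rng = np.random.default_rng(0)

def manifold_mixup(h_i, h_j, y_i, y_j, m_i, m_j, alpha=2.0):
    """Interpolate the hidden states, labels, and class masks of two
    samples with a single lambda ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    h_mix = lam * h_i + (1.0 - lam) * h_j   # mixed hidden state
    y_mix = lam * y_i + (1.0 - lam) * y_j   # mixed ground-truth classes
    m_mix = lam * m_i + (1.0 - lam) * m_j   # mixed class masks
    return h_mix, y_mix, m_mix, lam
```

Reusing the same λ for states, labels, and masks keeps the mixed target consistent with the mixed representation.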

Experimental Setup
We use four datasets for testing: Russia Meteor, Cyclone Pam, Philippines Flood, and Mixed disasters.
To demonstrate the generalization capabilities of our models, we ensured that the first three datasets are from disasters that are absent in the training set. For the M-BERT-based models, we use a mini-batch size of 32, a learning rate of 10^−3 for non-BERT parameters, and a fine-tuning rate of 2 × 10^−5 for the M-BERT parameters. We set the parameter α of the Beta distribution for the Mixup equation to 2. We run each model five times and report the mean and standard deviation of the results over the five runs. For the other models, we import the parameter settings from their corresponding papers and then perform light manual tuning. The exact hyperparameters are available on GitHub. For significance testing, we use the paired t-test (p ≤ 0.05) (Dror et al., 2018). Note that the CNN baseline is also similar to the model used by Nguyen et al. (2017), which was demonstrated to be a strong performer in disaster-related classification.
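The significance test amounts to comparing paired per-run scores of two models; the statistic can be sketched as follows (a NumPy sketch with illustrative names and scores; in practice a library routine such as SciPy's `ttest_rel` would be used to also obtain the p-value):

```python
import numpy as np

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-run scores (e.g., F1 over 5 seeds)
    of two models evaluated on the same runs."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(d)
    # mean difference divided by its standard error
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))
```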

Results
In Table 4, we show the results on English-only samples, and in Table 5, we show the results on the full multilingual test sets. As can be seen in Table 4, M-BERT outperforms all the non-BERT baseline models. Using Manifold Mixup consistently increases the performance of the base M-BERT in all cases, often also working better than Word Mixup and Sentence Mixup, especially in the multilingual setting (see Table 5). Manifold Mixup either outperforms or is very close to the other Mixup techniques. Table 6 shows the results of the cross-lingual experiments with M-BERT and Manifold Mixup for the French, Italian, and Spanish languages, respectively, in a zero-shot setting (Pires et al., 2019), where no tweets in the test language are included in the training set.
As can be seen from the table, the zero-shot F1-score on Spanish is 85.25% (comparable to the best results in the previous experiments), despite the fact that no Spanish tweets were included in the training. The zero-shot F1-scores on French and Italian are 81.33% and 75.44%, respectively. These results show that the M-BERT+Manifold Mixup model generalizes well in the new-language (zero-shot) setting. Thus, we can conclude that our M-BERT+Manifold Mixup model can generalize to a disaster in a new language (unseen in the training set), as long as the language is one of the 104 languages on which M-BERT was pre-trained. This is a strong result, given that disasters can happen in countries with limited resources for automated classification of social media information.
In Table 7, we examine the binary classification performance of M-BERT+Manifold Mixup for each class. As we can see, our model achieves an F1 above 90% on the binary task of distinguishing whether a tweet is informative or not. Interestingly, it does not perform too poorly on Caution and Advice either, despite having very limited samples for this class in the training set.

Conclusion
We present a way to aggregate prior disaster-related resources into a large-scale tweet dataset for multi-label classification, utilizing both multi-class and binary labels. We motivate the use of M-BERT for disaster-related tweet classification and demonstrate its strong performance on unseen disasters and languages. We also motivate the use of Manifold Mixup for further improvement. In the future, it would be interesting to explore weak supervision and other data augmentation techniques to further improve model robustness.