Detecting Urgency Status of Crisis Tweets: A Transfer Learning Approach for Low Resource Languages

We release an urgency dataset that consists of English tweets relating to natural crises, along with annotations of their corresponding urgency status. Additionally, we release evaluation datasets for two low-resource languages, Sinhala and Odia, and demonstrate effective zero-shot transfer from English to these two languages by training cross-lingual classifiers. We adopt cross-lingual embeddings constructed using different methods to extract features of the tweets, including state-of-the-art contextual embeddings such as BERT, RoBERTa and XLM-R. We train classifiers of different architectures on the extracted features. We also explore semi-supervised approaches by utilizing unlabeled tweets, and experiment with ensembling different classifiers. With a very limited amount of labeled data in English and zero data in the low-resource languages, we show a successful framework for training monolingual and cross-lingual classifiers using deep learning methods, which are known to be data-hungry. Specifically, we show that recent deep contextual embeddings are also helpful when dealing with very small-scale datasets. Classifiers that incorporate RoBERTa yield the best performance for the English urgency detection task, with F1 scores more than 25 points above our baseline classifier. For zero-shot transfer to the low-resource languages, classifiers that use LASER features perform best for the Sinhala transfer, while XLM-R features benefit the Odia transfer the most.


Introduction
People all over the world use social media, e.g. Twitter, Facebook, to communicate with the outside world during crises that are either natural or man-made. During an emergent crisis, people post to report their well-being, ask for help, or give updates about the ongoing situation. This type of text data can be utilized to provide situational awareness to support missions such as humanitarian assistance/disaster relief, peacekeeping or infectious disease response. However, with more than 7,000 languages in existence worldwide, automated human language technology does not exist for many of them. 1 A possible solution to this problem is to transfer models learned in high resource language settings such as English to low resource languages. In addition, there has been significant research in the use of transfer models in semantic analysis of texts such as sentiment (Socher et al., 2013; Rasooli et al., 2018) and emotion (Tafreshi and Diab, 2018).
To this end, we collect and release English, Sinhala and Odia urgency datasets that consist of tweets relating to natural crises, annotated with urgency status. 2 To demonstrate that we are able to effectively transfer the task of urgency detection from English to low-resource languages, we use English annotated tweets for training, and Sinhala/Odia annotated tweets for evaluation only, thus exploring zero-shot transfer. Specifically, we consider the following two tasks: a) English classification, for which we hold out 20% of the English dataset for evaluation and use the remaining 80% for training; b) cross-lingual classification, for which we use the entire English dataset for training and the corresponding Sinhala or Odia dataset for evaluation. For the English classification task, we implement classifiers of different architectures adopting various embeddings including BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and XLM-R (Conneau et al., 2020), which we use to extract features; we then train a classifier that takes the contextualized representations of the tweets into account. For the cross-lingual classification task, we build classifiers using the same set of architectures, but deploying various cross-lingual embeddings that are constructed using different methods: LASER (Artetxe and Schwenk, 2019) and XLM-R (Conneau et al., 2020). For both tasks, we employ semi-supervised approaches by generating pseudo-labels for a large amount of unlabeled crisis-related tweets, in order to improve system performance. Last but not least, we ensemble different classifiers to boost performance further.

Dataset
Tweets about many natural and human-induced disasters such as earthquakes, typhoons, and landslides were collected in prior work. We annotate a subset of them at the tweet level on the Figure-Eight data annotation platform 3 as seen in Figure 1. The annotation tag set comprises the following four levels of urgency:
• Extremely Urgent: aspects of the tweet refer to an extremely urgent and difficult situation; e.g. MT @SushmaSwaraj my uncle is in kathmandu, trapped, suffers from jaundice, chest infection, diabetes, his number #NepalQuake
• Definitely Urgent: tweet contains content that is urgent but the level of urgency is not as high; e.g. @MountainGuides1 Please help us find my friends parents Last heard from on way to Everest base camp.#NepalEarthquake
• Somewhat Urgent: tweet contains some content that could be considered urgent but it is not as certain as in the two categories above; e.g. MT @dineshakula Med supplies required in Bir Hospital. Out of medical supplies http://t.co/4pPhg2aVhg #Kathmandu #NepalQuake #hmrd
• Not Urgent: tweet does not include any content that can be considered urgent; e.g. Prayers and thoughts with those affected by the earthquake
As it can be difficult for an annotator to decide whether a tweet is urgent, we provide four scales of urgency, which can then be converted to a binary set of tags: Urgent vs. Not Urgent. The level of agreement between multiple contributors, the confidence score, is weighted by contributors' trust scores and averages 68.6%. 4 A set of 52 test questions with correct labels was distributed throughout the task, on which annotators needed to maintain 70% accuracy.
After removing duplicates and inconsistencies, the final data consists of 1,919 annotations, as summarized in Table 1. To map the four labels to a binary representation, Not Urgent and Somewhat Urgent are mapped to the False label, whereas the remaining two labels are mapped to True. This yields a binary dataset with an urgent ratio of 26.7%. One advantage of having a fine-grained label structure is the ability to capture the intensity of urgency. In addition, depending on the situation, the binary urgency threshold can be adjusted, e.g. toward a higher urgency ratio for a dire situation and a lower ratio for a less critical incident.
However, the fine-grained scheme may also have made the annotation task more challenging. When we analyzed the annotations, we noticed that some of the tweets about rescue efforts were particularly confusing: tweets that are general status updates about an incident and more critical tweets that are asking for help are both labeled as urgent, without a distinction between the two. This demonstrates one of the many difficulties of annotating for urgency, partially due to the tendency to label a tweet as urgent even when the urgency of the event has passed.
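The four-to-binary label mapping and urgent-ratio computation described above can be sketched directly (the label strings here are our shorthand for the tag set):

```python
# Map the four-level urgency tags to the binary Urgent / Not Urgent scheme:
# the two lower levels become False, the two higher levels become True.
FOUR_TO_BINARY = {
    "Extremely Urgent": True,
    "Definitely Urgent": True,
    "Somewhat Urgent": False,
    "Not Urgent": False,
}

def urgent_ratio(labels):
    """Fraction of annotations mapped to the binary True (Urgent) label."""
    binary = [FOUR_TO_BINARY[label] for label in labels]
    return sum(binary) / len(binary)

print(urgent_ratio(["Extremely Urgent", "Somewhat Urgent",
                    "Not Urgent", "Definitely Urgent"]))  # 0.5
```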

Low Resource Languages
Linguistic Data Consortium (LDC) incident language (IL) packages are produced for the Low Resource Languages for Emergent Incidents (LORELEI) program. They cover a range of genres from formal news to informal social media and blogs, as well as reference materials such as Wikipedia. They include parallel corpora with sentence-aligned data in English and the IL. The languages Sinhala 5 (IL10) and Odia 6 (IL11) are annotated at the sentence level by native informants for urgency in a binary label distribution, as illustrated in Table 2. Both are Indo-Aryan languages; Sinhala is spoken primarily in Sri Lanka and Odia is spoken in the Indian state of Odisha.

Methodology
We explain our approaches to data preprocessing, English monolingual and low resource cross-lingual classification in Sections 3.1, 3.2 and 3.3, respectively.

Preprocessing of Tweets
We adopt the tweet preprocessing procedure described in CrisisNLP (Nguyen et al., 2016), which removes URLs and special characters and converts text to lowercase. In addition, we remove usernames and segment hashtags using a word segmentation tool, 7 e.g. #NepalEarthquake becomes nepal and earthquake. We apply the same preprocessing procedure to English, Sinhala and Odia, with the exception of hashtag segmentation for Sinhala/Odia.
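A minimal sketch of this pipeline follows; the naive CamelCase splitter here stands in for the actual word segmentation tool, so it is an approximation rather than our exact implementation:

```python
import re

def segment_hashtag(tag: str) -> str:
    # Naive CamelCase/digit split; the real pipeline uses a word segmentation tool.
    return " ".join(w.lower() for w in re.findall(r"[A-Z]+[a-z]*|[a-z]+|\d+", tag))

def preprocess(tweet: str) -> str:
    tweet = re.sub(r"http\S+", " ", tweet)                                   # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)                                      # remove usernames
    tweet = re.sub(r"#(\w+)", lambda m: segment_hashtag(m.group(1)), tweet)  # segment hashtags
    tweet = re.sub(r"[^a-zA-Z0-9\s]", " ", tweet)                            # remove special characters
    return " ".join(tweet.lower().split())                                   # lowercase, squeeze spaces

print(preprocess("MT @SushmaSwaraj trapped in Kathmandu #NepalQuake http://t.co/x"))
# mt trapped in kathmandu nepal quake
```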

English Classification
To start with, we build classifiers for detecting urgency given tweets in English to establish an understanding of the baseline performance of this task without the effect of transferring between languages.

Monolingual Embeddings
For all of our classifiers, we first use word/sentence embeddings to extract features from the input tweets. We experiment with the following variations when choosing the English embeddings: contextual vs. non-contextual, and out-of-domain vs. in-domain. We choose two non-contextual embeddings: fastText embeddings (Bojanowski et al., 2017) and CrisisNLP embeddings (Nguyen et al., 2016). fastText embeddings are trained on texts from Wikipedia and Common Crawl (both out of the crisis domain), whereas CrisisNLP embeddings are trained on disaster-related tweets, i.e. in-domain. Both embeddings project each word in a sentence to a 300-dimensional vector representation. We also use BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and XLM-R (Conneau et al., 2020) to generate contextual representations of the tweets for English. 8 A list of embeddings and their availability for each language is shown in Table 3.
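For the non-contextual embeddings, a tweet-level feature vector can be built by idf-weighted averaging of the per-word vectors. The sketch below uses random vectors as stand-ins for a pre-trained fastText or CrisisNLP lookup table; the function names are our own:

```python
import math
import numpy as np

def idf_weights(corpus):
    """Inverse document frequency per word, computed over the whole corpus."""
    n = len(corpus)
    df = {}
    for doc in corpus:
        for w in set(doc.split()):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}

def sentence_vector(doc, idf, lookup, dim=300):
    """idf-weighted average of per-word embeddings; OOV words are skipped."""
    vecs, weights = [], []
    for w in doc.split():
        if w in lookup:
            vecs.append(lookup[w])
            weights.append(idf.get(w, 0.0))
    if not vecs or sum(weights) == 0:
        return np.zeros(dim)
    return np.average(np.stack(vecs), axis=0, weights=weights)

# Toy stand-in for a pre-trained 300-dimensional embedding table.
rng = np.random.default_rng(0)
lookup = {w: rng.normal(size=300) for w in ["nepal", "quake", "help", "trapped"]}
corpus = ["help trapped nepal quake", "nepal quake", "help"]
idf = idf_weights(corpus)
vec = sentence_vector("help trapped nepal quake", idf, lookup)
print(vec.shape)  # (300,)
```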

Classifier Architecture
Since we have a limited amount of annotated English tweets, we adopt relatively simple models such as Support Vector Machines (SVM) and Random Forest, as well as shallow neural networks, Multi-Layer Perceptron (MLP) and Convolutional Neural Networks (CNN), as classifiers (Nguyen et al., 2016). The task is a binary classification task where the labels correspond to an urgency status of Urgent or Not Urgent. The inputs to these classifiers are features of the tweets extracted by the various embeddings mentioned in Section 3.2.1. For MLP classifiers (shown in Figure 2), we use sentence representations that are either contextual or the inverse-document-frequency (idf) weighted average of the word embeddings, with the idf weight of each word computed on the entire English dataset. Next, we apply a sequence of dense layers with batch normalization, Rectified Linear Unit (ReLU) activation, and dropout layers in between. We choose the hyper-parameters empirically; the optimal sequence of dense layer widths is 1,024, 512, and 64. For CNN classifiers, we use the same architecture proposed in CrisisNLP (Nguyen et al., 2016). Specifically, we apply a convolutional layer, followed by batch normalization, ReLU activation and a max-pooling layer. Finally, we apply a dense layer after flattening the previous CNN layer's outputs.
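The dense stack can be approximated with scikit-learn's MLPClassifier. This is only a rough stand-in, since scikit-learn does not expose the batch-normalization and dropout layers we use, and the data below is random toy data:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 300))   # stand-in for idf-weighted sentence vectors
y = rng.integers(0, 2, size=40)  # stand-in binary urgency labels

# Dense widths 1,024 / 512 / 64 with ReLU, as in the architecture above;
# batch normalization and dropout are omitted (not available in scikit-learn).
clf = MLPClassifier(hidden_layer_sizes=(1024, 512, 64),
                    activation="relu", max_iter=20, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]).shape)  # (3,)
```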

Data Augmentation
We experiment with a semi-supervised training scheme to augment the training dataset (shown in Algorithm 1). We adopt self-training approaches (Yarowsky, 1995; McClosky et al., 2006), in which we add the best performing classifier's predictions on unlabeled data to the initial, manually annotated training dataset. We sample the unlabeled tweets from the same collection of disaster-related tweets from which we select and annotate a subset to create our English training dataset, as described in Section 2, and we make sure that the set of unlabeled tweets and the set of training data are disjoint.

Algorithm 1 Incremental Training Workflow
Let source language training dataset be S
Let unlabelled source language dataset be U
Let target language testing set be T
while |S| < 16k do
  Train 3 classifiers of the same type C1, C2 and C3 on S independently
  Predict the labels L using C1(U), C2(U) and C3(U)
  Retrieve the subset U0 of U where all classifiers agree, and the corresponding label set L0
  Break if |U0| = 0
  S ← S ∪ (U0, L0)
end while
Train 3 classifiers C1, C2 and C3 on S independently
Output classifier C(T) = majority vote among C1(T), C2(T) and C3(T)

To enforce consistency and reduce the bias of predictions on unlabeled data, we leverage the agreement of three independently trained best performing English classifiers, trained on RoBERTa features in this case, by adding a tweet to the training data only if all three classifiers yield the same prediction. After a round of predictions, we obtain a larger training dataset, on which we train another three independent top-performing classifiers and conduct a second round of predictions on the remaining unlabeled tweets. We repeat this procedure until no remaining tweets receive the same prediction from all three classifiers, finally obtaining 16,243 samples (including the original 1,952 labeled samples) for training. We also experiment with varying the size of the synthetic data utilized: 3K, 10K, and 20K. We observe that 16K yields the best performance.
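A compact sketch of this agreement-based self-training loop follows, with logistic regression on random features standing in for our RoBERTa-based classifiers; the data, function name, and stopping parameters are all illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X, y, X_unlabeled, rounds=5, n_voters=3):
    """Grow (X, y) with unanimous predictions of n_voters classifiers."""
    X, y = X.copy(), y.copy()
    pool = X_unlabeled.copy()
    for _ in range(rounds):
        # Train the voters independently on bootstrap resamples of the data.
        clfs = []
        for seed in range(n_voters):
            rng = np.random.default_rng(seed)
            idx = rng.integers(0, len(X), size=len(X))
            clfs.append(LogisticRegression(max_iter=200).fit(X[idx], y[idx]))
        if len(pool) == 0:
            break
        preds = np.stack([c.predict(pool) for c in clfs])
        agree = (preds == preds[0]).all(axis=0)  # keep unanimous predictions only
        if not agree.any():
            break
        X = np.vstack([X, pool[agree]])
        y = np.concatenate([y, preds[0][agree]])
        pool = pool[~agree]
    return X, y

rng = np.random.default_rng(0)
X0 = rng.normal(size=(30, 10))
y0 = (X0[:, 0] > 0).astype(int)      # toy labeled seed set
U = rng.normal(size=(100, 10))       # toy unlabeled pool
X_aug, y_aug = self_train(X0, y0, U)
print(len(X_aug) >= len(X0))  # True
```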

Ensemble Model
Algorithm 2 Ensemble Workflow
Let source language training dataset be S
Let unlabelled source language dataset be U
Let testing set be T
for each classifier type e do
  Incrementally train classifier Ce on S and U independently
end for
Output: vote among Ce(T)

To further improve the performance of the urgency detection system, we ensemble various classifiers by voting (Algorithm 2). Instead of a classic majority vote, we adopt a more aggressive voting strategy that predicts positive if any of the independent models yields a positive prediction. This achieves better recall, so that more urgent messages are reported.
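The aggressive rule itself is a one-liner: a tweet is flagged Urgent as soon as any ensemble member flags it (function name is ours):

```python
def aggressive_vote(predictions):
    """predictions: one boolean list per classifier, aligned by tweet.
    Returns True for a tweet if any classifier predicts Urgent."""
    return [any(votes) for votes in zip(*predictions)]

print(aggressive_vote([[True, False, False],
                       [False, False, False],
                       [False, False, True]]))  # [True, False, True]
```

Compared with a majority vote, this trades some precision for recall, which is the preferred direction when missing an urgent message is costlier than a false alarm.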

Cross-lingual Classification
The major component of the cross-lingual classification task is the cross-lingual embeddings that are the inputs to the classifiers whose architectures are similar to those for the English tasks. By training classifiers with these features, we are able to transfer the task of urgency detection from English to Sinhala and Odia. The entire process of our transfer approach is shown in Figure 3 and Algorithm 1.

Cross-lingual Embeddings
To generate a cross-lingual embedding that can be used to transfer from English to Sinhala, we use a parallel corpus that contains English-Sinhala sentence pairs, as well as pre-trained English and Sinhala embeddings. There are many approaches to generating cross-lingual embeddings given these resources, but in our study we focus on projection-based methods of training the embeddings: VecMap (Artetxe et al., 2018) and Proc-B (Glavaš et al., 2019). As a first step, we use the fast-align tool (Dyer et al., 2013) to create symmetric word alignments between source and target words given the parallel corpus, then choose the most frequent translation for each word (Rasooli et al., 2018). This generates a bilingual dictionary with an approximate vocabulary size of 72K for each language, which is used as a seed dictionary to generate the cross-lingual embeddings by projecting the pre-trained English and Sinhala monolingual embeddings into the same semantic space. We employ the same procedure to generate the English-Odia embeddings, given the English-Odia parallel corpus and the pre-trained English and Odia embeddings. For all the pre-trained monolingual embeddings (English, Sinhala and Odia), we use fastText (Grave et al., 2018) embeddings, which are trained on Common Crawl and Wikipedia. In addition, we use pre-trained contextual cross-lingual embeddings that are publicly available: LASER (Artetxe and Schwenk, 2019), a cross-lingual sentence embedding model trained on texts in 93 languages including Sinhala, and XLM-R (Conneau et al., 2020), which is trained on Common Crawl text data in 100 languages, including Sinhala and Odia.
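VecMap and Proc-B differ in their details, but both share a projection step that can be sketched as an orthogonal Procrustes problem: given seed-dictionary vector pairs, find the orthogonal map W minimizing ||XW - Y||. The sketch below uses random vectors standing in for real fastText seed-dictionary embeddings:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||XW - Y||_F, from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
d = 50
X = rng.normal(size=(1000, d))                # "source" seed-dictionary vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden ground-truth rotation
Y = X @ Q                                     # "target" seed-dictionary vectors

W = procrustes(X, Y)                          # recovers the rotation
print(np.allclose(X @ W, Y, atol=1e-6))  # True
```

After projection, source- and target-language word vectors live in the same space, so a classifier trained on English features can be applied to Sinhala or Odia features directly.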

Experiments
We report Macro Precision, Recall and F1 scores for the English classification task and the cross-lingual classification tasks in Tables 4, 5 and 6, respectively (the best scores for each section are underlined and shown in bold). For macro averaging, we calculate precision, recall and F1 scores for both the positive and negative labels, then report their unweighted mean. The column Original refers to results on the original human-annotated dataset, and 16K with Synthetic Data refers to results on the larger datasets generated with the method described in Section 3.2.3. Since the evaluation datasets for all the tasks are small, for each experimental setting that is cheap to reproduce we report the mean and standard deviation of 30 independent runs to reduce inconsistencies and improve confidence. As a baseline, we report results of a classifier that assigns a label randomly based on the label distribution in the English training data. We use scikit-learn (Pedregosa et al., 2011) for the SVM and random forest classifiers and the PyTorch platform 9 for the deep learning classifiers. For incorporating the deep pre-trained contextual models, our codebase relies heavily on the transformer implementations by Hugging Face, which makes it easy to switch to future large-scale pre-trained models. 10
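The macro averaging used in our tables can be reproduced with scikit-learn; a minimal illustration on toy labels:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels: 1 = Urgent, 0 = Not Urgent. Macro averaging computes
# precision/recall/F1 per class and reports their unweighted mean.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.733 0.733 0.733
```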

Analysis
For the English classifiers, we observe the following:
• Deep pre-trained models that produce contextual representations of the tweets benefit the task of urgency detection the most, even in the presence of a limited amount of data. We believe this is because these models generally produce representations that are better in quality and higher in dimension. We tried fine-tuning these pre-trained models and found that performance deteriorates noticeably when the downstream task has a very small dataset (Goodfellow et al., 2016);
• In-domain embeddings (CrisisNLP) are consistently better than out-of-domain embeddings (fastText) for English classification tasks;
• The semi-supervised approach of augmenting the dataset does not necessarily boost performance further when the pseudo-labels are generated by classifiers that are trained on limited resources.
For the cross-lingual classifiers, we draw the following conclusions:
• Between VecMap and Proc-B, we see similar performance across languages and classifiers. This is likely because both take similar, projection-based approaches to generating cross-lingual embeddings;
• Adding synthetic data consistently improves the performance of the classifiers for Odia but not for Sinhala (in terms of macro F1-score). We suspect that synthetic data helps when the label distribution of the training dataset is similar to that of the evaluation dataset: the difference in urgent-tweet ratios is larger for Sinhala (7.7% vs. 18.5%) than for Odia (16.1% vs. 18.5%), which may explain why the synthetic data produces no improvement for Sinhala. The urgent-tweet ratios are shown in Table 7 for reference;
• For the English-Sinhala task, we observe that the LASER-based classifier yields better performance. This could be because a) LASER uses bigger parallel corpora, i.e. 796,000 sentences, and b) LASER is a sentence-level contextual embedding, which is better than the order-independent idf-weighted-averaging method of producing sentence representations that the rest of the classifiers adopt.
For both monolingual and cross-lingual classification, MLP-based classifiers with idf-weighted averaging of the word embeddings are consistently better than CNN-based classifiers. When the large amount of synthetic data is present, CNN classifiers improve more than MLP classifiers, compared to training on the original dataset; after adding the synthetic data, both CNN and MLP classifiers yield similar performance. Finally, ensembling with the aggressive voting strategy leads to better classification performance in both the English and cross-lingual tasks, as shown in Table 8.

Related Work
The Crisis NLP 11 website provides social media datasets and classifiers covering various disasters in several languages, i.e. English, Spanish and French, which are all high resource languages. For low resource languages, due to the very limited amount of data, transfer learning approaches must be adopted that transfer a high-resource model to a low-resource language (Chaudhary et al., 2019). Kejriwal and Zhou (2019) apply a manual feature-based approach to transfer urgency labels from English to several low resource languages, combined with active learning to increase the amount of labels. Recent successful techniques in transfer learning, however, use cross-lingual embeddings combined with deep learning based classifiers. Cross-lingual embeddings map words in different languages into the same semantic space; among these methods, we use projection based approaches, i.e. VecMap and Proc-B, rather than parallel corpus based ones, e.g. BiSkip (Luong et al., 2015), due to their superior performance. This has been shown to work well for sentiment (Socher et al., 2013; Rasooli et al., 2018) and emotion (Tafreshi and Diab, 2018). In addition, after the success of contextual language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) in many NLP tasks, their multilingual versions became available, i.e. Multilingual BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), which we experimented with based on their availability for our languages. Their adaptation to low resource settings, e.g. fine-tuning with small datasets, is not trivial and is not as reliable as in high resource settings. As such, we show how this can be achieved with our experimental setup. Specifically, we use a self-learning method by voting (Zhou and Goldman, 2004) to increase the size of the high resource language dataset using unlabelled Crisis NLP tweets.
We decide not to use tri-training (Zhi-Hua Zhou and Ming Li, 2005; Ruder and Plank, 2018) due to the small size of the original English data, despite the fact that tri-training has shown good results in NLP tasks with domain shift.

Conclusion
In this study, we release an urgency dataset consisting of English tweets about natural crises and their urgency status. In addition, we release two evaluation datasets for urgency detection in Sinhala and Odia. We train monolingual classifiers for English, and cross-lingual classifiers for Sinhala and Odia that are zero-shot learners. In designing our classifiers, besides exploring different architectures, we adopt different monolingual or cross-lingual embeddings that are either pre-trained or constructed using different methods. Due to the limited amount of labeled data, we generate synthetic data to improve system performance, and ensemble classifiers to boost performance even further. We conclude that if synthetic data can be produced with high confidence, it is helpful when transferring between domains that have similar label distributions. Specifically, for English urgency detection, the best performing classifier utilizes contextual features produced by the pre-trained RoBERTa model, and among non-contextual embeddings, in-domain embeddings outperform out-of-domain embeddings. For cross-lingual transfer, classifiers that incorporate LASER features perform best for transferring to Sinhala, while XLM-R features benefit the transfer of urgency detection knowledge to Odia the most. Finally, in the absence of pre-trained contextual embeddings for a low resource language, we also demonstrate alternative ways to achieve similar performance using cross-lingual embeddings constructed by projection based approaches, i.e. VecMap and Proc-B.