Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation

Tweets differ markedly from general text. Although sentiment analysis over tweets has become very popular in the last decade for English, it is still difficult to find large annotated corpora for non-English languages. The recent rise of transformer models in Natural Language Processing makes it possible to achieve unparalleled performance in many tasks, but these models need a substantial quantity of text to adapt to the tweet domain. We propose to use a multilingual transformer model that we pre-train over English tweets, to which we apply data-augmentation using automatic translation, in order to adapt the model to non-English languages. Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of transformers over small corpora of tweets in a non-English language.


Introduction
Monitoring social media at the scale of a continent, like Europe, requires processing multiple languages. Data from Twitter is noisy, and its word distribution can differ drastically from that of general text in the same language. The problem becomes even more challenging when tackling a multilingual task.
In accordance with the GDPR, tweets that have been deleted by their authors cannot be made available. This makes it more difficult to find tweet corpora annotated for sentiment. The SemEval challenges provide a large collection of English tweets, with a total of more than 62k examples annotated at the tweet level (Rosenthal et al., 2018). For other languages, the situation is more complicated.
Bidirectional transformers like BERT (Devlin et al., 2018) revolutionized the world of Natural Language Processing (NLP). Even when pre-trained over general-domain text, these models need a substantial amount of data to adapt to a domain where the syntax is different, like Twitter (Nguyen et al., 2020). This paper presents experiments carried out on several datasets of tweets in five different languages: English, French, Spanish, German and Italian. The general idea is simple: instead of using a monolingual model, we use a multilingual model that we can train over a large dataset of English tweets, over the original non-English tweets and over their automatic translations. We chose the multilingual transformer model XLM-RoBERTa (Lample and Conneau, 2019), combined with a data-augmentation technique using machine translation. We investigated the effects of pre-training with English data and of data-augmentation. We also compared the performance of multilingual models against their monolingual French (Martin et al., 2020) and English (Liu et al., 2019) counterparts and found interesting improvements.

State of the Art
The related work for this paper spans multilingual sentiment analysis, data-augmentation and sentiment analysis over tweets.
Several challenges tackle sentiment analysis over tweets, such as SemEval (Nakov et al., 2013) in English, TASS (Villena-Román et al., 2013) in Spanish or DEFT (Hamon et al., 2015) in French. In recent years, neural networks have dominated sentiment analysis over tweets. Cliche (2017) won the sentiment polarity subtask of the SemEval-2017 challenge (Rosenthal et al., 2018) using a neural network approach. Singh et al. (2019) improve the state of the art with a different treatment of emojis: they replace each emoji with its textual description rather than its Unicode code point, before feeding a BLSTM with word embeddings. Finally, Nguyen et al. (2020) proposed a BERT model with the RoBERTa pre-training procedure (Liu et al., 2019) over 850M English tweets. This model gives state-of-the-art results on the SemEval-2017 dataset but, like all the other models, it is not adapted to multilingual data.
Although data-augmentation is not as developed for textual data as it is for images, there are ways to apply it to text (Wei and Zou, 2020). Sennrich et al. (2016) augment their datasets using back-translation to create sentence pairs for a machine translation (MT) task. Kobayashi (2018) uses language models in context in order to create new plausible sentences. Fadaee et al. (2017) use data-augmentation for MT, automatically translating at the level of words using plausible substitutions in order to create a new sentence.
The work closest to ours remains that of Balahur and Turchi (2013) and Balahur et al. (2014), who tackle a sentiment analysis task over multilingual tweets. They show that the use of multilingual, machine-translated data can help to better distinguish relevant features for sentiment classification, using SVM models with Bag-of-N-Grams. Our work differs in that we use real datasets for testing instead of artificially created test sets made of translated tweets re-edited by humans.

Proposed Method
The method we propose is very simple. It consists in using a multilingual model instead of a monolingual model, pre-training it over available annotated English tweets, and combining it with a data-augmentation technique based on machine translation. With this augmentation, each tweet yields five examples, one in each of five languages. These are the languages of the datasets we used for testing: French, English, German, Spanish and Italian.

Pre-training over External Datasets
As noted earlier, tweet datasets are harder to find in languages other than English. We therefore investigated the potential of using a multilingual model pre-trained over English tweets only, and over English tweets automatically translated into other languages.
We investigated the effect of using other available English datasets with a multilingual model. To that end, we pre-trained the neural network with tweets annotated for the SemEval-2013 to SemEval-2016 challenges. We used the original tweets in English, but also their automatic translations into the 4 other languages we studied, leading to a training dataset five times larger.

Data Augmentation and Multilingual Training
Translating the tweets into other languages allows our model to see 5 times the amount of data that it would otherwise have seen. The translations from the source language into the 4 other languages were made with the automatic translation tool of the European Commission, which is comparable to Google Translate. Examples of tweets and their translations are given in Table 1. It is important to note that the quality of the translation is not optimal, since the translation system was trained over general text, whereas tweets can be noisy data containing abbreviations and neologisms.
Finally, we always fine-tune the model over the non-English target datasets. The results of the models that are not fine-tuned on the target languages are poor and are not reported in this article.
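The augmentation step described in this section can be illustrated with a minimal sketch. It assumes a hypothetical `translate(text, source, target)` callable standing in for the machine translation tool; the function names, the dictionary layout of an example and the language codes are illustrative assumptions, not the code used in the experiments.

```python
from typing import Callable, Dict, List, Sequence

# Language codes for the five languages studied; the codes themselves are an assumption.
LANGUAGES = ("en", "fr", "de", "es", "it")

Example = Dict[str, str]  # keys: "text", "label", "lang"


def augment_with_translations(
    examples: List[Example],
    translate: Callable[[str, str, str], str],
    languages: Sequence[str] = LANGUAGES,
) -> List[Example]:
    """Return each original example plus one translated copy per other language."""
    augmented: List[Example] = []
    for example in examples:
        augmented.append(dict(example))  # keep the original tweet
        for target in languages:
            if target == example["lang"]:
                continue  # no translation into the source language
            augmented.append({
                "text": translate(example["text"], example["lang"], target),
                # Assumption: the sentiment label is unchanged by translation.
                "label": example["label"],
                "lang": target,
            })
    return augmented


def fake_translate(text: str, source: str, target: str) -> str:
    # Hypothetical stand-in for a real MT system: it only tags the text.
    return f"[{source}->{target}] {text}"


tweets = [
    {"text": "I love this phone", "label": "positive", "lang": "en"},
    {"text": "Quelle journée horrible", "label": "negative", "lang": "fr"},
]
augmented = augment_with_translations(tweets, fake_translate)
print(len(tweets), len(augmented))
```

With five languages, every tweet contributes its original text plus four translations, so the augmented set is five times the size of the original one.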

Methodology
The pre-trained models that we used were made available online through the transformers library (Wolf et al., 2019). The same library was employed for the training of the models. We used the Adam algorithm (Kingma and Ba, 2014) with early stopping to optimize the training loss, with a learning rate of 2e-6 for the pre-training of the model over English tweets and 5e-7 for the fine-tuning over non-English tweets. We computed the performance on the development set after each training epoch, and kept the model obtaining the best performance. We used a batch size of 32.
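The model selection procedure above (evaluation on the development set after each epoch, keeping the best model, with early stopping) can be sketched as follows. This is only an illustration under stated assumptions: `train_epoch` and `evaluate` are hypothetical callables, and the `patience` and `max_epochs` values are not given in the paper. Only the two learning rates and the batch size come from the text.

```python
import copy
from typing import Any, Callable, Tuple

PRETRAIN_LR = 2e-6   # pre-training over English tweets (from the text)
FINETUNE_LR = 5e-7   # fine-tuning over non-English tweets (from the text)
BATCH_SIZE = 32      # from the text


def train_with_early_stopping(
    model: Any,
    train_epoch: Callable[[Any, float], None],
    evaluate: Callable[[Any], float],
    learning_rate: float,
    max_epochs: int = 50,   # assumption, not stated in the paper
    patience: int = 3,      # assumption, not stated in the paper
) -> Tuple[Any, float]:
    """Train epoch by epoch and return the model with the best dev score."""
    best_score = float("-inf")
    best_model = copy.deepcopy(model)
    epochs_without_gain = 0
    for _ in range(max_epochs):
        train_epoch(model, learning_rate)
        score = evaluate(model)  # performance on the development set
        if score > best_score:
            best_score = score
            # A real setup would save a checkpoint instead of deep-copying.
            best_model = copy.deepcopy(model)
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # early stopping
    return best_model, best_score


# Toy demonstration with made-up dev scores (not results from the paper).
dev_scores = [0.60, 0.65, 0.64, 0.63, 0.62, 0.70]
toy_model = {"epochs_seen": 0}


def toy_train_epoch(model: dict, learning_rate: float) -> None:
    model["epochs_seen"] += 1


def toy_evaluate(model: dict) -> float:
    return dev_scores[model["epochs_seen"] - 1]


best_model, best_score = train_with_early_stopping(
    toy_model, toy_train_epoch, toy_evaluate,
    learning_rate=FINETUNE_LR, max_epochs=len(dev_scores), patience=3,
)
print(best_model["epochs_seen"], best_score)
```

In the toy run, training stops after three epochs without improvement, so the late score of 0.70 is never reached; this is the usual trade-off of a patience-based criterion.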

Lang.    Tweet
English  I'd rather dump gasoline all over myself and run into a burning building than use Excel.
French   Je préférerais jeter de l'essence partout et tomber dans un immeuble en feu plutôt que d'utiliser Excel.
German   Ich würde lieber Benzin auf mich werfen und in ein brennendes Gebäude laufen, als Excel zu benutzen.
Spanish  Prefiero tirar gasolina sobre mí mismo y correr hacia un edificio en llamas que usar Excel.
Italian  Preferirei buttarmi la benzina addosso e correre in un edificio in fiamme piuttosto che usare Excel.
Table 1: Example of a tweet and its translations.

We trained our models over 10 datasets and tested them over five different test sets in five languages. A summary of the datasets is shown in Table 2. It was impossible to obtain all of the original datasets because of the nature of the data: if a tweet has been deleted, it should not be available online. Nevertheless, we think that our results are competitive, since we are using state-of-the-art models and obtain better results than those reported in the articles using the original datasets. For the English test set, which is the exact one used for SemEval-2017, our results are higher than those of the winner of the challenge.
For the pre-training, we used the datasets of the SemEval-2013 to SemEval-2016 challenges, for a total of 47,762 tweets, and 238,810 when using data-augmentation with automatic translation. The 2,000 tweets from the devtest of SemEval-2016 were used as development set, and the test set from SemEval-2017 was used as test set. For the French, German and Italian datasets, we used the same partition as the one used in the original challenges, with the tweets that were available online. For the Spanish dataset, the test sets were not available, hence we used the development set of TASS-2019 as test set, and the development set of TASS-2018 as development set. For the Italian dataset, we discarded the tweets with mixed sentiments.
We computed metrics that are widely used for this kind of task in order to compare our models: the average recall, the average of the F1 scores of the positive and negative classes (F1 PN), as well as the macro F1 score.
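For reference, the three metrics can be computed as in the following minimal sketch. It assumes a three-class setting with the labels negative, neutral and positive; the label names and function names are illustrative assumptions, and the toy predictions are made up.

```python
from typing import Dict, Sequence

# Assumption: three sentiment classes, as in the SemEval polarity task.
LABELS = ("negative", "neutral", "positive")


def per_class_scores(
    gold: Sequence[str], pred: Sequence[str], labels: Sequence[str] = LABELS
) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 for each class."""
    scores: Dict[str, Dict[str, float]] = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def sentiment_metrics(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Average recall, F1 PN (positive/negative only) and macro F1."""
    s = per_class_scores(gold, pred)
    return {
        "avg_recall": sum(s[label]["recall"] for label in LABELS) / len(LABELS),
        "f1_pn": (s["positive"]["f1"] + s["negative"]["f1"]) / 2,
        "macro_f1": sum(s[label]["f1"] for label in LABELS) / len(LABELS),
    }


gold = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
pred = ["positive", "neutral", "negative", "negative", "neutral", "positive"]
metrics = sentiment_metrics(gold, pred)
print(metrics)
```

Average recall and macro F1 weight every class equally regardless of its frequency, while F1 PN ignores the neutral class's own F1 score.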

Results
The results of the experiments are shown in Table 3. We can compare the results of vanilla models with models pre-trained over English datasets for sentiment analysis and with models pre-trained over English and automatically translated datasets for sentiment analysis. Machine translation was used to translate all the training sets into 4 other languages. One important thing to note is that we do not compare our system to other state-of-the-art systems on those datasets, because the complete datasets are not available.
Here we focus on the impact of data-augmentation using automatic translation combined with a multilingual model pre-trained over English tweets.
Nevertheless, we believe that the results of the first configuration, without any pre-training or data-augmentation, are very competitive. For example, the best result reported by the authors of SB10k is an F1 PN of 65.09, which is below the 67.1 we obtained with the Vanilla configuration.
The best results over all non-English languages are obtained using both the pre-training and the data-augmentation technique.
Table 3: Results of the different configurations. All the models were originally pre-trained over general text data.

Analysis
Because the original tweet datasets are not available online, it is difficult for us to compare with the results in the literature for the datasets other than English. Nevertheless, the focus of our paper is not on beating the state of the art but on proposing a method that uses multilingual data to enhance the performance of a model on non-English data. Interestingly, we found better results than Nguyen et al. (2020) for both RoBERTa and XLM-RoBERTa over SemEval-2017. We think that this may be the result of adjusting class weights in our loss function to handle imbalanced classes.
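The paper does not specify how the class weights were set. As one common possibility, the sketch below uses inverse-frequency weights in a weighted cross-entropy; the weighting scheme, the normalisation by the total weight and all names are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_cross_entropy(
    probabilities: Sequence[Dict[str, float]],
    gold: Sequence[str],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of -log p(gold label), normalised by the total weight."""
    total = 0.0
    weight_sum = 0.0
    for probs, label in zip(probabilities, gold):
        w = weights[label]
        total += -w * math.log(probs[label])
        weight_sum += w
    return total / weight_sum


# Toy imbalanced label distribution (made up, not taken from the datasets).
train_labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
weights = inverse_frequency_weights(train_labels)

# A classifier that always predicts a uniform distribution.
uniform = {"negative": 1 / 3, "neutral": 1 / 3, "positive": 1 / 3}
loss = weighted_cross_entropy([uniform] * len(train_labels), train_labels, weights)
print(weights, loss)
```

With such weights, errors on the rare class contribute more to the loss than errors on the frequent class, which discourages the model from defaulting to the majority label.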

Monolingual versus multilingual
As pointed out by Nguyen et al. (2020), the best results over English are obtained using a monolingual model, when compared to the equivalent multilingual model. Hence, RoBERTa reaches higher performance than its multilingual counterpart, XLM-RoBERTa. This behavior is reproduced on the French datasets using CamemBERT, the French version of RoBERTa (Martin et al., 2020). Nevertheless, pre-training the multilingual model yields an increase in performance over the French model. We think that this may be due to the lack of available examples in the target language. This confirms the hypothesis that pre-training a multilingual model with available data in order to use it on another language can be a good strategy to improve the results on a target language with fewer available training examples.

Effect of data-augmentation
Finally, the data-augmentation technique slightly improves the results for almost every language, to varying degrees. The largest improvement is obtained for Spanish, with a gain of 1.5 points of average recall compared with the model only pre-trained over English. The improvement over German is questionable, which may be due to the size of the dataset: the German training set is more than twice the size of the Spanish one, and Spanish is the language where pre-training gives the largest boost in performance.

Conclusion
We presented a technique that helps to improve the results of a sentiment analysis system over non-English tweets. We use a multilingual model that is able to process external English data, available in large quantities, to pre-train the model, and machine translation to augment the dataset. This technique is simple, yet it makes it possible to take advantage of multilingual models for non-English tweet datasets of limited size.