Multi-Level Sentiment Analysis of PolEmo 2.0: Extended Corpus of Multi-Domain Consumer Reviews

In this article we present an extended version of PolEmo, a corpus of consumer reviews from 4 domains: medicine, hotels, products and school. The current version (PolEmo 2.0) contains 8,216 reviews comprising 57,466 sentences. Each text and sentence was manually annotated with sentiment in a 2+1 scheme, which gives a total of 197,046 annotations. We obtained a high value of Positive Specific Agreement: 0.91 for texts and 0.88 for sentences. PolEmo 2.0 is publicly available under a Creative Commons license. We explored recent deep learning approaches to sentiment recognition, such as Bi-directional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT).

... language processing has resulted in a significant increase in sentiment analysis techniques (Zhang et al., 2018). For some languages this increase is effectively limited by the lack of good quality resources for this task, especially in the form of manually annotated corpora (Balahur and Turchi, 2012; Dashtipour et al., 2016).
Analysis of the existing language resources in the area of sentiment analysis shows that they largely concern the English language (Dashtipour et al., 2016). However, there is a clearly growing interest in other languages, often much more complex than English (e.g. Slavic languages, with their loose syntax and rich inflection), and new resources are becoming available for them, e.g., Slovene (Bučar et al., 2018), Czech (Habernal and Brychcín, 2013) or Russian (Rogers et al., 2018). Because only a small number of corpora manually annotated with sentiment are available for the Polish language, we decided that the construction of the PolEmo resource would be a valuable contribution to the collection of publicly available resources for sentiment analysis. In the future it may also provide a basis for shared tasks that include the recognition of sentiment for the Polish language. Both for the construction of the corpus and for further research, we used the experience gained from the manual annotation of the Polish WordNet, plWordNet 4.0 Emo (Janz et al., 2017; Kocoń et al., 2018a,b), in which the sentiment metadata of more than 55,000 lexical units were described.
The main objectives of the article are to present:
• The current state of resources related to the analysis of sentiment for the Polish language;
• The method of selecting data for the PolEmo 2.0 corpus, the annotation method, the annotation results and the analysis of annotation errors;
• The results of research related to the automatic analysis of sentiment, with particular emphasis on the importance of the text domain in this topic.
The key contributions of these studies include:
• Detailed description of the procedure of building PolEmo 2.0: a manually annotated corpus of consumer reviews from 4 domains (medicine, school, hotels, products) at 2 levels of sentiment granularity (document, sentence);
• Detailed analysis of manual annotation with regard to frequently occurring errors;
• Development of methods based on deep learning (BiLSTM, BERT), adapted to the PolEmo 2.0 corpus, also using a sentiment lexicon generated from plWordNet 4.0 Emo;
• Performing tests on sets prepared for the analysis of the quality of methods (1) evaluated on texts within a given domain, (2) evaluated on texts from various domains, (3) trained on texts that do not include a given domain and tested on that domain;
• Comparison of deep learning methods with classic methods (Logistic Regression), especially in the context of the ability to generalize the problem of recognizing sentiment and to provide a semantic representation that is as independent of the domain as possible;
• Making the PolEmo 2.0 corpus available under an open license.

Related Work
There are several well-known resources annotated with sentiment for English, e.g.: MPQA 3.0 (Deng and Wiebe, 2015), the Stanford Sentiment Treebank (Socher et al., 2013), Amazon Product Data (He and McAuley, 2016), Pros And Cons Dataset (Ganapathibhotla and Liu, 2008), corpora developed within the Semantic Evaluation workshops (Nakov et al., 2016; Pontiki et al., 2016), SentiWordNet (Baccianella et al., 2010) or Opinion Lexicon (Hu and Liu, 2004). There are also different approaches and tools for multilingual sentiment analysis (Lo et al., 2017), which are based on transformations of the existing resources. In this section we focus on the resources prepared directly for Polish.

Polish Sentiment Corpora
There are corpora for the Polish language that can be used for automatic sentiment analysis. One of them is a corpus prepared for the sentiment recognition shared task within the PolEval2017 workshop (Wawer and Ogrodniczuk, 2017). The corpus contains 1,550 sentences annotated at the level of phrases determined by the dependency parser. The sentences came from consumer reviews and covered 3 domains: perfume, clothing and other. Each node of the dependency tree received one of three sentiment annotations: -1 (negative), 0 (neutral), 1 (positive). Most of the systems participating in the PolEval2017 competition used Tree LSTM adapted to dependency trees, including the best system, which reached an accuracy of 79% on this data. Another resource is the HateSpeech corpus, containing 2,000 posts crawled from the public Polish web. These texts were annotated for hate speech. The annotator team reached an agreement score of Krippendorff's α = 0.6 (Krippendorff, 2018). An SVM model trained on a subset of 1,500 texts (containing equal amounts of hate speech and non-hate speech) obtained a precision of 0.8 (Troszyński and Wawer, 2017).
Another interesting resource is the Polish Corpus of Suicide Notes (PCSN) (Zaśko-Zielińska, 2013). The PCSN is one of very few such resources in the world. It includes 1,244 genuine suicide notes (SNs) that have been scanned and manually transcribed. Each SN was linguistically annotated on several levels, including selected semantic and pragmatic phenomena (Zaśko-Zielińska, 2013). The annotation is stored in a TEI-based format (Marcińczuk et al., 2011), with a corrected version in a separate layer. The PCSN also includes a subcorpus of 334 counterfeited (elicited) SNs. They were created by volunteers who were asked to imitate a real SN for an imaginary person whose characteristics had been provided at the beginning of the experiment. Most volunteers were told that the notes written by them would be used to deceive a computer program. Due to the sensitive nature of the texts and the legal obligations of the author, the corpus is not publicly available. In a previously described experiment, we collected 3,200 texts from the Internet as examples of non-letters. Using SVM with a rich set of features, we obtained an F1-score of 90.06% in the task of distinguishing between genuine SNs, counterfeited SNs and non-letters.

Polish Sentiment Lexicons
One of the largest Polish sentiment lexical resources in terms of the number of annotations is plWordNet 4.0 Emo (Janz et al., 2017; Kocoń et al., 2018a). This dataset is available under the WordNet 3.0 license. It was built within the CLARIN-PL project (Piasecki, 2014). The manual annotation is done at the level of lexical units (Zaśko-Zielińska et al., 2015). The available polarity values are: strong negative, weak negative, neutral, weak positive, strong positive, ambiguous. An annotator could assign only one of these values to a single lexical unit. There are more than 83,000 annotations covering more than 54,000 lexical units and 41,000 synsets (Kocoń et al., 2018b). About 22,000 of the polarity annotations are other than neutral, and these annotations cover 13,000 lexical units and 9,000 synsets (22% of all synsets containing annotated units). plWordNet 4.0 Emo is used in the research presented in this article as a knowledge base for the sentiment recognition task.
Another lexicon is the Nencki Affective Word List (NAWL). It is a database of Polish words suitable for studying various aspects of language and emotions. 2,902 Polish words from the NAWL were presented to 265 subjects, who were instructed to rate them according to the intensity of each of the five basic emotions: happiness, anger, sadness, fear and disgust. The total number of ratings was 385,575.
The next resource is the Polish Sentiment Dictionary (Wawer, 2012; Wawer and Rogozinska, 2012). It contains 3,704 words with sentiment scores computed using the supervised methods presented in (Wawer and Rogozinska, 2012).
Recently, a new resource has appeared in the Sentimenti project, containing a large database of annotated lexical units and annotated texts. Details are described in Section 2.3.

Sentimenti Project
This year, the first results of the Sentimenti project (Kocoń et al., 2019a) were published. The project aimed to create methods for analyzing texts written on the Internet in terms of the emotions they arouse in their recipients. A large database has been created, in which 30,000 lexical units from the plWordNet database and 7,000 texts were annotated. Most of the texts were consumer reviews from the domains of hotels and medicine. The elements were annotated by 20,000 unique Polish respondents in a Computer Assisted Web Interview survey, and more than 50 marks were obtained for each element. Each mark determines the polarity of the element, its stimulation and the basic emotions aroused in the recipient. The total number of manual annotations is 3,742,611 for texts and 19,141,041 for lexical units. The first results concerning the automatic recognition of polarity and emotions for this set are presented in (Kocoń et al., 2019a), and the propagation of this annotation with the use of Heterogeneous Structured Synset Embeddings is presented in (Kocoń et al., 2019b). Due to the commercial nature of the Sentimenti project, only 20% of the project data is planned to be made publicly available soon. The data will be published on the main project site.
The Sentimenti project has interested both the scientific community and business. Within the CLARIN-PL project, we decided that in addition to a large annotated plWordNet lexicon, there should also be a large corpus annotated with sentiment, available under an open license. In the next part we present the works related to the preparation of PolEmo.

Motivation
Linguistic research on sentiment recognition involves two approaches: (1) bottom-up, from the perspective of analysing the occurrence of emotional words, and (2) top-down, from the perspective of the entire document. The first approach is usually a consequence of the creation of a sentiment lexicon, e.g. manual annotation of the WordNet (Baccianella et al., 2010). The second results from the analysis of specific text content, in which we see that the sentiment of a word or phrase changes under the influence of the surrounding context (Taboada et al., 2008). This change may vary depending on the domain of the text.
A discourse perspective in sentiment analysis is an attempt to address the limitations of bottom-up methods (e.g. problems with negation, focusing on adjectives). It uses the findings of Rhetorical Structure Theory (Mann and Thompson, 1988). This approach takes into account local and global orientation in the text, discourse structure and topicality (Taboada et al., 2008). It allows the researcher to extract the most important sentences from the text from the perspective of the entire discourse context, using the nucleus-satellite method (Wang et al., 2012). The relevance of the sentences is evaluated in relation to the main topic, and the analysis omits some less important parts of the text.
There are interesting articles focused on domain-oriented sentiment analysis (Kanayama and Nasukawa, 2006), where a system is trained on labeled reviews from one source domain but is meant to be deployed on another (Glorot et al., 2011). The latter article describes research carried out on the Amazon Product Data (He and McAuley, 2016). The ratings were assigned to the reviews by their authors. Moreover, the ratings applied to the entire text. Our idea was to obtain a set of reviews that would be rated by the recipients and not by the authors of the content. The annotation should take into account not only the level of the entire review, but also the level of its individual sentences. Additionally, the dataset was supposed to be multi-domain, in order to evaluate potential knowledge transfer across domains.

Dataset
In the initial part of the work, presented in (Kocoń et al., 2019), we chose online customer reviews from four domains, presented in Table 1. At the beginning of our work we had only 1,000 texts for each of the following domains: school, products, medicine. In the case of product reviews, we also had reviewer metadata: the number of stars the reviewer assigned to a specific review (from 1 to 5, where 5 means the most positive review). We used this information to select the reviews for the corpus, adding 200 reviews from each star category. On the basis of a preliminary analysis of several dozen examples of opinions, we came to the conclusion that neutral examples are very difficult to find among reviews. In the meantime, the corpus was extended by 8,000 texts from the category Medicine and 17,000 texts from the category Hotels, also with a uniform distribution across the star categories available in the source data (also 1 to 5). In order to capture the phenomenon of neutral text, we decided to add 2,000 new texts to each of the last two domains (medicine, hotels). These texts were fragments of articles from information portals on the hotel industry and health.
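The star-balanced selection described above can be illustrated with a minimal sketch. It assumes each review is a dictionary with a `stars` field; the function name `sample_per_star`, the field names and the toy data are hypothetical and are not taken from the PolEmo tooling.

```python
import random
from collections import defaultdict


def sample_per_star(reviews, per_star=200, seed=0):
    """Pick up to `per_star` reviews from every star category present."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    by_star = defaultdict(list)
    for review in reviews:
        by_star[review["stars"]].append(review)

    selected = []
    for stars in sorted(by_star):
        pool = by_star[stars]
        # A category with fewer reviews than requested contributes all of them.
        selected.extend(rng.sample(pool, min(per_star, len(pool))))
    return selected


# Toy data (hypothetical): 300 reviews for each star value from 1 to 5.
reviews = [{"id": f"{s}-{i}", "stars": s} for s in range(1, 6) for i in range(300)]
sample = sample_per_star(reviews)
```

With five star categories and 200 reviews per category, the sketch yields 1,000 texts, which matches the initial size of the product domain.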
In Section 3.3 we present how the genre structure of a customer review affects the text sentiment polarity. It is an enhancement of the discourse perspective in sentiment analysis.

Pilot Annotation
Our CLARIN-PL pilot study on sentiment analysis of customer reviews was conducted in 2018. The initial part of the analysis included 3,000 reviews. Each text was manually annotated by two annotators, a psychologist and a linguist, who worked according to the general guidelines. The annotation tool used for this task was Inforex 10 (Marcińczuk et al., 2012; Oleksy, 2019), a web-based system for text corpora management, annotation and analysis, available as an open source project. In the pilot project, we decided to deal with the sentiment annotation of the entire text. There was also an attempt to manually extract descriptions of particular aspects of the review. In both annotation cases we used the same tag system that is used in plWordNet Emo for lexical units. We assumed that reviews are always characterised by a certain polarity, which is why we did not use the [0] (neutral) tag in the pilot annotation.
In the process of annotation we focused mainly on the strategic places of the text. In the consumer review these are opening and closing sentences, i.e. a text frame. The opening sentences consist of the general opinion of the author about the subject of the review, and the closing sentences contain the author's recommendation for the review recipients. The annotators have developed their first overall rating based on these two segments.
In the text, review authors changed their opinions only subtly. Regardless of the modification of the main opinion in the text, we did not use the [amb] tag when the frame of the text was clearly positive or negative. Polarity of the text frame was influenced not only by the lexical content, but also by non-verbal elements: emoticons or multiplication of punctuation marks, e.g. exclamation marks.
The annotators were also asked to distinguish those parts of the text that are placed in one sentence but relate to different aspects (e.g. the teacher's appearance or teaching skills). This task turned out to be very difficult, especially when it came to specifying, even with the help of guidelines, how to mark the precise boundaries of a given aspect in the text. The Positive Specific Agreement (Hripcsak and Rothschild, 2005) between the annotators in the task of annotating the boundaries of aspects was below 0.15. The annotation concept was therefore radically changed; the new approach is presented in Section 3.4.
10 https://github.com/CLARIN-PL/Inforex

PolEmo Annotation Guidelines
In the main stage of the project we decided to annotate the sentiment at the level of the whole text (a meta level) and at the sentence level. We assumed that this strategy would make it possible to reach an acceptable value of PSA, because the division of the text into sentences was determined by the MACA tool (Radziszewski and Śniatowski, 2011), so the task was limited to annotating the sentiment of each sentence. We followed the rule that the meta annotation results partially from the sentence annotations; however, the frame polarity is the main factor for the final meta annotation. We prepared the following annotation tags, regardless of whether an entire text or a sentence is annotated:
• SP - entirely positive;
• WP - generally positive, but there are some negative aspects within the review;
• 0 - neutral;
• WN - generally negative, but there are some positive aspects within the review;
• SN - entirely negative;
• AMB - there are both positive and negative aspects in the text that are balanced in terms of relevance.
This time we used [0] tag (neutral) because in the main stage of the project we extended the corpus with neutral texts presented in Section 3.2. Also reviews that are not neutral often contain neutral sentences.
We tested the new guidelines on a subset of 50 documents, achieving a PSA of 80% for the meta level and 78% for the sentence level. In the second iteration of the annotation guidelines improvement, the values were 87% (meta) and 85% (sentence). In the last iteration of the improvement of the guidelines, the annotators reached a PSA of 90% (meta) and 87% (sentence).
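For readers unfamiliar with PSA, the following minimal sketch computes Positive Specific Agreement for a single label between two annotators, using the standard definition 2a / (2a + b + c), where a is the number of items both annotators marked with the label and b and c are the numbers of items only one of them marked with it. The function name and the toy annotations are hypothetical, and how the per-label values were aggregated into the overall figures reported here is not specified in the text, so the sketch is limited to the per-label case.

```python
def positive_specific_agreement(labels_a, labels_b, label):
    """PSA for one label: 2 * |both chose label| / (|A chose label| + |B chose label|)."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")

    both = sum(1 for a, b in zip(labels_a, labels_b) if a == label and b == label)
    total = labels_a.count(label) + labels_b.count(label)
    # The measure is undefined when neither annotator used the label.
    return 2 * both / total if total else float("nan")


# Toy annotations (hypothetical) of six texts by two annotators.
ann_a = ["SP", "SP", "WP", "AMB", "SN", "0"]
ann_b = ["SP", "WP", "WP", "WN", "SN", "0"]

psa_sp = positive_specific_agreement(ann_a, ann_b, "SP")
psa_sn = positive_specific_agreement(ann_a, ann_b, "SN")
psa_wn = positive_specific_agreement(ann_a, ann_b, "WN")
```

In this toy case the annotators agree fully on SN, partially on SP, and never on WN, which mirrors the pattern of strong agreement on clear-cut labels and weak agreement on intermediate ones.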

PolEmo 2.0 Annotation Analysis
We decided to publish the first results of the research on the PolEmo 1.0 corpus when the number of annotated reviews reached 8,462 and the number of annotated sentences was 35,724 (Kocoń et al., 2019). Because PolEmo 2.0 includes only those elements that received 2 annotations from linguists and were agreed by the super-annotator, this time we provide 8,216 reviews and 57,466 sentences. In Section 5 we present Table 7 with the final distribution of annotations and Table 6 with the number of elements in each domain (evaluation data splits). In this section we focus on annotation agreement and annotation errors.
Table 2 presents PSA values obtained at the level of text and sentence for all domains. The overall PSA value is 83.41% for texts and 84.56% for sentences. It is worth noting that for the domains to which we have not added neutral texts (products, school), there are practically no neutral annotations at the text level (see Table 7). The highest values are obtained for the most obvious categories (SP, SN and 0), regardless of the level of text description. For the remaining categories the PSA value is lower than 40.00% in most cases.
Table 3 presents the distribution of disagreements between annotators at the text level. The most common disagreement is within the pair of tags [AMB/WP]. Nearly half of the disagreements involve a pair of the AMB, WP and WN tags. This suggests that annotators, despite the guidelines, have difficulty in judging the relevance of aspects regardless of the domain, or that it is a very subjective task.
Table 4 presents the distribution of disagreements between annotators at the sentence level. The most common disagreement is within the pair of tags [SN/0]. This time, disagreements between the AMB/WP/WN tags account for less than 20% of cases. Most of the errors are related to the marking of neutral sentences.
The analysis of specific cases and a discussion with linguists showed that in the task of annotating sentences it is difficult to isolate a sentence from the context and sometimes the annotation of the next sentence was a consequence of the sentiment of the previous sentence.
We have found that it is difficult to decide on the relevance of the aspects, and that without creating a hierarchy of relevance of aspects for a given domain it will be hard to achieve better agreement for the WP/WN/AMB tags. Because mistakes often occur within these tags, we have combined them into one AMB tag. PolEmo 2.0 will also be available with the original tags, but research (Kocoń et al., 2019) has shown that machine learning methods achieve an F-score for the WP/WN/AMB classes no higher than PSA. The evaluation data in this research has the WP/WN/AMB tags merged into one AMB tag. Table 5 presents PSA values after the merging step. The total PSA increased from 83% to 91% for texts and from 85% to 88% for sentences.
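The merging step can be expressed as a simple label mapping. The sketch below is illustrative only: the names `MERGE` and `merge_label` and the toy annotation pairs are hypothetical, and the raw-agreement figure it computes is a plain proportion of matching pairs, not the PSA reported in Table 5.

```python
# WP, WN and AMB collapse into a single AMB label; SP, SN and 0 are unchanged.
MERGE = {"WP": "AMB", "WN": "AMB", "AMB": "AMB"}


def merge_label(tag):
    """Map one of the six original tags to one of the four merged tags."""
    return MERGE.get(tag, tag)


def raw_agreement(pairs):
    """Share of items on which both annotators chose the same tag."""
    return sum(1 for a, b in pairs if a == b) / len(pairs)


# Toy annotation pairs (hypothetical): two of the three disagreements
# fall inside the WP/WN/AMB group and disappear after merging.
pairs = [("AMB", "WP"), ("SP", "SP"), ("WN", "AMB"), ("SN", "0")]
merged_pairs = [(merge_label(a), merge_label(b)) for a, b in pairs]

before = raw_agreement(pairs)
after = raw_agreement(merged_pairs)
```

The remaining disagreement in the toy data, [SN/0], is of the kind that merging cannot resolve, which is consistent with the smaller gain observed at the sentence level.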

Multi-Level Sentiment Recognition
Recently, deep neural networks have shown relatively good performance among the available methods of processing such information (Glorot et al., 2011). The possibility of retrieving data from different sources, such as social networks (Pak and Paroubek, 2010), publicly available discussion boards or marketing platforms, combined with proper annotation of the training data set, can provide not only a simple positive, negative or neutral classification but also lead to accurate fine-grained sentiment prediction (Guzman and Maalej, 2014). We selected the same classifiers for the recognition tasks as in (Kocoń et al., 2019): (1) Logistic Regression as a fastText recognition model (Joulin et al., 2017) with KGR10 word embeddings (Kocoń and Gawor, 2018), providing a baseline for text classification; (2) BiLSTM in two variants: KGR10 embeddings as the only features, and KGR10 embeddings extended with general polarity information from the sentiment dictionary described in (Kocoń et al., 2019); (3) BERT (Devlin et al., 2018) with an additional sequence classification layer.
We changed the BiLSTM and BERT architectures. In the case of BiLSTM, instead of using a fixed input length we changed the model to work with text of any length. The input tensor shape is (None, 300) for the embedding-only variant (BiLSTM) and (None, 306) for the embedding+dictionary variant (BiLSTMd). We changed the shape of the second Gaussian noise layer to (None, 300) and (None, 306), respectively. The next layers remain the same, i.e. (1) a BiLSTM layer with 1024 hidden units, (2) a dropout layer (ratio 0.2). The last dense layer changed due to the reduction of sentiment labels from 6 to 4 by the label merging process described in Section 3.5. For BERT we used the same architecture as in (Kocoń et al., 2019) for whole texts, but we changed it for sentences. We reduced the maximum sequence length from 512 to 64 (which covers more than 99% of sentences) and increased the batch size from 32 to 128.
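The two input shapes can be illustrated with a minimal sketch of how a text of arbitrary length becomes a (length, 300) or (length, 306) matrix. The encoding of the 6 additional dimensions is an assumption made here for illustration: a one-hot vector over the six plWordNet Emo polarity values, all zeros for words missing from the dictionary. The actual encoding is described in (Kocoń et al., 2019) and may differ. All names and the toy embeddings are hypothetical.

```python
EMBEDDING_DIM = 300
# Assumption: one extra dimension per polarity value of plWordNet Emo.
POLARITIES = [
    "strong negative", "weak negative", "neutral",
    "weak positive", "strong positive", "ambiguous",
]


def token_features(embedding, polarity=None):
    """Concatenate a 300-dim embedding with a 6-dim one-hot polarity vector."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError("unexpected embedding size")
    one_hot = [0.0] * len(POLARITIES)
    if polarity is not None:
        one_hot[POLARITIES.index(polarity)] = 1.0
    return list(embedding) + one_hot


def encode_text(tokens, embeddings, lexicon):
    """Return one 306-dim row per token; the number of rows is not fixed."""
    zero = [0.0] * EMBEDDING_DIM  # fallback for out-of-vocabulary tokens
    return [token_features(embeddings.get(t, zero), lexicon.get(t)) for t in tokens]


# Toy resources (hypothetical values, not real KGR10 vectors).
embeddings = {"dobry": [0.1] * EMBEDDING_DIM, "hotel": [0.2] * EMBEDDING_DIM}
lexicon = {"dobry": "strong positive"}

matrix = encode_text(["dobry", "hotel", "xyz"], embeddings, lexicon)
```

Because the number of rows follows the number of tokens, the first dimension of the input corresponds to the `None` in the shapes given above.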

Evaluation
As in (Kocoń et al., 2019a; Kocoń et al., 2019), we prepared three variants of evaluation of the sentiment classification methods:
• SD - Single Domain - evaluation sets created using elements from the same domain;
• DO - Domain Out - train/dev sets created using elements from 3 domains, test set from the remaining domain. This variant makes it possible to evaluate the ability of the classification method to capture domain-independent sentiment features;
• MD - Mixed Domains - SD train/dev/test sets joined respectively. This variant makes it possible to examine the ability of the classifier to generalise the task of sentiment analysis across all available domains.
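The three variants can be sketched as operations on per-domain train/dev/test splits. This is a minimal illustration under stated assumptions: the split sizes are toy values, the function names are hypothetical, and the DO test set is assumed to be the test part of the held-out domain (the text only says that it comes from the remaining domain).

```python
PARTS = ("train", "dev", "test")


def single_domain(splits, domain):
    """SD: train, dev and test all come from one domain."""
    return splits[domain]


def domain_out(splits, held_out):
    """DO: train/dev from the other domains, test from the held-out domain."""
    others = [d for d in splits if d != held_out]
    return {
        "train": [x for d in others for x in splits[d]["train"]],
        "dev": [x for d in others for x in splits[d]["dev"]],
        "test": list(splits[held_out]["test"]),  # assumption: SD test part
    }


def mixed_domains(splits):
    """MD: the SD train, dev and test sets of all domains joined respectively."""
    return {p: [x for d in splits for x in splits[d][p]] for p in PARTS}


# Toy per-domain splits (hypothetical): 8 train, 1 dev, 1 test item each.
domains = ["hotels", "medicine", "products", "school"]
splits = {
    d: {
        "train": [f"{d}-train-{i}" for i in range(8)],
        "dev": [f"{d}-dev-0"],
        "test": [f"{d}-test-0"],
    }
    for d in domains
}

do_m = domain_out(splits, "medicine")
md = mixed_domains(splits)
```

Keeping the SD splits as the single source for all three variants ensures that no test item of one variant leaks into the training data of another variant built for the same domain.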
We use the SDT, DOT, and MDT abbreviations for text evaluation types and SDS, DOS, and MDS for sentence evaluation types. We also use the first letters of the domain names (Hotels, Medicine, School, Products) as suffixes for the SD* and DO* variants, e.g. SDS-H is a Single Domain evaluation type performed on Sentences within the Hotels domain, whereas DOT-M is a Domain-Out evaluation type performed on Texts, trained on texts outside the Medicine domain and tested on texts from that domain. Table 6 shows the number of texts and sentences annotated by linguists for all evaluation types, divided into training, validation and test sets. The distribution of labels for each domain (both texts and sentences) is presented in Table 7. Table 8 presents the values of F1-score for each label, global F1-score, micro-AUC and macro-AUC for all evaluation types related to the texts. In the case of evaluation for a single domain for each label, fastText (using Logistic Regression) outperformed other classifiers in 16 out of 28 distinguishable cases. The worst results are obtained for ambiguous cases, but in 9 out of 13 cases the F1-score is higher than 0.5, and this result is much better than that obtained for the intermediate labels (weak positive and weak negative) presented in work (Kocoń

Conclusions
For the classification of whole texts, BERT's performance falls short of what one would expect from such an advanced method. Looking at both tables (8 and 9), BERT's results are the best in 64 out of 182 label-specific cases. BiLSTM outperformed other methods in 48 cases. Adding an external sentiment dictionary helped in 40 label-specific cases.
Overall, BiLSTM performance is better in 88 out of 182 cases. BERT dominance (when distinguishing between BiLSTM and BiLSTMd) is observed in DOT and all sentence cases. The MDT case is the most promising in terms of the further use of the recognition method in applications such as brand monitoring or early crisis detection. The values of the general F1, micro AUC and macro AUC are the highest for BiLSTM variants (see Table 6).
We published PolEmo 2.0 in the CLARIN-PL DSpace repository under the Creative Commons 4.0 License. We also intend to test the contextualized embeddings that we are currently building using the ELMo deep word representations method (Peters et al., 2018), with the use of the large KGR10 corpus presented in work (Kocoń et al., 2019a). We also want to train the basic BERT model with the use of KGR10 to investigate whether it will improve the quality of sentiment recognition. It would also be very interesting to use the propagation of sentiment annotation in WordNet (Kocoń et al., 2018a,b) to increase the coverage of the sentiment dictionary and potentially improve the recognition quality as well. This objective can be achieved by other complex methods such as OpenAI GPT-2 (Radford et al., 2019) and domain dictionaries construction methods utilising WordNet (Kocoń and Marcińczuk, 2016