SemEval-2016 Task 5: Aspect Based Sentiment Analysis

This paper describes the SemEval 2016 shared task on Aspect Based Sentiment Analysis (ABSA), a continuation of the respective tasks of 2014 and 2015. In its third year, the task provided 19 training and 20 testing datasets for 8 languages and 7 domains, as well as a common evaluation procedure. Of these datasets, 25 were for sentence-level and 14 for text-level ABSA; the latter was introduced for the first time as a subtask in SemEval. The task attracted 245 submissions from 29 teams.


Introduction
Many consumers use the Web to share their experiences about products, services or travel destinations (Yoo and Gretzel, 2008). Online opinionated texts (e.g., reviews, tweets) are important for consumer decision making (Chevalier and Mayzlin, 2006) and constitute a source of valuable customer feedback that can help companies to measure satisfaction and improve their products or services. In this setting, Aspect Based Sentiment Analysis (ABSA), i.e., mining opinions from text about specific entities and their aspects (Liu, 2012), can provide valuable insights to both consumers and businesses. An ABSA method can analyze large amounts of unstructured texts and extract (coarse- or fine-grained) information not included in the user ratings that are available in some review sites (e.g., Fig. 1).
*Corresponding author: mpontiki@ilsp.gr.
Sentiment Analysis (SA) touches every aspect (e.g., entity recognition, coreference resolution, negation handling) of Natural Language Processing (Liu, 2012) and, as Cambria et al. (2013) mention, "it requires a deep understanding of the explicit and implicit, regular and irregular, and syntactic and semantic language rules". Within the last few years several SA-related shared tasks have been organized in the context of workshops and conferences focusing on somewhat different research problems (Seki et al., 2007; Seki et al., 2008; Seki et al., 2010; Mitchell, 2013; Nakov et al., 2013; Rosenthal et al., 2014; Pontiki et al., 2014; Rosenthal et al., 2015; Ghosh et al., 2015; Pontiki et al., 2015; Mohammad et al., 2016; Recupero and Cambria, 2014; Ruppenhofer et al., 2014; Loukachevitch et al., 2015). Such competitions provide training datasets and the opportunity for direct comparison of different approaches on common test sets.
Currently, most of the available SA-related datasets, whether released in the context of shared tasks or not (Socher et al., 2013; Ganu et al., 2009), are monolingual and usually focus on English texts. Multilingual datasets (Klinger and Cimiano, 2014; Jiménez-Zafra et al., 2015) provide additional benefits, enabling the development and testing of cross-lingual methods (Lambert, 2015). Following this direction, this year the SemEval ABSA task provided datasets in a variety of languages.
ABSA was introduced as a shared task for the first time in the context of SemEval in 2014; SemEval-2014 Task 4 (SE-ABSA14; http://alt.qcri.org/semeval2014/task4/) provided datasets of English reviews annotated at the sentence level with aspect terms (e.g., "mouse", "pizza") and their polarity for the laptop and restaurant domains, as well as coarser aspect categories (e.g., "food") and their polarity only for restaurants (Pontiki et al., 2014). SemEval-2015 Task 12 (SE-ABSA15; http://alt.qcri.org/semeval2015/task12/) built upon SE-ABSA14 and consolidated its subtasks into a unified framework in which all the identified constituents of the expressed opinions (i.e., aspects, opinion target expressions and sentiment polarities) meet a set of guidelines and are linked to each other within sentence-level tuples (Pontiki et al., 2015). These tuples are important since they indicate the part of text within which a specific opinion is expressed. However, a user might also be interested in the overall rating of the text towards a particular aspect. Such ratings can be used to estimate the mean sentiment per aspect from multiple reviews (McAuley et al., 2012). Therefore, in addition to sentence-level annotations, SE-ABSA16 (http://alt.qcri.org/semeval2016/task5/) also accommodated text-level ABSA annotations and provided the respective training and testing data. Furthermore, the SE-ABSA15 annotation framework was extended to new domains and applied to languages other than English (Arabic, Chinese, Dutch, French, Russian, Spanish, and Turkish).
The remainder of this paper is organized as follows: the task set-up is described in Section 2. Section 3 provides information about the datasets and the annotation process, while Section 4 presents the evaluation measures and the baselines. General information about participation in the task is provided in Section 5. The evaluation scores of the participating systems are presented and discussed in Section 6. The paper concludes with an overall assessment of the task.

Task Description
The SE-ABSA16 task consisted of the following subtasks and slots. Participants were free to choose the subtasks, slots, domains and languages they wished to participate in.
Subtask 1 (SB1): Sentence-level ABSA. Given an opinionated text about a target entity, identify all the opinion tuples with the following types (tuple slots) of information:
• Slot1: Aspect Category. Identification of the entity E and attribute A pairs towards which an opinion is expressed in a given sentence. E and A should be chosen from predefined inventories of entity types (e.g., "restaurant", "food") and attribute labels (e.g., "price", "quality").
• Slot2: Opinion Target Expression (OTE). Extraction of the linguistic expression used in the given text to refer to the reviewed entity E of each E#A pair. The OTE is defined by its starting and ending offsets. When there is no explicit mention of the entity, the slot takes the value "null". The identification of Slot2 values was required only in the restaurants, hotels, museums and telecommunications domains.
Subtask 2 (SB2): Text-level ABSA. Given a customer review about a target entity, the goal was to identify a set of {cat, pol} tuples that summarize the opinions expressed in the review. cat can be assigned the same values as in SB1 (E#A tuple), while pol can be set to "positive", "negative", "neutral", or "conflict". For example, for the review text "The So called laptop Runs to Slow and I hate it! Do not buy it! It is the worst laptop ever", a system should return the following opinion tuples: {cat: "laptop#general", pol: "negative"}, {cat: "laptop#operation_performance", pol: "negative"}.
Subtask 3 (SB3): Out-of-domain ABSA. In SB3 participants had the opportunity to test their systems in domains for which no training data was made available; the domains remained unknown until the start of the evaluation period. Test data for SB3 were provided only for the museums domain in French.

Data Collection and Annotation
A total of 39 datasets were provided in the context of the SE-ABSA16 task: 19 for training and 20 for testing. The texts were from 7 domains and 8 languages: English (en), Arabic (ar), Chinese (ch), Dutch (du), French (fr), Russian (ru), Spanish (es) and Turkish (tu). The datasets for the domains of restaurants (rest), laptops (lapt), mobile phones (phns), digital cameras (came), hotels (hote) and museums (muse) consist of customer reviews, whilst the telecommunication domain (telc) data consists of tweets. A total of 70790 manually annotated ABSA tuples were provided for training and testing: 47654 sentence-level annotations (SB1) in 8 languages for 7 domains, and 23136 text-level annotations (SB2) in 6 languages for 3 domains. Table 1 provides more information on the distribution of texts, sentences and annotated tuples per dataset.
The rest, hote, and lapt datasets were annotated at the sentence level (SB1) following the respective annotation schemas of SE-ABSA15 (Pontiki et al., 2015). Below are examples of annotated sentences for the aspect category "service#general" in en (1), du (2), fr (3), ru (4), es (5), and tu (6) for the rest domain and in ar (7) for the hote domain:
1. Service was slow, but the people were friendly.
3. 当然屏幕这么好 → {cat: "display#quality", pol: "positive"}
4. 更轻便的机身也便于携带。 → {cat: "camera#portability", pol: "positive"}
In addition, the SE-ABSA15 framework was extended to two new domains for which annotation guidelines were compiled: telc for tu and muse for fr. Below are two examples:
1. #Internet kopuyor sürekli :( @turkcell → {cat: "internet#coverage", trg: "Internet", pol: "negative"}
2. 5€ pour les étudiants, ça vaut le coup. → {cat: "museum#prices", trg: "null", pol: "positive"}
The text-level (SB2) annotation task was based on the sentence-level annotations; given a customer review about a target entity (e.g., a restaurant) that included sentence-level annotations of ABSA tuples, the goal was to identify a set of {cat, pol} tuples that summarize the opinions expressed in it. This was not a simple summation/aggregation of the sentence-level annotations, since an aspect may be discussed with different sentiment in different parts of the review. In such cases the dominant sentiment had to be identified. In case of conflicting opinions where the dominant sentiment was not clear, the "conflict" label was assigned. In addition, each review was assigned an overall sentiment label about the target entity (e.g., "restaurant#general", "laptop#general"), even if it was not included in the sentence-level annotations.

Annotation Process
All datasets for each language were prepared by one or more research groups as shown in Table 2. The en, du, fr, ru and es datasets were annotated using brat (Stenetorp et al., 2012), a web-based annotation tool, which was configured appropriately for the needs of the task. The tu datasets were annotated using a customized version of turksent (Eryigit et al., 2013), a sentiment annotation tool for social media. For the ar and the ch data in-house tools were used. Below are some further details about the annotation process for each language.
English. The SE-ABSA15 (Pontiki et al., 2015) training and test datasets (with some minor corrections) were merged and provided for training (rest and lapt domains). New data was collected and annotated from scratch for testing. In a first phase, the rest test data was annotated by an experienced linguist (annotator A, also an annotator for SE-ABSA14 and SE-ABSA15), and the lapt data by 5 undergraduate computer science students. The resulting annotations for both domains were then inspected and corrected (if needed) by a second expert linguist, one of the task organizers (annotator B). Borderline cases were resolved collaboratively by annotators A and B.
Arabic. The hote dataset was annotated in repeated cycles. In a first phase, the data was annotated by three native Arabic speakers, all with a computer science background; then the output was validated by a senior researcher, one of the task organizers. If needed (e.g., when inconsistencies were found), the annotations were given back to the annotators for revision.
Chinese. The datasets presented by Zhao et al. (2015) were re-annotated by three native Chinese speakers according to the SE-ABSA16 annotation schema and were provided for training and testing (phns and came domains).
Dutch. The rest and phns datasets (De Clercq and Hoste, 2016) were initially annotated by a trained linguist, native speaker of Dutch. Then, the output was verified by another Dutch linguist and disagreements were resolved between them. Finally, the task organizers inspected collaboratively all the annotated data and corrections were made when needed.
French. The train (rest) and test (rest, muse) datasets were annotated from scratch by a linguist, native speaker of French. When the annotator was not confident, a decision was made collaboratively with the organizers. In a second phase, the task organizers checked all the annotations for mistakes and inconsistencies and corrected them, when necessary.
Russian. The rest datasets of the SentiRuEval-2015 task (Loukachevitch et al., 2015) were automatically converted to the SE-ABSA16 annotation schema; then a linguist, native speaker of Russian, checked them and added missing information. Finally, the datasets were inspected by a second linguist annotator (also native speaker of Russian) for mistakes and inconsistencies, which were resolved along with one of the task organizers.
Spanish. Initially, 50 texts (134 sentences) from the whole available data were annotated by 4 annotators. The inter-annotator agreement (IAA) in terms of F-1 was 91% for the identification of OTE, 88% for the aspect category detection (E#A pair), and 80% for opinion tuples extraction (E#A, OTE, polarity). Given that the IAA was sufficiently high for all slots, the rest of the data was divided into 4 parts and each one was annotated by a different native Spanish speaker (2 linguists and 2 software engineers). Subsequently, the resulting annotations were validated and corrected (if needed) by the task organizers.
Turkish. The telc dataset was based on the data used by Yıldırım et al. (2015), while the rest dataset was created from scratch. Both datasets were annotated simultaneously by two linguists. Then, one of the organizers inspected the resulting annotations and corrected them when needed.

Datasets Format and Availability
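Similarly to SE-ABSA14 and SE-ABSA15, the datasets of SE-ABSA16 were provided in an XML format and are available (http://metashare.ilsp.gr:8080/repository/search/?q=semeval+2016) under specific license terms through META-SHARE (http://www.metashare.org/), a repository devoted to the sharing and dissemination of language resources (Piperidis, 2012). META-SHARE was implemented in the framework of the META-NET Network of Excellence (http://www.meta-net.eu/).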

Evaluation Measures and Baselines
The evaluation ran in two phases. In the first phase (Phase A), the participants were asked to return separately the aspect categories (Slot1), the OTEs (Slot2), and the {Slot1, Slot2} tuples for SB1. For SB2 the respective text-level categories had to be identified. In the second phase (Phase B), the gold annotations for the test sets of Phase A were provided and participants had to return the respective sentiment polarity values (Slot3).
Similarly to SE-ABSA15, F-1 scores were calculated for Slot1, Slot2 and {Slot1, Slot2} tuples, by comparing the annotations that a system returned to the gold annotations (using micro-averaging). For Slot1 evaluation, duplicate occurrences of categories were ignored in both SB1 and SB2. For Slot2, the calculation for each sentence considered only distinct targets and discarded "null" targets, since they do not correspond to explicit mentions. To evaluate sentiment polarity classification (Slot3) in Phase B, we calculated the accuracy of each system, defined as the number of correctly predicted polarity labels of the (gold) aspect categories, divided by the total number of the gold aspect categories.
Furthermore, we implemented and provided baselines for all slots of SB1 and SB2. In particular, the SE-ABSA15 baselines that were implemented for the English language (Pontiki et al., 2015) were adapted for the other languages by using appropriate stopword lists and tokenization functions. The baselines are briefly discussed below:
SB1-Slot1: For category (E#A) extraction, a Support Vector Machine (SVM) with a linear kernel is trained. In particular, n unigram features are extracted from the respective sentence of each tuple that is encountered in the training data. The category value (e.g., "service#general") of the tuple is used as the correct label of the feature vector. Similarly, for each test sentence s, a feature vector is built and the trained SVM is used to predict the probabilities of assigning each possible category to s (e.g., {"service#general", 0.2}, {"restaurant#general", 0.4}). Then, a threshold t is used to decide which of the categories will be assigned to s. As features, we use the 1,000 most frequent unigrams of the training data excluding stopwords.
SB1-Slot2: The baseline uses the training reviews to create for each category c (e.g., "service#general") a list of OTEs (e.g., "service#general" → {"staff", "waiter"}). These are extracted from the (training) opinion tuples whose category value is c. Then, given a test sentence s and an assigned category c, the baseline finds in s the first occurrence of each OTE of c's list. The OTE slot is filled with the first of the target occurrences found in s. If no target occurrences are found, the slot is assigned the value "null".
SB1-Slot3: For polarity prediction we trained an SVM classifier with a linear kernel. Again, as in Slot1, n unigram features are extracted from the respective sentence of each tuple of the training data. In addition, an integer-valued feature that indicates the category of the tuple is used. The correct label for the extracted training feature vector is the corresponding polarity value (e.g., "positive"). Then, for each tuple {category, OTE} of a test sentence s, a feature vector is built and classified using the trained SVM.
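The thresholding step of SB1-Slot1 and the lookup step of SB1-Slot2 can be sketched as follows. This is a minimal Python illustration, not the Java baseline released with the task. The function names are hypothetical, and several details are assumptions: the threshold comparison is taken to be inclusive, OTE matching is exact and case-sensitive, "the first of the target occurrences" is read as the occurrence with the smallest starting offset (ties go to the longer OTE), and a "null" target is returned without offsets.

```python
from collections import defaultdict


def assign_categories(probabilities, t):
    """Keep every category whose predicted probability reaches threshold t.

    Assumption: the comparison is inclusive (p >= t).
    """
    return sorted(c for c, p in probabilities.items() if p >= t)


def build_ote_lists(training_tuples):
    """Map each category to the set of OTEs observed with it in training.

    training_tuples is an iterable of (category, ote) pairs; "null" targets
    are skipped because they are not explicit mentions.
    """
    ote_lists = defaultdict(set)
    for category, ote in training_tuples:
        if ote != "null":
            ote_lists[category].add(ote)
    return ote_lists


def find_ote(sentence, category, ote_lists):
    """Return (ote, start, end) for the earliest known OTE of the category.

    Assumptions: exact substring matching; the earliest starting offset wins,
    and ties are broken in favour of the longer OTE.
    """
    candidates = []
    for ote in ote_lists.get(category, ()):
        start = sentence.find(ote)  # first occurrence of this OTE in s
        if start != -1:
            candidates.append((start, -len(ote), ote))
    if not candidates:
        return ("null", None, None)
    start, neg_len, ote = min(candidates)
    return (ote, start, start - neg_len)


if __name__ == "__main__":
    lists = build_ote_lists([
        ("service#general", "staff"),
        ("service#general", "waiter"),
        ("food#quality", "pizza"),
    ])
    print(find_ote("The staff was rude but the waiter apologized.",
                   "service#general", lists))
    print(assign_categories({"service#general": 0.2,
                             "restaurant#general": 0.4}, 0.3))
```

Returning the offsets together with the matched string mirrors the definition of an OTE by its starting and ending offsets.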
SB2-Slot1: The sentence-level tuples returned by the SB1 baseline are copied to the text level and duplicates are removed.

SB2-Slot3: For each text-level aspect category c the baseline traverses the predicted sentence-level tuples of the same category returned by the respective SB1 baseline and counts the polarity labels (positive, negative, neutral). Finally, the polarity label with the highest frequency is assigned to the text-level category c. If there are no sentence-level tuples for the same c, the polarity label is determined based on all tuples regardless of c.
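The two SB2 baselines amount to a de-duplication step and a majority vote with a fallback. The sketch below illustrates them in Python under stated assumptions: sentence-level predictions are represented as (category, polarity) pairs, the function names are hypothetical, and ties between equally frequent labels are broken by first occurrence, which the baseline description does not specify.

```python
from collections import Counter


def text_level_categories(sentence_tuples):
    """SB2-Slot1: copy sentence-level categories to the text level.

    Duplicates are removed while the order of first appearance is kept.
    """
    return list(dict.fromkeys(cat for cat, _ in sentence_tuples))


def text_level_polarity(category, sentence_tuples):
    """SB2-Slot3: most frequent polarity among tuples of the same category.

    If no sentence-level tuple has this category, fall back to the most
    frequent polarity over all tuples. Returns None for an empty review.
    Assumption: ties go to the label encountered first.
    """
    same = [pol for cat, pol in sentence_tuples if cat == category]
    labels = same if same else [pol for _, pol in sentence_tuples]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    predicted = [
        ("food#quality", "positive"),
        ("food#quality", "negative"),
        ("food#quality", "positive"),
        ("service#general", "negative"),
        ("ambience#general", "negative"),
    ]
    for cat in text_level_categories(predicted):
        print(cat, text_level_polarity(cat, predicted))
    # A category with no sentence-level tuples uses the fallback.
    print("restaurant#general",
          text_level_polarity("restaurant#general", predicted))
```

The fallback branch matters for overall labels such as "restaurant#general", which are assigned to every review even when no sentence-level tuple carries them.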
The baseline systems and evaluation scripts are implemented in Java and are available for download from the SE-ABSA16 website (http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools). The LibSVM package (Chang and Lin, 2011; http://www.csie.ntu.edu.tw/~cjlin/libsvm/) is used for SVM training and prediction. The scores of the baselines on the test datasets are presented in Section 6 along with the system scores.
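The evaluation measures described in this section can be illustrated with a short Python sketch. It is not the official Java evaluation script. The function names and the input layout (one collection of labels per sentence or text, with gold and predicted lists aligned) are assumptions made for the example.

```python
def micro_f1(gold, predicted, ignore=()):
    """Micro-averaged F-1 over aligned sentences (or texts).

    gold and predicted are lists with one iterable of labels per sentence.
    Labels are compared as sets, so duplicate occurrences are ignored;
    labels listed in `ignore` (e.g., "null" targets for Slot2) are dropped.
    """
    tp = n_pred = n_gold = 0
    for g, p in zip(gold, predicted):
        g = set(g) - set(ignore)
        p = set(p) - set(ignore)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def polarity_accuracy(gold_labels, predicted_labels):
    """Slot3: correctly predicted polarities divided by the number of
    gold aspect categories (one predicted label per gold category)."""
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


if __name__ == "__main__":
    gold = [["food#quality", "service#general", "food#quality"],
            ["restaurant#general"]]
    pred = [["food#quality"],
            ["restaurant#general", "restaurant#prices"]]
    print(micro_f1(gold, pred))
    print(polarity_accuracy(["positive", "negative", "neutral", "positive"],
                            ["positive", "negative", "positive", "positive"]))
```

For Slot2, the same function can be called with `ignore=("null",)` and with targets represented by hashable values such as (start, end) offset pairs.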

Participation
The task attracted in total 245 submissions from 29 teams. The majority of the submissions (216 runs) were for SB1. The newly introduced SB2 attracted 29 submissions from 5 teams in 2 languages (en and es). Most of the submissions (168) were runs for the rest domain. This was expected, mainly for two reasons: first, the rest classification schema is less fine-grained (complex) compared to the other domains (e.g., lapt); second, this domain was supported for 6 languages, which also enabled multilingual or language-agnostic approaches. The remaining submissions were distributed as follows: 54 in lapt, 12 in phns, 7 in came and 4 in hote.
Language   Teams   Submissions
English    27      156
Arabic     3       4
Chinese    3       14
Dutch      4       16
French     5       13
Russian    5       15
Spanish    6       21
Turkish    3       6
All        29      245
lexica) and additional data of any kind could be used for training. In the latter case, the teams had to report the resources used. Delayed submissions (i.e., runs submitted after the deadline and the release of the gold annotations) are marked with "*".
As revealed by the results, in both SB1 and SB2 the majority of the systems surpassed the baseline by a small or large margin and, as expected, the unconstrained systems achieved better results than the constrained ones. In SB1, the teams with the highest scores for Slot1 and Slot2 achieved similar F-1 scores (see Table 3) in most cases (e.g., en/rest, es/rest, du/rest, fr/rest), which suggests that the two slots have a similar level of difficulty. However, as expected, the {Slot1, Slot2} scores were significantly lower, since the linking of the target expressions to the corresponding aspects is also required. The highest scores in SB1 for all slots (Slot1, Slot2, {Slot1, Slot2}, Slot3) were achieved in en/rest; this is probably due to the high participation and to the lower complexity of the rest annotation schema compared to the other domains.
If we compare the results for SB1 and SB2, we notice that the SB2 scores for Slot1 are significantly higher (e.g., en/lapt, en/rest, es/rest) even though the respective annotations are for the same (or almost the same) set of texts. This is because it is easier to identify whether a whole text discusses an aspect c than to find all the sentences in the text that discuss c. On the other hand, for Slot3, the SB2 scores are lower (e.g., en/rest, es/rest, ru/rest, en/lapt) than the respective SB1 scores. This is mainly because an aspect may be discussed at different points in a text and often with different sentiment. In such cases a system has to identify the dominant sentiment, which usually is not trivial.

Conclusions
In its third year, the SemEval ABSA task provided 19 training and 20 testing datasets, from 7 domains and 8 languages, attracting 245 submissions from 29 teams. The use of the same annotation guidelines for domains addressed in different languages gives the opportunity to experiment also with cross-lingual or language-agnostic approaches. In addition, SE-ABSA16 included for the first time a text-level subtask. Future work will address the creation of datasets in more languages and domains and the enrichment of the annotation schemas with other types of SA-related information like topics, events and figures of speech (e.g., irony, metaphor).
Entity Labels: phone, display, keyboard, cpu, ports, memory, power_supply, hard_disk, multimedia_devices, battery, hardware, software, os, warranty, shipping, support, company
Attribute Labels: Same as in Laptops (Table 8), with the exception of portability, which is included in the design_features label and does not apply as a separate attribute type.
Entity Labels: camera, display, keyboard, cpu, ports, memory, power_supply, battery, multimedia_devices, hardware, software, os, warranty, shipping, support, company, lens, photo, focus
Attribute Labels: Same as in Laptops (Table 8).