Building Large-Scale English and Korean Datasets for Aspect-Level Sentiment Analysis in Automotive Domain

We release large-scale datasets of users' comments in two languages, English and Korean, for aspect-level sentiment analysis in the automotive domain. The datasets consist of 58,000+ comment-aspect pairs, making them the largest among existing datasets. In addition, this work covers a new language (i.e., Korean) along with English for aspect-level sentiment analysis. We build the datasets from the automotive domain to enable users (e.g., marketers at automotive companies) to analyze the voice of customers on automobiles. We also provide baseline performances for future work by evaluating recent models on the released datasets.


Introduction
Aspect-level sentiment analysis (ALSA) has been actively studied to understand authors' opinions on aspects mentioned in texts. For example, in the text "Although the space is smaller than most, it is the best service you will find in even the largest restaurants", the author's sentiment toward space and service is negative and positive, respectively. Since devising deep learning models for ALSA has recently received substantial attention (Zeng et al., 2019), building large-scale datasets in different languages has been an essential line of research (Rosenthal et al., 2017). However, due to the high cost of human annotation, the size and language coverage of existing datasets are still limited. Specifically, only one of the public datasets contains more than 20,000 instances, and existing datasets cover only three languages (i.e., English, Spanish, and Arabic).
To this end, we release large-scale datasets of users' comments in two languages, English and Korean, from the automotive domain. The total size of these datasets is 58,603, which is the largest among existing datasets for ALSA. In addition, the datasets include a new language (i.e., Korean), extending the language coverage of ALSA datasets. We focus on the automotive domain to enable users (e.g., marketers at automotive companies) to analyze the voice of customers on aspects related to automobiles.
To build the datasets, domain experts defined the 12 largest automotive manufacturers (e.g., Ford) by production volume and their popular automobiles (e.g., Mustang) as aspects. Given the aspects, we collected users' comments from automotive communities in the United States and South Korea. To annotate aspect-level sentiments, we performed crowdsourcing, assigning at least three annotators to each comment-aspect pair. The annotated datasets consist of 28,571 and 30,032 comment-aspect pairs in English and Korean, respectively. Inter-annotator agreement is 0.36 (fair agreement) for the English dataset and 0.54 (moderate agreement) for the Korean dataset in terms of Fleiss' kappa.
We perform extensive experiments with deep learning models for ALSA to provide the baseline performance on the released datasets for future work. The datasets are publicly available at our website.

Related Datasets
Researchers in ALSA have built labeled datasets, as supervised learning has been the major approach to tackling ALSA. Table 1 summarizes prominent datasets for ALSA. Mitchell et al. (2013) annotated sentiments for aspects related to organizations and persons. Dong et al. (2014) annotated tweets that include aspects related to celebrities and products. Semantic Evaluation (SemEval) in 2014 (Pontiki et al., 2014) built two English datasets from the laptop and restaurant domains. SemEval in 2017 (Rosenthal et al., 2017) built the largest dataset, consisting of English tweets, and a non-English dataset consisting of Arabic tweets, with popular events as aspects, such as named entities (e.g., iPhone) and geopolitical entities (e.g., Palestine). Another work built a dataset that includes multiple aspects in each text (i.e., tweet), with aspects related to the UK election (e.g., greens and labour). Similarly, Jiang et al. (2019) included multiple aspects in each text, with aspects related to restaurants (e.g., food and service). We note that a recent work built a dataset for ALSA in Korean (Song et al., 2019), but we omit it from our comparison as that dataset is not publicly available.
In this work, we release large-scale datasets consisting of users' comments in English and Korean from the automotive domain. The total size of our datasets is 58,000+, which is the largest among the datasets above. The datasets also include a new non-English language (i.e., Korean) in addition to Spanish and Arabic. We believe that the released datasets further relieve the lack of large-scale and non-English datasets for ALSA.

Aspect Definition

To cover a wide range of the automotive domain, experts from Hyundai, a Korean automotive company, defined the 12 largest automotive manufacturers (e.g., Ford) by production volume and their popular automotive models (e.g., Mustang) as aspects in both English and Korean. The predefined list contains 12 automotive manufacturers and 341 automotive models. Table 2 shows examples of the aspects in English.

Data Collection
We selected two online communities specializing in automobiles to collect users' comments: Reddit for English comments and Bobae-Dream for Korean comments. From these communities, we crawled about 0.3M English comments and 1M Korean comments, where each comment contains at least one of the predefined aspects. We note that the number of English comments was relatively small because Reddit restricts viewing of old posts. We randomly sampled about 30,000 comments per language for annotation.
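The aspect-mention filtering described above can be sketched as follows. This is a minimal illustration, not the authors' released code; the aspect list is a hypothetical subset of the predefined list, and word-boundary matching as shown works for English (Korean text would need language-specific tokenization).

```python
import re

# Hypothetical subset of the predefined aspect list (manufacturers and models).
ASPECTS = ["Ford", "Mustang", "Hyundai", "Sonata"]

def extract_pairs(comments, aspects=ASPECTS):
    """Return a (comment, aspect) pair for every predefined aspect mentioned in a comment."""
    patterns = {a: re.compile(r"\b" + re.escape(a) + r"\b", re.IGNORECASE) for a in aspects}
    pairs = []
    for comment in comments:
        for aspect, pat in patterns.items():
            if pat.search(comment):
                pairs.append((comment, aspect))
    return pairs

pairs = extract_pairs(["The Mustang is loud but fun.", "Ford makes the Mustang."])
```

A comment mentioning several aspects yields several comment-aspect pairs, which matches the unit of annotation used in this work.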

Annotation Using Crowdsourcing
We performed crowdsourcing to annotate a sentiment for each comment-aspect pair. For English comments, we used CrowdFlower for the annotation, following Rosenthal et al. (2017), as shown in Figure 1. For Korean comments, we designed our own web pages for the annotation because no crowdsourcing service was available in South Korea. Annotators for English comments were native English speakers on CrowdFlower, and annotators for Korean comments were native Korean speakers at POSTECH, a university in South Korea.

The instructions shown to annotators (Figure 1) read: "In this job, you will be presented with comments and entities that come from a car community. Each entity is either a car model or a car manufacturer. Review the comment to determine the author's sentiment toward an entity. This task is different from assigning a sentiment at the sentence level: consider the sentiment toward the entity in the given comment, not the overall sentiment of the sentence."

Annotators were asked to choose one of four choices (i.e., positive, neutral, negative, and wrong entity) for each comment-aspect pair (Figure 1). The wrong entity choice was designed to filter out comments that do not include automotive aspects. For example, Morning, an automobile model, can be used as a general word, as in "I went to a car repair shop this morning". In this case, the correct choice is wrong entity because morning does not refer to an automotive aspect. We also guided annotators to select the neutral sentiment when a given comment-aspect pair does not belong to any of the other choices (i.e., positive, negative, and wrong entity). For quality control, we evaluated annotators with hidden tests, which are comment-aspect pairs annotated by us, and rejected annotators who missed a large number of the tests. Each comment-aspect pair was annotated by at least three annotators.
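The hidden-test quality control can be sketched as follows. The accuracy threshold and data layout are illustrative assumptions; the paper does not report the exact rejection criterion.

```python
def keep_annotator(answers, gold, min_accuracy=0.7):
    """Keep an annotator only if their accuracy on the hidden test pairs
    (comment-aspect pairs pre-annotated by the authors) meets a threshold.
    The 0.7 threshold is an assumption for illustration."""
    attempted = sum(k in answers for k in gold)
    if attempted == 0:
        return False
    correct = sum(answers[k] == v for k, v in gold.items() if k in answers)
    return correct / attempted >= min_accuracy
```

Annotators falling below the threshold would be rejected and their labels discarded.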

We used the majority voting scheme to consolidate the annotations for each comment-aspect pair, following Rosenthal et al. (2015). The inter-annotator agreements are 0.36 (fair agreement) for the English dataset and 0.54 (moderate agreement) for the Korean dataset in terms of Fleiss' kappa. We speculate that the lower agreement for the English dataset is due to the difficulty of quality control: CrowdFlower allocated a large number of annotators (3,172) compared to the small number of annotators (10) for the Korean dataset. Lastly, we excluded the comment-aspect pairs labeled as wrong entity because they are irrelevant to ALSA.

Table 3 shows the statistics of the annotated datasets in English and Korean. The numbers of aspects after the annotation are 128 and 219 in English and Korean, respectively. We randomly divided each annotated dataset into training data (80%) and test data (20%).
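The label consolidation and agreement computation can be sketched as follows: a minimal implementation of majority voting and Fleiss' kappa, assuming every pair receives the same number of annotations (here three). This is an illustration, not the authors' code; tie handling is an assumption.

```python
from collections import Counter

LABELS = ["positive", "neutral", "negative", "wrong entity"]

def majority_vote(annotations):
    """Return the majority label, or None on a tie (such pairs could be discarded)."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def fleiss_kappa(items, labels=LABELS):
    """Fleiss' kappa for items each annotated by the same number of raters."""
    n = len(items[0])   # raters per item
    N = len(items)      # number of items
    p = {l: 0.0 for l in labels}   # overall label proportions p_j
    P_bar = 0.0                    # mean per-item agreement
    for ann in items:
        c = Counter(ann)
        P_bar += (sum(v * v for v in c.values()) - n) / (n * (n - 1))
        for l, v in c.items():
            p[l] += v / (N * n)
    P_bar /= N
    P_e = sum(v * v for v in p.values())   # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Values of 0.36 and 0.54 fall into the fair (0.21-0.40) and moderate (0.41-0.60) bands of the commonly used Landis-Koch interpretation.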

Experimental Settings
Evaluation Protocol. We used accuracy and macro-F1, which have been the major metrics for evaluating ALSA models (Li et al., 2018; Zeng et al., 2019). We randomly sampled 10% of the training data as validation data. We also ran each model 10 times and report the mean and standard deviation.

Baseline Models. We selected deep learning models including BERT-based models (i.e., AEN-BERT and LCF-BERT) and non-BERT-based models (i.e., the other models in Table 4). For the BERT-based models, we used the original BERT (Devlin et al., 2019) for English and the multilingual BERT for Korean. We pretrained word2vec (Mikolov et al., 2013) on the English and Korean corpora crawled in this work to obtain 100-dimensional word embedding vectors for each language, and used them to initialize words for the non-BERT-based models. For the BERT-based models, we used the pretrained word embedding vectors included in the BERTs (i.e., original and multilingual BERT).
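The two evaluation metrics can be sketched as follows: a minimal implementation of accuracy and macro-F1 over the three sentiment labels (wrong entity pairs are excluded before evaluation, as described above).

```python
def accuracy(y_true, y_pred):
    """Fraction of comment-aspect pairs whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=("positive", "neutral", "negative")):
    """Unweighted mean of per-class F1 scores, so rare classes count equally."""
    f1s = []
    for l in labels:
        tp = sum(t == l and p == l for t, p in zip(y_true, y_pred))
        fp = sum(t != l and p == l for t, p in zip(y_true, y_pred))
        fn = sum(t == l and p != l for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro-F1 complements accuracy because the class distribution is skewed: a model that always predicts the majority sentiment scores well on accuracy but poorly on macro-F1.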

Performance Analysis
In Table 4, we provide the classification performance of the baseline models on the released datasets. On the English dataset, the best performing model is LCF-BERT, which indicates the importance of designing ALSA models based on BERT. However, on the Korean dataset, non-BERT-based models (i.e., TD-LSTM and TNet-LF) show the best performance. We speculate that this is because the multilingual BERT is inferior to the original BERT; to verify this, we investigate the performance of LCF-BERT with the multilingual BERT instead of the original BERT on the English dataset. LCF-BERT with the multilingual BERT produces 64.88% accuracy and 54.39% macro-F1 on the English dataset, both lower than those of the original LCF-BERT in Table 4. This result indicates that pretraining BERT only on the target language is important for obtaining better performance on a dataset in that language. Thus, future work should pretrain BERT on a large-scale Korean corpus to obtain higher performance on the released Korean dataset.

Conclusion
We release large-scale datasets consisting of 58,000+ comment-aspect pairs in English and Korean from the automotive domain. The total size of the datasets is currently the largest, and the datasets include a new non-English language (i.e., Korean) for ALSA. For future work, we also provide baseline performances on the released datasets using deep learning models for ALSA.