DBMS-KU at SemEval-2019 Task 9: Exploring Machine Learning Approaches in Classifying Text as Suggestion or Non-Suggestion

This paper describes the participation of the DBMS-KU team in SemEval-2019 Task 9, that is, suggestion mining from online reviews and forums. To deal with this task, we explore several machine learning approaches, i.e., Random Forest (RF), Logistic Regression (LR), Multinomial Naive Bayes (MNB), Linear Support Vector Classification (LSVC), Sublinear Support Vector Classification (SSVC), Convolutional Neural Network (CNN), and Variable Length Chromosome Genetic Algorithm-Naive Bayes (VLCGA-NB). Our system obtains F1-Scores of 0.47 and 0.37 on the evaluation data in Subtask A and Subtask B, respectively. In particular, our results outperform the baseline in Subtask A. Interestingly, the results suggest that our system performs well in classifying the Non-suggestion class.


Introduction
Nowadays, a huge number of texts are posted in online reviews and discussion forums. Such media can be a valuable source of suggestions about products or services (Negi and Buitelaar, 2015). The obtained suggestions are not only useful for readers but also important information for stakeholders. Indeed, such advice can be used to improve the quality of products or to give helpful recommendations (Brun and Hagege, 2013). However, identifying suggestions in a large number of reviews or comments requires extra effort and time. Moreover, such online texts are mostly in unstructured form (Negi et al., 2018; Negi and Buitelaar, 2017). Thus, automatically mining suggestions from given texts is both challenging and significant.
Suggestion mining is a relatively new research interest in text classification (Negi and Buitelaar, 2015). Several studies have begun to mine suggestions from online texts (Negi et al., 2018; Negi and Buitelaar, 2017; Negi, 2016; Negi and Buitelaar, 2015; Brun and Hagege, 2013; Ramanand et al., 2010; Dong et al., 2013; Myaeng, 2012, 2013). In particular, (Negi and Buitelaar, 2015; Brun and Hagege, 2013; Ramanand et al., 2010) tried to identify suggestions from customer reviews. Meanwhile, (Negi, 2016; Dong et al., 2013; Myaeng, 2012, 2013) mined such advice from Twitter or discussion forum datasets. Then, (Negi et al., 2018; Negi and Buitelaar, 2017) utilized WikiHow and open domain corpora for their work. However, they concluded that it is not easy to identify suggestion texts automatically. In other words, there is still room to improve the classification results in the suggestion mining task. The task of suggestion mining from online reviews and forums, namely, Task 9 (Negi et al., 2019), was organized at the International Workshop on Semantic Evaluation 2019 (SemEval-2019).
This paper describes the participation of the DBMS-KU team in both Subtask A and Subtask B of Task 9 of SemEval-2019 (Negi et al., 2019). To address these two Subtasks, we utilize several approaches, namely, Random Forest (RF), Logistic Regression (LR), Multinomial Naive Bayes (MNB), Linear Support Vector Classification (LSVC), Sublinear Support Vector Classification (SSVC), Convolutional Neural Network (CNN), and Variable Length Chromosome Genetic Algorithm-Naive Bayes (VLCGA-NB). The results of our experiments are encouraging and show a promising improvement in identifying Suggestion and Non-suggestion.
The rest of this paper is organized as follows. Section 2 explains the problem definition, problem formulation, and dataset. Section 3 presents the tools and libraries used in this work. Section 4 describes our employed methods. Section 5 presents our experiments, which consist of data preprocessing, parameters, and evaluation measurement. Section 6 discusses our obtained results. Finally, we conclude this work in Section 7.

Problem Definition
Suggestion mining is a binary classification problem: it is the task of labeling sentences as Suggestion or Non-suggestion. However, suggestion sentences can have a very broad meaning. Thus, the domain and scope of the suggestion text classification should be described. Task 9 of SemEval-2019 consists of Subtask A and Subtask B, which classify suggestions in intra-domain and cross-domain settings, respectively (Negi et al., 2019).

Problem Formulation
Suggestion text classification consists of assigning a value in {Suggestion, Non-suggestion} to each pair $(s_i, l_j) \in S \times L$, where $S$ is the set of sentences and $L = \{l_1, \ldots, l_n\}$ is a set of $n$ predefined labels. Each sentence is classified into either the Suggestion or the Non-suggestion class.

Datasets
The dataset used in Task 9 of SemEval-2019 is divided into training, trial, and evaluation parts (Negi et al., 2019). It consists of three columns: id, sentence, and label (see Table 1). The provided dataset is imbalanced: overall, the Non-suggestion class is larger than the Suggestion class.
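As a minimal sketch of how the class imbalance can be inspected, the snippet below reads rows in the id, sentence, label layout and reports the share of each class. The sample rows are hypothetical, and the encoding of the label column (1 for Suggestion, 0 for Non-suggestion) is an assumption made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the id, sentence, label layout described above.
# Assumption: label 1 = Suggestion, label 0 = Non-suggestion.
SAMPLE_CSV = """id,sentence,label
1,"Please add an option to export reports.",1
2,"I installed the update yesterday.",0
3,"The app crashes when I rotate the screen.",0
4,"It worked fine on my old phone.",0
"""

def class_distribution(csv_text):
    """Return the fraction of rows belonging to each label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(int(row["label"]) for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

distribution = class_distribution(SAMPLE_CSV)
```

With a distribution like this, a classifier that always predicts Non-suggestion already reaches high accuracy, which is why per-class measures matter for this task.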

Tools and Libraries
Common classification methods, such as Support Vector Machine, Random Forest, Logistic Regression, and Naive Bayes, are provided by scikit-learn (Pedregosa et al., 2011), a widely used machine learning library. As its name suggests, NLTK (Bird et al., 2009) is used as the toolkit for Natural Language Processing (NLP) operations such as tokenization, stemming, metrics, corpus handling, and classification. Pandas (McKinney, 2010) is chosen as the tool for collecting and formatting the data because of its ease of use. The Keras library (Chollet et al., 2015), running on top of TensorFlow (Abadi et al., 2015), is used for building high-level neural networks, i.e., the Convolutional Neural Network in this work. Furthermore, we use Seaborn and Matplotlib (Hunter, 2007) to visualize confusion matrices.

Classification Methods
This section details the classification methods used in our experiments on Suggestion classification.

Baseline
The baseline method provided by the organizer relies on matching suggestion keywords, pattern strings, and Part-Of-Speech (POS) tags. The suggestion keywords used in the baseline are "suggest", "recommend", "hopefully", "go for", "request", "it would be nice", "adding", "should come with", "should be able", "could come with", "i need", "we need", "needs", "would like to", "would love to", "allow", and "add". The baseline also uses the wish identification pattern strings from (Goldberg et al., 2009). In addition, each word in the sentence is POS-tagged, and only the Modal and Verb tags are collected. Classification is done by checking all words in the sentence. If the sentence contains one of the three matches, it is classified as the Suggestion class.
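The keyword part of this baseline can be sketched as follows. This is only an illustration of the keyword check, not the organizer's code: the pattern-string and POS-tag checks are omitted, and matching on whole tokens (rather than substrings) is an assumption. The helper names `tokenize`, `contains_phrase`, and `is_suggestion` are hypothetical.

```python
import re

# Keyword list quoted from the baseline description above.
SUGGESTION_KEYWORDS = [
    "suggest", "recommend", "hopefully", "go for", "request",
    "it would be nice", "adding", "should come with", "should be able",
    "could come with", "i need", "we need", "needs", "would like to",
    "would love to", "allow", "add",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))

def is_suggestion(sentence, keywords=SUGGESTION_KEYWORDS):
    """Keyword-only check; assumes whole-token matching."""
    tokens = tokenize(sentence)
    return any(contains_phrase(tokens, tokenize(k)) for k in keywords)
```

For example, `is_suggestion("It would be nice to have a dark mode.")` matches the phrase "it would be nice", whereas "The battery lasts two days." contains none of the keywords.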

Common Classification Methods
Common classification methods, such as Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), and Naive Bayes (NB), are employed in this research. Two types of SVM are utilized, that is, Linear Support Vector Classification (LSVC) and Sublinear Support Vector Classification (SSVC). The implementation of the common classification methods is available at (Fatyanosa, 2019c).
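To illustrate what one of these classifiers does with a sentence, the following is a dependency-free sketch of Multinomial Naive Bayes over word counts with add-one smoothing. It is not the scikit-learn implementation used in the experiments; the class name `TinyMultinomialNB`, the whitespace tokenization, and the toy training sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in sentence.lower().split():
                if word in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data (hypothetical sentences, not from the task dataset).
train_sentences = [
    "please add a dark mode",
    "you should allow custom themes",
    "the app crashed twice today",
    "i bought this phone last week",
]
train_labels = ["Suggestion", "Suggestion", "Non-suggestion", "Non-suggestion"]
model = TinyMultinomialNB().fit(train_sentences, train_labels)
```

Here `model.predict("please allow dark themes")` favors Suggestion because every known word was seen only in Suggestion sentences.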

Variable Length Chromosome Genetic Algorithm-Naive Bayes
Variable Length Chromosome Genetic Algorithm-Naive Bayes (VLCGA-NB) is utilized for feature selection. We follow the model and parameters from (Fatyanosa et al., 2018). The first step of VLCGA-NB is selecting initial features from keywords that appear in the Suggestion sentences but do not appear in the Non-suggestion ones in the training data. Up to a randomly determined maximum chromosome size, these keywords are then randomly selected as genes in the Genetic Algorithm (GA). Therefore, the chromosomes within the population have different lengths and different genes. The initial population then evolves over the generations through the crossover, mutation, and selection operators. The number of children produced by the crossover and mutation operators is determined by the Crossover Rate (CR) and Mutation Rate (MR). Two types of crossover are utilized in this research, viz., Union Crossover and Intersection Crossover. Mutation replaces a gene with another feature that is not already contained in the chromosome.
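The initialization and crossover steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from (Fatyanosa et al., 2018): chromosomes are represented as sets of keywords, Union and Intersection Crossover are taken to be plain set union and intersection, and the function names and toy sentences are hypothetical.

```python
import random

def candidate_features(suggestion_sents, non_suggestion_sents):
    """Words seen in Suggestion sentences but never in Non-suggestion ones."""
    sugg = {w for s in suggestion_sents for w in s.lower().split()}
    non = {w for s in non_suggestion_sents for w in s.lower().split()}
    return sorted(sugg - non)

def random_chromosome(features, max_size, rng):
    """Draw a variable-length chromosome: a random subset of the features."""
    size = rng.randint(1, min(max_size, len(features)))
    return frozenset(rng.sample(features, size))

def union_crossover(a, b):
    # Assumption: the child carries every gene of both parents.
    return a | b

def intersection_crossover(a, b):
    # Assumption: the child carries only the genes shared by both parents.
    return a & b

features = candidate_features(
    ["please add dark mode", "you should allow themes"],
    ["the app should work", "dark screen please"],
)
rng = random.Random(0)
chromosome = random_chromosome(features, max_size=3, rng=rng)
```

Because each chromosome draws its own size, the population naturally contains individuals of different lengths, and the two crossover types respectively grow and shrink the children.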
The gene to be mutated within a chromosome is selected by comparing a generated random value with the MR. If the random value is higher than MR, then the gene is mutated. The purpose of the crossover operator is to help the algorithm explore the search space, while the purpose of the mutation operator is to exploit a certain area of the search space. These operators maintain diversity within the population, which helps to avoid premature convergence. Using elitist selection, which ranks individuals by their fitness value, the population for the next generation is selected from the prior population and the produced children.
Only the highest-ranked chromosomes, up to the population size, are selected. These operators are iterated until the maximum number of generations is reached. The best chromosome produced in the last generation is then used as the Suggestion keywords in the baseline code provided by the organizer. GA, a well-known evolutionary algorithm, is a powerful stochastic heuristic. It is widely used because it provides search space exploration through the crossover operator and exploitation through the mutation operator. Thus, GA can search a very wide search space and produce nearly optimal results. This ability motivates our use of GA for feature selection. However, a drawback of GA is that it is not guaranteed to find the global optimum, only a satisfactory result. Moreover, GA requires parameter tuning to find appropriate parameters for the dataset and needs a longer runtime. Despite these drawbacks, we expect that GA can produce a limited number of Suggestion features that contribute substantially to the Suggestion classification in this work. The implementation of VLCGA-NB is available at (Fatyanosa, 2019b).
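The mutation rule and elitist selection described in this section can be sketched as below. The sketch follows the rule as stated in the text (a gene is mutated when the random value is higher than MR); representing chromosomes as lists and using `len` as a stand-in fitness are assumptions for illustration only, since the actual fitness comes from Naive Bayes classification.

```python
import random

def mutate(chromosome, all_features, mutation_rate, rng):
    """Replace selected genes with features not yet in the chromosome."""
    unused = [f for f in all_features if f not in chromosome]
    mutated = []
    for gene in chromosome:
        # Rule as stated in the text: mutate when the random draw exceeds MR.
        if unused and rng.random() > mutation_rate:
            mutated.append(unused.pop(rng.randrange(len(unused))))
        else:
            mutated.append(gene)
    return mutated

def elitist_selection(parents, children, fitness, population_size):
    """Keep the top-ranked chromosomes among parents and children."""
    ranked = sorted(parents + children, key=fitness, reverse=True)
    return ranked[:population_size]

features = ["add", "allow", "mode", "themes", "you"]
rng = random.Random(1)
unchanged = mutate(["add", "allow"], features, 1.0, rng)  # a draw never exceeds 1.0
replaced = mutate(["add", "allow"], features, 0.0, rng)
survivors = elitist_selection(
    parents=[["add"], ["add", "mode", "you"]],
    children=[["add", "you"]],
    fitness=len,  # toy stand-in for the Naive Bayes based fitness
    population_size=2,
)
```

Elitist selection guarantees that the best chromosome found so far is never lost between generations, since parents compete with their children for a place in the next population.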

CNN
For our purpose in this work, we follow the Keras model architecture from (Chollet, 2017), as shown in Figure 1. This architecture tends to obtain high accuracy when applied to the Newsgroup dataset. Text classification using CNN is done in four steps. First, all sentences are converted into sequences of word indices. Only the 20,000 most frequent words are considered, with an upper limit of 1,000 words per sequence. Next, 100-dimensional Global Vectors for Word Representation (GloVe) embeddings are used as the embedding matrix. Then, this matrix is loaded into the Embedding layer of Keras. Finally, the Softmax function is used in the final layer of the CNN. The implementation of the CNN is available at (Fatyanosa, 2019a). Although there are several pre-trained word vectors, GloVe and word2vec are considered the most popular (Lee et al., 2016). According to (Pennington et al., 2014), GloVe vectors outperform other word representations in terms of word comparison, correlation, and named entity recognition. We thus use GloVe vectors as our pre-trained word embeddings in this work.
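The first step, converting sentences into fixed-length sequences of word indices, can be sketched without any deep learning library as follows. This is not the Keras tokenizer used in the experiments: the function names are hypothetical, whitespace tokenization and zero-padding at the end of the sequence are assumptions, and only the two limits (20,000 words, length 1,000) come from the text.

```python
from collections import Counter

MAX_NUM_WORDS = 20000       # vocabulary cap stated in the text
MAX_SEQUENCE_LENGTH = 1000  # sequence length cap stated in the text

def build_word_index(sentences, max_words=MAX_NUM_WORDS):
    """Map the most frequent words to indices 1..max_words (0 = padding)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(max_words))}

def to_padded_sequence(sentence, word_index, max_len=MAX_SEQUENCE_LENGTH):
    """Convert a sentence to word indices, dropping unknown words,
    truncating to max_len, and padding with zeros at the end."""
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

# Toy usage with small limits so the result is easy to read.
word_index = build_word_index(["add dark mode", "add more themes"], max_words=3)
sequence = to_padded_sequence("add add unknown", word_index, max_len=5)
```

Each index then selects one row of the embedding matrix, so a sentence becomes a fixed-size matrix of 100-dimensional vectors that the convolutional layers can process.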

Experiments
We conducted experiments with the seven classification methods. We employed the datasets from both Subtasks to evaluate the performance of those classification methods, and compared their performance against the baseline provided by the organizer.
Figure 1: CNN Architecture from (Chollet, 2017)
The preprocessing for VLCGA-NB differed from that of the other classification methods because it did not need a vector representation of words. All words were converted to lowercase. Numbers, stop words, punctuation, non-English words, non-alphabetic characters, and words shorter than two characters were removed; lemmatization was performed; and all contractions were replaced by their full words or phrases. The number of features decreased to 493 after the preprocessing step.

Parameters
To apply CNN to the classification, we used fixed parameters with maximumEpoch = 10 and batchSize = 100.
We also employed fixed parameters for VLCGA-NB with PopulationSize = 100, GenerationSize = 50, CrossoverRate = 0.7, and MutationRate = 0.3. RF, CNN, and VLCGA-NB are stochastic algorithms, which means that the results differ for each run. Therefore, those algorithms were run five times. The best result among the five runs was selected for comparison with the other algorithms.

Evaluation Measurement
Accuracy is often considered a suitable measure for evaluating classifier performance. However, the datasets from both subtasks were imbalanced. A classifier tends to favor the majority class, so a high accuracy can be achieved simply by predicting that class. Therefore, in this research, we used Precision, Recall, and F1-Score as the main evaluation measures. Accuracy (Equation (1)) was still used in the fitness function of VLCGA-NB. The F1-Score (Equation (4)) relies on Precision (Equation (2)) and Recall (Equation (3)). As the Suggestion class was of primary interest, the evaluation measure of this competition was the F1-Score of the Suggestion class.
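Assuming the standard confusion-matrix definitions, with TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives for the class being evaluated, the four measures referred to as Equations (1) to (4) can be written as:

```latex
\begin{align}
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{2} \\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{3} \\
\text{F1-Score} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}
\end{align}
```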
Evaluation of VLCGA-NB was performed in every generation using a fitness function based on the result of Naive Bayes classification.
The fitness value was computed as the sum of the accuracy, the F1-Score of the Suggestion class, and the F1-Score of the Non-suggestion class, defined as follows:
$$\mathrm{Fitness} = \mathrm{Accuracy} + \text{F1-Score}_{\text{suggestion}} + \text{F1-Score}_{\text{non-suggestion}} \tag{5}$$
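The fitness computation can be sketched as follows, using the standard definitions of accuracy and per-class F1-Score. The function names and the label encoding (1 for Suggestion, 0 for Non-suggestion) are assumptions made for illustration; returning 0.0 when a denominator is zero is also a choice of this sketch.

```python
def f1_score(y_true, y_pred, positive):
    """F1-Score of the class `positive`; 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def fitness(y_true, y_pred, suggestion=1, non_suggestion=0):
    """Equation (5): accuracy plus the F1-Scores of both classes."""
    return (accuracy(y_true, y_pred)
            + f1_score(y_true, y_pred, suggestion)
            + f1_score(y_true, y_pred, non_suggestion))

# Toy example: one Suggestion sentence is missed by the classifier.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
value = fitness(y_true, y_pred)
```

In this toy case the accuracy is 0.75, the Suggestion F1-Score is 2/3, and the Non-suggestion F1-Score is 0.8, so the fitness is about 2.22 out of a maximum of 3. Summing both per-class F1-Scores keeps the fitness from being dominated by the majority class.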

Results and Discussion
In this section, we evaluate the classification performance of the seven classification methods against the baseline for both Subtasks. Performance was evaluated using Precision, Recall, and F1-Score, except that VLCGA-NB still used accuracy in its fitness function. A typical observation from the confusion matrices produced in this research was that, for the Non-suggestion class, the number of correct classifications was higher than the number of misclassifications, except for the baseline in Subtask A. This result was consistent across all methods, with only small differences in the overall classification counts. Although Subtask A aimed to classify Suggestion within the same domain, the number of correct classifications of the Suggestion class was lower than the number of misclassifications for most of the classification methods. In particular, the baseline and SSVC obtained better results than the other methods. Furthermore, as Subtask B aimed to classify Suggestion in a different domain, it was hard for all classification methods to obtain even fair results. All of them failed to obtain a higher number of correct classifications for the Suggestion class.
Table 2 shows the Precision, Recall, and F1-Score comparisons for each class in Subtask A. All classification methods outperformed the baseline for the Non-suggestion class. MNB yielded the best result with 0.95.
In terms of the F1-Score of the Suggestion class (see Table 2), RF, SSVC, and VLCGA-NB obtained competitive results, outperforming the baseline. The highest F1-Score was obtained by SSVC at 0.47. RF and VLCGA-NB produced F1-Scores of 0.29 and 0.31, respectively. Overall, MNB and SSVC obtained the best F1-Scores for the Non-suggestion and Suggestion classes, respectively.
Table 3 shows the experimental results for the dataset in Subtask B. The F1-Score of the Non-suggestion class was good, with the highest F1-Score obtained by the RF classifier. However, the F1-Score of the Suggestion class was poor, except for the baseline. The baseline performed surprisingly well on the Suggestion class considering its simplicity, yielding an F1-Score of 0.73. A possible reason for this finding is that the manually selected keywords and patterns, chosen because they are commonly used to suggest something, capture common Suggestion phrasings that a machine might not be able to discover. A possible problem with the baseline approach is that the keywords and patterns for the Suggestion class might also be used in the Non-suggestion class. Therefore, it might be difficult for the baseline to determine which keywords and patterns are actually used in Suggestion sentences. This is supported by the baseline's F1-Score for the Non-suggestion class in Subtask A, which was the lowest at 0.59.
Regarding the number of features selected by VLCGA-NB, the features were reduced from 493 to 372. In line with the expectation stated in Section 4.3, VLCGA-NB was able to produce a limited number of features that contribute substantially to the Suggestion classification. This is supported by its F1-Score, which was higher than that of the baseline in Subtask A.

Conclusion
This paper has described our approach for participating in both Subtask A and Subtask B of Task 9 of SemEval-2019, that is, suggestion mining from online reviews and forums (Negi et al., 2019). Our approach explored and compared various classification methods, namely, Random Forest, Logistic Regression, Multinomial NB, Linear SVC, Sublinear SVC, CNN, and VLCGA-NB. Since the datasets provided by the organizer were imbalanced, it was more important to correctly classify a sentence as the Suggestion class. Thus, the F1-Score of the Suggestion class was given more weight. Compared to the baseline, all algorithms performed better for the Non-suggestion class in both Subtask A and Subtask B. In contrast, they performed worse than the baseline for the Suggestion class in both Subtask A and Subtask B, with the exception of RF, SSVC, and VLCGA-NB, which outperformed the baseline for the Suggestion class in Subtask A. Based on our results, we observed that, besides the imbalanced data, the implicit meaning associated with the Suggestion class is also a challenge in suggestion mining. Feature selection tailored to the Suggestion class will be the focus of our future work. In addition, it might be valuable to further inspect the use of our approach in other text classification tasks such as the identification of deceptive opinions (Siagian and Aritsugi, 2017, 2018) and fake news.