ZQM at SemEval-2019 Task9: A Single Layer CNN Based on Pre-trained Model for Suggestion Mining

This paper describes our system that competed at SemEval 2019 Task 9 - SubTask A: "Suggestion Mining from Online Reviews and Forums". Our system combines a convolutional neural network with the latest BERT model to conduct suggestion mining. The input of the convolutional neural network is the embedding vectors drawn from the pre-trained BERT model. To enhance the effectiveness of the whole system, the pre-trained BERT model is fine-tuned on the provided datasets before the embedding vectors are extracted. Empirical results show the effectiveness of our model, which obtained 9th position out of 34 teams with an F1 score of 0.715.


Introduction
Suggestion mining is defined as the extraction of suggestions from unstructured text (Negi et al., 2018). It is still a relatively young research area compared to other natural language processing problems such as sentiment analysis (Negi and Buitelaar, 2015). Nevertheless, suggestion mining is of great commercial value for organisations, which can improve the quality of their entities by considering the positive and negative opinions collected from platforms. The target of this task is to automatically classify sentences collected from online reviews and forums into two classes, suggestion and non-suggestion (Negi et al., 2019).
BERT, which stands for Bidirectional Encoder Representations from Transformers, is the latest breakthrough in the field of NLP from Google Research (Devlin et al., 2018). It has substantially advanced the state of the art in many NLP tasks, especially in question answering (Alberti et al., 2019). More importantly, it provides a widely applicable tool for representation learning that generalizes to many NLP tasks.
For this subtask, we first learn word or sentence embeddings using the pre-trained BERT model. The embedding vectors are then extracted from BERT as the input of the subsequent model. It is worth noting that we fine-tune the pre-trained BERT model on the provided dataset before the embedding vectors are extracted. In other words, this part plays the role of a conventional embedding layer, a strategy similar to ELMo (Peters et al., 2018). For the upper layer of the system, a convolutional neural network (CNN) is adopted to process the features. Although CNNs were originally invented for computer vision, they have subsequently been shown to be effective for many NLP tasks (Kim, 2014; Zhang and Wallace, 2015; Dong et al., 2015).
The remainder of the paper is organized as follows. Section 2 describes the detailed architecture of our system. Section 3 reports the experimental results on the given datasets. Finally, conclusions are drawn in Section 4.

System Description
Figure 1 gives a high-level overview of our approach. The implementation mainly consists of the following steps: (1) preprocessing of the raw data, (2) word embedding learning via the BERT model, and (3) feature processing via CNN and sentence classification.

Data Preprocessing
The provided dataset is collected from feedback posts on the Universal Windows Platform and annotated by Negi et al. (2018). However, the text is noisy: it contains some spelling mistakes and a few duplicate samples. To boost the performance of our system, we apply several preprocessing steps to the raw data. First, web links are removed with a regular expression, as they do not contribute to classification accuracy; removing them also lets more meaningful words fit within the finite sentence length. Next, ekphrasis¹, a text processing tool, is used for spelling correction (Baziotis et al., 2017). Then, all characters are converted to lowercase. Finally, duplicated samples are excluded from the dataset.
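The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the URL pattern and the function names `clean_sentence` and `preprocess` are assumptions, spelling correction with ekphrasis is left as a placeholder comment, and duplicates are assumed to be detected on the cleaned sentence text.

```python
import re

# Hypothetical pattern: matches http(s):// and www. links up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def clean_sentence(text):
    """Remove web links, collapse whitespace, and lowercase one sentence."""
    without_links = URL_PATTERN.sub(" ", text)
    # Spelling correction (done with ekphrasis in the paper) would go here.
    return " ".join(without_links.split()).lower()


def preprocess(samples):
    """Clean (sentence, label) pairs and drop duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for sentence, label in samples:
        sentence = clean_sentence(sentence)
        # Skip sentences that are empty after cleaning or already seen.
        if sentence and sentence not in seen:
            seen.add(sentence)
            cleaned.append((sentence, label))
    return cleaned
```

Deduplicating after cleaning means that two posts differing only in case or in an attached link are treated as the same sample.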

Embedding Learning via BERT
An embedding layer usually encodes each word into a fixed-length vector for subsequent processing. Word2vec, GloVe and FastText are among the simplest and most popular word embedding algorithms (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). However, they have some drawbacks, such as a limited ability to encode contextual information. Recently, more effective algorithms have been put forward, such as ELMo and OpenAI GPT (Peters et al., 2018; Radford et al., 2018). These two pre-trained language models can encode rich syntactic and semantic information and distinguish the different meanings of a polysemous word in diverse contexts, which traditional word embedding methods cannot handle well. ELMo leverages the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features. Although an LSTM can capture contextual information, its performance can be limited to some extent on long sequences. To deal with this problem, OpenAI GPT substitutes the Transformer for the LSTM. The Transformer relies entirely on the attention mechanism to capture global dependencies (Vaswani et al., 2017).
¹ github.com/cbaziotis/ekphrasis
BERT is the latest language representation model that also takes advantage of the Transformer. In addition, it uses a masked language model (MLM) objective and bidirectional Transformers to capture contextual information, which has been shown to be effective (Devlin et al., 2018). In our system, we employ BERT as the embedding layer; in other words, we use the output of the last Transformer layer of BERT as the word embedding vectors. The pre-trained version we use is BERT-Base, which has 12 Transformer blocks and a hidden size of 768, so the dimension of the output embedding vectors is also 768. We do not cover BERT in more detail, as it is elaborated on its website².
To boost the performance of our system by making the model better fit our data, a fine-tuning step is conducted before extracting the word embedding vectors from the pre-trained model. The fine-tuning parameters are given in Section 3.2.

Feature processing via CNN
CNNs were originally invented for computer vision, but various CNN models have subsequently been proven effective for many NLP tasks (Kalchbrenner et al., 2014; dos Santos and Gatti, 2014). In our work, we train a single-layer CNN on the word embedding vectors drawn from the BERT model. Let $e_i \in \mathbb{R}^k$ be the $k$-dimensional word embedding vector corresponding to the $i$-th word in the sentence. The vector representation of a sentence is then given by Eq. 1:
$$e_{1:n} = e_1 \oplus e_2 \oplus \cdots \oplus e_n \tag{1}$$
where $n$ is the length of the sentence and $\oplus$ denotes the concatenation operation, which maintains the order of the words in the text. A filter for the one-dimensional convolution is defined as $w \in \mathbb{R}^{m \times k}$, and the convolution can be written as Eq. 2 (Kim, 2014):
$$p_i = f(w \cdot e_{i:i+m-1} + b) \tag{2}$$
where $p_i$ is a scalar that stands for the new local feature generated by the filter from the window of words $e_{i:i+m-1}$; in other words, only the $i$-th to $(i+m-1)$-th words are taken into consideration when generating the $i$-th local feature. $m$ is the filter size, meaning that $m$ words enter the calculation of each local feature, $b$ is a bias, and $f$ is an activation function, which is tanh in our system. In total, a filter generates $n - m + 1$ local features, which are concatenated into a global feature $P$ as in Eq. 3:
$$P = [p_1, p_2, \ldots, p_{n-m+1}] \tag{3}$$
where $P \in \mathbb{R}^{n-m+1}$ is the feature map generated by a filter. A max-over-time pooling operation (Collobert et al., 2011) is then applied to each feature map, which means that only the maximum value of $P$ is retained. If there are $N_f$ filters, $N_f$ maximum values are generated through the pooling operations. These values are organized into a new vector $Q \in \mathbb{R}^{N_f}$ as in Eq. 4:
$$Q = [\max(P_1), \max(P_2), \ldots, \max(P_{N_f})] \tag{4}$$
where $P_j$ denotes the feature map generated by the $j$-th filter.
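The convolution and max-over-time pooling described above can be illustrated with a small pure-Python sketch. It is only an illustration under assumptions: the function names `convolve` and `max_over_time` are hypothetical, the sentence is represented as a list of n embedding rows of length k, each filter as a list of m rows of length k, and n is assumed to be at least m.

```python
import math


def convolve(embeddings, w, b):
    """Slide one m x k filter over an n x k sentence matrix.

    Returns the feature map P with n - m + 1 entries, using tanh as f.
    """
    n, m = len(embeddings), len(w)
    feature_map = []
    for i in range(n - m + 1):
        window = embeddings[i:i + m]  # words i .. i + m - 1
        # Element-wise product of the filter and the window, summed to a scalar.
        total = sum(
            w_val * e_val
            for w_row, e_row in zip(w, window)
            for w_val, e_val in zip(w_row, e_row)
        )
        feature_map.append(math.tanh(total + b))
    return feature_map


def max_over_time(filters, biases, embeddings):
    """Keep the maximum of each filter's feature map, giving Q with N_f entries."""
    return [max(convolve(embeddings, w, b)) for w, b in zip(filters, biases)]
```

Because only the maximum of each feature map is kept, the length of Q depends on the number of filters and not on the sentence length.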

Dense layers
The pooling layer is followed by two dense layers with different numbers of neurons. Dropout (Hinton et al., 2012) is applied before the first dense layer to alleviate overfitting, and we tried different dropout rates to find the best configuration. The output of the pooling layer, Q, is first fed into a dense layer with 200 hidden neurons and a ReLU activation function. The second dense layer has two neurons with a softmax activation function. The final output is the probability of each class for the sample.
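A forward pass through this classification head might look like the sketch below. It is an assumption-laden illustration: all function names are hypothetical, weights are passed in as plain lists (one row per output neuron), and dropout is implemented as inverted dropout, a common variant that the paper does not specify. The dropout rate is assumed to be below 1.

```python
import math
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout (an assumption): zero each value with probability
    `rate` and rescale the kept values by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense(values, weights, biases):
    """Affine layer: one weight row and one bias per output neuron."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(q, hidden_w, hidden_b, out_w, out_b, rate=0.1, rng=None, training=False):
    """Dropout -> dense + ReLU -> dense + softmax over the two classes."""
    rng = rng or random.Random(0)
    hidden = relu(dense(dropout(q, rate, rng, training), hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))
```

In the configuration described here, `hidden_w` would have 200 rows and `out_w` two rows, so `classify` returns two probabilities that sum to one.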

Experiment Results
As mentioned in Section 2.2, we fine-tune BERT before extracting the word embedding vectors. For fine-tuning, most hyperparameters are the same as those of the pre-trained model, with the exception of the batch size, the learning rate and the number of training epochs. The mini-batch size is set to 32, the learning rate to 5e-5, and the number of training epochs to 5. The maximal sentence length is set to 50: a sentence shorter than 50 is padded with zeros, and a longer one is truncated from the tail. For the CNN component, the filter size $m$ is set to 5 and the number of filters $N_f$ is 64. The dropout rates we tried range from 0.1 to 0.7 with a step of 0.1.
The experimental results are shown in Table 2. To demonstrate the effectiveness of the word embeddings derived from BERT, we also employ Word2vec for comparison, for which the best dropout rate is 0.4. Whatever the dropout rate, our model consistently outperforms the other models. The best dropout rate of our model is around 0.1 for the above configuration, although it may vary with other parameter settings such as the filter size and the number of filters.
Table 3 shows the Precision, Recall and F1 score for each class. The system performs better on negative samples than on positive samples, which is consistent with our intuition, since negative samples greatly outnumber positive samples. Our system can therefore learn more features of negative samples, which helps it recognize them accurately.
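The length normalisation described above (maximal length 50, zero padding, truncation from the tail) can be sketched as follows. The function name `pad_or_truncate` is hypothetical, and appending the padding at the end of the sequence is an assumption, since the paper does not state on which side the zeros are added.

```python
MAX_LEN = 50  # maximal sentence length reported in the paper
PAD_ID = 0    # padding value ("padded with zeros")


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return exactly max_len ids: cut the tail of long inputs, zero-pad short ones."""
    clipped = list(token_ids[:max_len])
    # Assumption: padding is appended after the real tokens.
    return clipped + [pad_id] * (max_len - len(clipped))
```

With a sentence length of 50 and a filter size of 5, each filter then produces 50 - 5 + 1 = 46 local features before pooling.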
We also investigate the impact of the number of filters $N_f$ on classification performance, fixing the filter size $m$ at 5 and the dropout rate at 0.1. The results are shown in Figure 2. The model yields the best performance when the number of filters is 64. Fewer filters cannot capture enough information, while too many filters can lead to some information redundancy.
The impact of the filter size on classification accuracy is shown in Figure 3. The most suitable filter size, i.e. the window size of the convolution operation, is $m = 5$. If the window is too small, it is difficult for the system to capture global semantic information, whereas some local semantic information is ignored if the window is too large. Hence, choosing a filter of moderate size helps to improve performance.

Conclusions
In this paper, we have proposed a neural model based on the pre-trained BERT model for the suggestion mining task. Our system learns representations of sentences or words effectively by leveraging BERT. The representations extracted from BERT are then fed into a simple CNN layer. Experimental results show that our system is effective on the given dataset.
As for future work, it is necessary to tackle the imbalance between positive and negative samples through oversampling or undersampling. We also intend to study new ways of incorporating the BERT model, such as extracting not only the output of the last Transformer layer but also the outputs of other Transformer layers and integrating them with different weights.
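As one possible reading of the oversampling idea mentioned above, the sketch below duplicates randomly chosen samples of the smaller class until both classes have the same size. This is not part of the submitted system; the function name `oversample_minority` and the fixed seed are assumptions made for illustration.

```python
import random


def oversample_minority(samples, seed=0):
    """Duplicate randomly chosen minority-class samples until the classes match in size."""
    rng = random.Random(seed)
    by_label = {}
    for sentence, label in samples:
        by_label.setdefault(label, []).append((sentence, label))
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Such resampling is typically applied to the training split only, so that evaluation data keeps its original class distribution.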