ThisIsCompetition at SemEval-2019 Task 9: BERT is unstable for out-of-domain samples

This paper describes our system, Joint Encoders for Stable Suggestion Inference (JESSI), for the SemEval 2019 Task 9: Suggestion Mining from Online Reviews and Forums. JESSI is a combination of two sentence encoders: (a) one using multiple pre-trained word embeddings learned from log-bilinear regression (GloVe) and translation (CoVe) models, and (b) one on top of word encodings from a pre-trained deep bidirectional transformer (BERT). We include a domain adversarial training module when training for out-of-domain samples. Our experiments show that while BERT performs exceptionally well for in-domain samples, several runs of the model show that it is unstable for out-of-domain samples. The problem is mitigated tremendously by (1) combining BERT with a non-BERT encoder, and (2) using an RNN-based classifier on top of BERT. Our final models obtained second place with 77.78% F-Score on Subtask A (i.e. in-domain) and achieved an F-Score of 79.59% on Subtask B (i.e. out-of-domain), even without using any additional external data.


Introduction
Opinion mining (Pang and Lee, 2007) is a broad field covering many NLP tasks, ranging from sentiment analysis (Liu, 2012) and aspect extraction (Mukherjee and Liu, 2012) to opinion summarization (Ku et al., 2006), among others. Despite the vast literature on opinion mining, the task of suggestion mining has received little attention. Suggestion mining (Brun and Hagège, 2013) is the task of collecting and categorizing suggestions about a certain product. This is important because while opinions indirectly give hints on how to improve a product (e.g., by analyzing reviews), suggestions are direct improvement requests (e.g., tips, advice, recommendations) from people who have used the product.
To this end, Negi et al. (2019) organized a shared task specifically on suggestion mining called SemEval 2019 Task 9: Suggestion Mining from Online Reviews and Forums. The shared task is composed of two subtasks, Subtask A and B. In Subtask A, systems are tasked to predict whether a sentence from a certain domain (i.e., electronics) entails a suggestion or not, given training data from the same domain. In Subtask B, systems are tasked to do suggestion prediction on sentences from another domain (i.e., hotels). The organizers observed four main challenges: (a) sparse occurrences of suggestions; (b) figurative expressions; (c) different domains; and (d) complex sentences. While previous attempts (Ramanand et al., 2010; Brun and Hagège, 2013; Negi and Buitelaar, 2015) made use of human-engineered features to solve this problem, the goal of the shared task is to leverage the advancements seen in neural networks, by providing a larger dataset to be used by data-intensive models to achieve better performance.
This paper describes our system JESSI (Joint Encoders for Stable Suggestion Inference). JESSI is built as a combination of two neural-based encoders using multiple pre-trained word embeddings, including BERT (Devlin et al., 2018), a pre-trained deep bidirectional transformer recently reported to perform exceptionally well across several tasks. The main intuition behind JESSI comes from our finding that although BERT gives exceptional performance gains when applied to in-domain samples, it becomes unstable when applied to out-of-domain samples, even when using a domain adversarial training (Ganin et al., 2016) module. This problem is mitigated using two tricks: (1) jointly training BERT with a CNN-based encoder, and (2) using an RNN-based encoder on top of BERT before feeding to the classifier.
JESSI is trained using only the datasets given in the shared task, without any additional external data. Despite this, JESSI placed second on Subtask A with an F1-score of 77.78% among 33 team submissions. It also performs well on Subtask B with an F1-score of 79.59%.

Related Work
Suggestion Mining The task of detecting suggestions in sentences is relatively new, first mentioned in Ramanand et al. (2010) and formally defined in Negi and Buitelaar (2015). Early systems used manually engineered patterns (Ramanand et al., 2010) and rules (Brun and Hagège, 2013), as well as linguistically motivated features (Negi and Buitelaar, 2015) trained on a supervised classifier (Negi et al., 2016). Automatic extraction of suggestions has also been explored (Dong et al., 2013). Despite the recent successes of neural-based models, only a few attempts have been made, using neural network classifiers such as CNNs and LSTMs (Negi et al., 2016), and using part-of-speech embeddings to induce distant supervision (Negi and Buitelaar, 2017). Since neural networks are data-hungry models, a large dataset is necessary to optimize their parameters. SemEval 2019 Task 9 (Negi et al., 2019) enables the training of deeper neural models by providing a much larger training dataset.

Domain Adaptation
In text classification, training and test data distributions can differ, and thus domain adaptation techniques are used. These include non-neural methods that map the semantics between domains by aligning the vocabulary (Basili et al., 2009; Pan et al., 2010) and generating labeled samples (Wan, 2009; Yu and Jiang, 2016). Neural methods include the use of stacked denoising autoencoders (Glorot et al., 2011) and variational autoencoders (Saito et al., 2017; Ruder and Plank, 2018). Our model uses a domain adversarial training module (Ganin et al., 2016), an elegant way to effectively transfer knowledge between domains by training a separate domain classifier with an adversarial objective.
Language Model Pretraining Inspired by the computer vision field, where ImageNet (Deng et al., 2009) is used to pretrain models for other tasks (Huh et al., 2016), many recent attempts in the NLP community have successfully used language modeling as a pretraining step to extract feature representations (Peters et al., 2018) and to fine-tune NLP models (Radford et al., 2018; Devlin et al., 2018). BERT (Devlin et al., 2018) is the most recent addition to these models; it uses a deep bidirectional transformer trained on masked language modeling and next-sentence prediction objectives. Devlin et al. (2018) reported that BERT achieves significant improvements on many NLP tasks, and subsequent studies have shown that BERT is also effective on harder tasks such as open-domain question answering (Yang et al., 2019), multiple relation extraction (Wang et al., 2019), and table question answering (Hwang et al., 2019), among others. In this paper, we also use BERT as an encoder, show its problem on out-of-domain samples, and mitigate the problem using multiple tricks.
Joint Encoders for Stable Suggestion Inference

JESSI consists of four components: (1) a BERT-based encoder that leverages a large pre-trained language model, (2) a CNN-based encoder that learns task-specific sentence representations, (3) an MLP classifier that predicts the label given the joint encodings, and (4) a domain adversarial training module that prevents the model from distinguishing between the two domains.
BERT-based Encoder Fine-tuning a pre-trained BERT (Devlin et al., 2018) classifier and then using the separately produced classification encoding [CLS] has been shown to produce significant improvements. Differently, JESSI uses a pre-trained BERT as a word encoder; that is, instead of using [CLS], we use the word encodings e_1, ..., e_n produced by BERT. BERT is still fine-tuned during training.
We append a sentence encoder on top of BERT that returns a sentence representation s^(b), which differs per subtask. For Subtask A, we use a CNN encoder with max pooling (Kim, 2014) to create the sentence embedding. For Subtask B, we use bidirectional simple recurrent units (BiSRU; Lei et al., 2018), a type of RNN that is highly parallelizable, as the sentence encoder.
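Since the SRU may be less familiar than the LSTM, its parallelizable structure can be sketched as follows. This is a minimal NumPy sketch of one direction, following a common formulation of Lei et al. (2018); the weight names and toy dimensions are our own, and a real BiSRU would run this in both directions and concatenate the outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(X, W, Wf, bf, Wr, br):
    """One direction of a simple recurrent unit (SRU) layer: all matrix
    multiplications depend only on x_t, so they can be computed in parallel
    across time steps; only the lightweight elementwise recurrence on the
    internal state c_t is sequential."""
    U = X @ W                                # candidate states, parallel
    F = sigmoid(X @ Wf + bf)                 # forget gates, parallel
    R = sigmoid(X @ Wr + br)                 # reset gates, parallel
    c = np.zeros(W.shape[1])
    H = np.empty_like(U)
    for t in range(len(X)):                  # cheap elementwise recurrence
        c = F[t] * c + (1 - F[t]) * U[t]
        H[t] = R[t] * c + (1 - R[t]) * X[t]  # highway connection to input
    return H

rng = np.random.default_rng(2)
n, d = 5, 6                                  # 5 time steps, toy dim 6
X = rng.standard_normal((n, d))
W, Wf, Wr = (rng.standard_normal((d, d)) for _ in range(3))
bf = br = np.zeros(d)
H = sru_forward(X, W, Wf, bf, Wr, br)
print(H.shape)  # (5, 6)
```

Because the heavy matrix products are hoisted out of the time loop, the sequential part is elementwise only, which is what makes the SRU highly parallelizable compared to an LSTM.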

CNN-based Encoder
To make the final classifier more task-specific, we use a CNN-based encoder that is trained from scratch. Specifically, we employ a concatenation of both pre-trained GloVe (Pennington et al., 2014) and CoVe (McCann et al., 2017) word embeddings as input w_i, 1 ≤ i ≤ n. Then, we apply convolution operations Conv(w_i, h_j) using multiple filter sizes h_j over windows of h_j words. We use different paddings for different filter sizes such that each convolution operation produces n outputs. Finally, we concatenate the outputs across filter sizes to obtain the word encodings, i.e., e_i^(c) = ⊕_j Conv(w_i, h_j), where ⊕ is the sequence concatenate operation.
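The convolution-and-concatenation step above can be sketched as follows. This is a NumPy sketch with toy dimensions and hypothetical filter sizes; the "same" padding scheme shown is one plausible choice that keeps the output length at n, as the text requires.

```python
import numpy as np

def conv_same(W, X):
    """1D convolution over word vectors X (n x d) with a filter bank W
    (k x h x d): k filters, each spanning a window of h words. 'Same'
    padding keeps the output length equal to n."""
    n, d = X.shape
    k, h, _ = W.shape
    left = (h - 1) // 2                   # pad so every position has a full window
    Xp = np.pad(X, ((left, h - 1 - left), (0, 0)))
    out = np.empty((n, k))
    for i in range(n):
        window = Xp[i:i + h]              # h x d slice centered on word i
        out[i] = np.tensordot(W, window)  # k filter responses for word i
    return out

rng = np.random.default_rng(0)
n, d = 7, 8                  # 7 words, toy 8-dim GloVe+CoVe vectors
X = rng.standard_normal((n, d))
filter_sizes = [3, 5]        # hypothetical h_j values
banks = [rng.standard_normal((4, h, d)) for h in filter_sizes]
# concatenate outputs across filter sizes -> word encodings e_i
E = np.concatenate([conv_same(W, X) for W in banks], axis=1)
print(E.shape)  # (7, 8): n words, 4 filters per size x 2 sizes
```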
We pool the word encodings using an attention mechanism to create a sentence representation s^(c). That is, we calculate attention weights a_i using a latent variable v that measures the importance of each word encoding e_i^(c), i.e., a_i = softmax(v^T f(e_i^(c))), where f(·) is a nonlinear function. We then use a_i to weight-sum the words into one encoding, i.e., s^(c) = Σ_i a_i e_i^(c).

Suggestion Classifier Finally, we use a multilayer perceptron (MLP) as our classifier, using a concatenation of the outputs from both the BERT- and CNN-based encoders, i.e., p(y) = MLP_y([s^(b); s^(c)]). Training is done by minimizing the cross-entropy loss, i.e., L = − log p(y).
Domain Adversarial Training For Subtask B, the model needs to be able to classify out-of-domain samples. Using the model as is decreases performance significantly because of cross-domain differences. To this end, we use a domain adversarial training module (Ganin et al., 2016) to prevent the classifier from distinguishing differences between domains. Specifically, we create another MLP classifier that classifies the domain of the text using the concatenated sentence encoding passed through a gradient reversal function GradRev(·), i.e., p(d) = MLP_d(GradRev([s^(b); s^(c)])). The gradient reversal function behaves as the identity function when propagating forward, but reverses the sign of the gradient when propagating backward.
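The behavior of the gradient reversal function can be illustrated with a minimal framework-free sketch; in practice this would be implemented as a custom operation in an autograd engine, with lambda following the schedule described below.

```python
import numpy as np

class GradRev:
    """Gradient reversal layer (Ganin et al., 2016) as a minimal sketch:
    the forward pass is the identity, while the backward pass flips the
    sign of the incoming gradient (scaled by lambda). With this layer in
    front of the domain classifier, the encoder is updated to *worsen*
    domain discrimination, making the encodings domain-invariant."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                      # identity going forward

    def backward(self, grad_out):
        return -self.lam * grad_out   # reversed, scaled gradient going back

layer = GradRev(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(layer.forward(x), x)   # input passes through unchanged
print(layer.backward(np.ones(3)))         # [-0.5 -0.5 -0.5]
```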
Through this, we eliminate the classifier's ability to distinguish the domains of the text. We train the domain classifier using the available trial datasets for each domain, and we also use a cross-entropy loss as the objective of this classifier. Overall, the objective of JESSI is to minimize the following loss: L = − log p(y) − λ log p(d), where λ is increased after each epoch, following Ganin et al. (2016).

We use dropout (Srivastava et al., 2014) on all nonlinear connections with a dropout rate of 0.5, and an l2 norm constraint of 3. During training, we use a mini-batch size of 32. Training is done via stochastic gradient descent over shuffled mini-batches with the Adadelta (Zeiler, 2012) update rule, with early stopping based on the trial sets. Moreover, since the training set is relatively small, multiple runs lead to different results. To handle this, we perform the following ensembling method: we first run 10-fold validation over the training data, resulting in ten different models; we then pick the three models with the highest performance and, for each example, choose the class predicted by the most models.
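The ensembling procedure above can be sketched as follows; the scores, predictions, and helper name here are hypothetical toy values, not the actual fold results.

```python
from collections import Counter

def ensemble_predict(fold_scores, fold_preds, k=3):
    """Top-k majority-vote ensembling: given validation scores for the
    ten fold models and their per-example predictions, keep the k
    best-scoring models and take a majority vote per example."""
    top = sorted(range(len(fold_scores)),
                 key=lambda i: fold_scores[i], reverse=True)[:k]
    final = []
    for j in range(len(fold_preds[0])):
        votes = Counter(fold_preds[i][j] for i in top)
        final.append(votes.most_common(1)[0][0])  # most-voted class
    return final

# ten folds' validation F-scores and predictions for 4 examples (toy data)
scores = [0.71, 0.75, 0.60, 0.78, 0.66, 0.73, 0.69, 0.77, 0.64, 0.70]
preds = [[1, 0, 1, 0]] * 5 + [[1, 1, 0, 0]] * 5
print(ensemble_predict(scores, preds, k=3))  # [1, 0, 1, 0]
```

Here the top three folds are those scoring 0.78, 0.77, and 0.75, and each example's label is whichever class two of those three models agree on.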

Experiments
In this section, we show our results and experiments. We denote JESSI-A as our model for Subtask A (i.e., BERT→CNN+CNN→ATT), and JESSI-B as our model for Subtask B (i.e., BERT→BISRU+CNN→ATT+DOMADV). The performance of the models is measured and compared using the F1-score.

Ablation Studies We present in Table 2 ablations on our models. Specifically, we compare JESSI-A with the same model, but without the CNN-based encoder, without the BERT-based encoder, and with the CNN sentence encoder of the BERT-based encoder replaced with the BiSRU variant. We also compare JESSI-B with the same model, but without the CNN-based encoder, without the BERT-based encoder, without the domain adversarial training module, and with the BiSRU sentence encoder of the BERT-based encoder replaced with the CNN variant. (Table 2 reports the ablation results for both subtasks using the provided trial sets; a + denotes a replacement of the BERT-based encoder, while a − denotes the removal of a specific component.)

The ablation studies show several observations. First, jointly combining both the BERT- and CNN-based encoders helps improve the performance on both subtasks. Second, the more effective sentence encoder for the BERT-based encoder (i.e., CNN versus BiSRU) differs per subtask: the CNN variant is better for Subtask A, while the BiSRU variant is better for Subtask B. Finally, the domain adversarial training module is crucial in achieving a significant increase in performance.
Out-of-Domain Performance During our experiments, we noticed that BERT is unstable when predicting out-of-domain samples, even when using the domain adversarial training module. We show in Table 3 the summary statistics of the F-Scores of 10 runs of different models on the trial set of Subtask B when doing 10-fold validation over the available training data; all models include the domain adversarial training module (+DOMADV), which is omitted for brevity. BERT alone performs unstably, achieving varying F-Scores as low as zero and as high as 70.59, with a standard deviation of 31. Appending a CNN-based sentence encoder (i.e., BERT→CNN) increases the performance, but worsens the stability of the model. Appending an RNN-based sentence encoder (i.e., BERT→BISRU) both increases the performance and improves the model stability. Finally, combining a separate CNN-based encoder (i.e., JESSI-B) improves the performance and stability further.

Test Set Results

Table 4 presents how JESSI compares to the top-performing models during the competition proper. Overall, JESSI-A ranks second out of 33 official submissions with an F-Score of 77.78%. Although we were not able to submit JESSI-B during the submission phase, it achieves an F-Score of 79.59% on the official test set, similar to the performance of the model that obtained sixth place in the competition. We emphasize that JESSI does not use any labeled or external data for Subtask B, and is thus exposed to the hotels domain only through the available unlabeled trial dataset, containing 808 data instances. We expect the model to perform better when additional data from the hotels domain is available.
Performance by Length We compare the performance of the models on data of varying lengths to further investigate JESSI's increase in performance over the other models. More specifically, for each range of sentence lengths (e.g., from 10 to 20), we look at the accuracy of JESSI-A, BERT→BISRU, and BERT→CNN on Subtask A, and the accuracy of JESSI-B, BERT→BISRU, and BERT→CNN, all with the domain adversarial training module, on Subtask B. Figure 2 shows the plots of the experiments on both subtasks. In both experiments, JESSI outperforms the other models when the sentence length is short, suggesting that the increase in performance of JESSI can be attributed to its performance on short sentences. This is more evident in Subtask B, where the difference in accuracy between JESSI and the next best model is approximately 20%. We can also see a consistent increase in performance of JESSI over the other models on Subtask B, which shows the robustness of JESSI when predicting out-of-domain samples.
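This per-length analysis can be reproduced with a simple bucketing helper; the sentences, labels, bucket width, and helper name below are toy illustrations, not the shared-task data.

```python
from collections import defaultdict

def accuracy_by_length(sentences, gold, pred, bucket=10):
    """Group sentences into length buckets (e.g., 10-20 words) and compute
    per-bucket accuracy, mirroring the analysis described in the text."""
    hit, tot = defaultdict(int), defaultdict(int)
    for s, g, p in zip(sentences, gold, pred):
        b = (len(s.split()) // bucket) * bucket  # bucket lower bound
        tot[b] += 1
        hit[b] += int(g == p)
    return {f"{b}-{b + bucket}": hit[b] / tot[b] for b in sorted(tot)}

sents = ["short sentence here", "a much longer sentence " * 3, "tiny"]
result = accuracy_by_length(sents, gold=[1, 0, 1], pred=[1, 1, 1])
print(result)  # {'0-10': 1.0, '10-20': 0.0}
```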

Conclusion
We presented JESSI (Joint Encoders for Stable Suggestion Inference), our system for SemEval 2019 Task 9: Suggestion Mining from Online Reviews and Forums. JESSI builds upon jointly combined encoders, borrowing pre-trained knowledge from a language model (BERT) and a translation model (CoVe). We found that BERT alone performs poorly and unstably when tested on out-of-domain samples. We mitigate the problem by appending an RNN-based sentence encoder above BERT and jointly combining a CNN-based encoder. Results from the shared task show that JESSI performs competitively among participating models, obtaining second place on Subtask A with an F-Score of 77.78%. It also performs well on Subtask B, with an F-Score of 79.59%, even without using any additional external data.