Zoho at SemEval-2019 Task 9: Semi-supervised Domain Adaptation using Tri-training for Suggestion Mining

This paper describes our submission for the SemEval-2019 Suggestion Mining task. A simple Convolutional Neural Network (CNN) classifier with contextual word representations from a pre-trained language model was used for sentence classification. The model is trained using tri-training, a semi-supervised bootstrapping mechanism for labelling unseen data. Tri-training proved to be an effective technique to accommodate domain shift for cross-domain suggestion mining (Subtask B), where there is no hand-labelled training data. For in-domain evaluation (Subtask A), we use the same technique to augment the training set. Our system ranks thirteenth in Subtask A with an F1-score of 68.07 and third in Subtask B with an F1-score of 81.94.


Introduction
Task 9 of SemEval-2019 (Negi et al., 2019) focuses on mining sentences that contain suggestions in online discussions and reviews. Suggestion Mining is modelled as a sentence classification task with two Subtasks:
• Subtask A evaluates the classifier performance on a technical domain specific setting.
• Subtask B evaluates the domain adaptability of a model by doing cross-domain suggestion classification on hotel reviews.
We approached this task as an opportunity to test the effectiveness of transfer learning and semi-supervised learning techniques. In Subtask A, the high class imbalance and the relatively small size of the training data made it an ideal setup for evaluating the efficacy of recent transfer learning techniques. Using pre-trained language models for contextual word representations has been shown to improve many Natural Language Processing (NLP) tasks (Ruder and Howard, 2018; Radford, 2018; Devlin et al., 2018). This transfer learning technique is also effective when little labelled data is available, as shown by Ruder and Howard (2018). In this work, we use the BERT model (Devlin et al., 2018) to obtain contextual representations. This results in improved scores even for simple baseline classifiers.
Subtask B requires the system to not use manually labelled data and hence lends itself to a classic semi-supervised learning scenario. Many methods have been proposed for domain adaptation in NLP (Blitzer et al., 2007; Chen et al., 2011; Chen and Cardie, 2018; Zhou and Li, 2005; Blum and Mitchell, 1998). We use a label bootstrapping technique called tri-training (Zhou and Li, 2005), with which unlabelled samples are labelled iteratively with increasing confidence at each training iteration (explained in Section 2.4). Ruder and Plank (2018) show the effectiveness of tri-training for baseline deep neural models in text classification under domain shift. They also propose a multi-task approach for tri-training; however, we only adapt the classic tri-training procedure to the suggestion mining task.
The submitted system and the experiments are explained in detail in the following sections. Section 2 describes the components of the system. Section 3 then details the experiments, results and ablation studies that were performed.

System Description
The models and the training procedures are built using the AllenNLP library (Gardner et al., 2018). All the code to replicate our experiments is public and can be accessed from https://github.com/sai-prasanna/suggestion-mining-semeval19.

Data cleaning and pre-processing
Basic data pre-processing is done to normalize whitespace and to remove noisy symbols and accents. Very short sentences with fewer than four words are excluded from training.
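The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration and not the code from our repository: the function names are hypothetical, whitespace tokenization is assumed for the word count, and the removal of noisy symbols is left out because the set of symbols is not specified here.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean(text):
    """Apply accent removal followed by whitespace normalization."""
    return normalize_whitespace(strip_accents(text))


def keep_for_training(text, min_words=4):
    """Keep a sentence only if it has at least min_words whitespace tokens."""
    return len(clean(text).split()) >= min_words
```

With these definitions, a three-word sentence is dropped from training while a four-word sentence is kept.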

Word Representations
We use GloVe word representations (Pennington et al., 2014) and compare the performance improvement that we obtain with pre-trained BERT representations (Devlin et al., 2018).

Suggestion Classification
Our baseline classifier is a Deep Averaging Network (DAN) (Iyyer et al., 2015). DAN is a neural bag-of-words model that is considered a strong baseline for text classification. In DAN, a sentence representation is obtained by averaging the word-level representations and is fed to a series of rectified linear unit (ReLU) layers with a final softmax layer.
A simple Convolutional Neural Network (CNN) text classifier (Kim, 2014) is used for the final submission.
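To make the DAN computation concrete, the forward pass can be sketched in plain Python as below. This is an illustrative toy under stated assumptions, not the AllenNLP implementation used in our experiments: the helper names are hypothetical, weights are passed in as nested lists, and dropout and training are omitted.

```python
import math


def average(vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def linear(x, weights, bias):
    """Affine map: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(word_vectors, hidden_layers, output_layer):
    """Average the word vectors, apply ReLU layers, then a softmax layer."""
    h = average(word_vectors)
    for weights, bias in hidden_layers:
        h = relu(linear(h, weights, bias))
    out_weights, out_bias = output_layer
    return softmax(linear(h, out_weights, out_bias))


# Toy usage: two 2-d "word vectors", one hidden layer, two output classes.
probs = dan_forward(
    [[1.0, 0.0], [0.0, 1.0]],
    hidden_layers=[([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0])],
    output_layer=([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
)
```

Because the word vectors are averaged before any non-linearity, word order has no effect on the output, which is what makes DAN a bag-of-words model.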

Training
We use the classic tri-training procedure for label bootstrapping as described by Ruder and Plank (2018). Consider a labelled dataset L from the source domain S and an unlabelled dataset U from the target domain T. The objective of tri-training is to label U iteratively and to augment L with the newly labelled examples. Three CNN + BERT classifiers M_1, M_2, M_3 are trained separately on subsets of L, namely l_1, l_2, l_3 respectively. These subsets are obtained from L using bootstrap sampling with replacement.
These models are used to predict labels for the unlabelled set U. A prediction on which two models agree is treated as a new training example for the third model in the next iteration. For example, an unlabelled sentence U_1 ∈ U is added as a labelled example to l_1 if and only if M_2 and M_3 agree on the label for U_1. In the same way, l_2 is updated with newly labelled data whose labels are agreed upon by M_1 and M_3, and so on. This constitutes a single iteration of tri-training. The procedure used to train our models is given in Algorithm 1.
In this way, the original training data is augmented with three newly labelled subsets, which are used in the next training iteration. At the end of each iteration, the validation F1-score is calculated from the predictions obtained through a majority vote of the three models. The procedure continues until the validation score stops improving.
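Algorithm 1: Tri-training. In each iteration and for each model M_i, an unlabelled example is added to the training data of M_i when the other two models M_p and M_q, where p, q ≠ i, agree on its label.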
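The training loop described in this section can be sketched in Python as follows. This is a simplified illustration under several assumptions, not our released code: `tri_train`, `train_fn`, `score_fn` and `train_centroid` are hypothetical names, each model is retrained on the full labelled set plus its pseudo-labelled examples, and a one-dimensional nearest-centroid classifier stands in for the CNN + BERT model.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Sample len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(models):
    """Return a predictor that outputs the most common label among models."""
    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict


def tri_train(labelled, unlabelled, train_fn, score_fn, max_iters=10, seed=0):
    """Classic tri-training; stops when the validation score stops improving."""
    rng = random.Random(seed)
    # Initial models are trained on bootstrap samples of the labelled data.
    models = [train_fn(bootstrap_sample(labelled, rng)) for _ in range(3)]
    best = majority_vote(models)
    best_score = score_fn(best)
    for _ in range(max_iters):
        new_models = []
        for i in range(3):
            p, q = [j for j in range(3) if j != i]
            # Pseudo-label the examples on which the other two models agree.
            pseudo = []
            for x in unlabelled:
                label = models[p](x)
                if label == models[q](x):
                    pseudo.append((x, label))
            # Assumption: retrain on all labelled data plus the new examples.
            new_models.append(train_fn(labelled + pseudo))
        models = new_models
        candidate = majority_vote(models)
        score = score_fn(candidate)
        if score <= best_score:
            break  # no improvement on validation data
        best, best_score = candidate, score
    return best


def train_centroid(examples):
    """Toy stand-in classifier: nearest class centroid on 1-d inputs."""
    sums, counts = Counter(), Counter()
    for x, y in examples:
        sums[y] += x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))
```

Here `train_fn` maps a list of `(input, label)` pairs to a prediction function, and `score_fn` maps a prediction function to a validation score, so the same loop applies to any classifier that fits this interface.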

Experiments and Results
This section details the various experiments that were performed using the above components for our submissions.

Data
The test set provided during the trial phase of the evaluation is used as the validation data for all our experiments. For those experiments that do not involve tri-training, we only use the provided labelled data from the technical domain for training.
In Subtask B, for those experiments that involve tri-training, L is the same as mentioned above. U here is obtained in two ways:
• Unlabelled data from the final test set of Subtask B.
• Unlabelled sentences from the Yelp hotel reviews dataset (Blomo et al., 2013).
The results reported are the means and confidence intervals of Precision, Recall and F1-score over five runs of the same experiment with different random seeds.

Input
For input representations, we use 300d GloVe vectors with a dropout (Srivastava et al., 2014) of 0.2 for regularization. We also experiment with the pre-trained BERT base uncased model. The BERT model is not fine-tuned during our training. A dropout of 0.5 is applied to the 768d representations obtained from BERT.

Baseline Deep Averaging Net
Our neural baseline is the Deep Averaging Net (DAN) (Section 2.3). When used with GloVe, the hidden sizes of DAN are 300, 150, 75, and 2 respectively. When BERT representations are used, the hidden sizes of the network are 768, 324, 162, and 2 respectively. We report an F1-score of 60.82 when DAN is used with BERT in Subtask A and 70.49 in Subtask B. Both these scores are significant improvements over those obtained with GloVe representations (Table 1). We retain the same configuration of the BERT embedding layer for the other experiments. Training is performed with the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 1e−3 for all the models.

CNN Classifier
The CNN classifier is composed of four 1-D convolution layers with filter widths ranging from two to five. Each convolutional layer has 192 filters. The output from each layer is max-pooled over the sequence (time) dimension. This results in four 192d vectors, which are concatenated to get a 768d output.
The max-pooled outputs are passed through four fully connected feed-forward layers with hidden dimensions of 768, 324, 162, and 2 respectively. The intermediate layers use ReLU activation and the final layer is a softmax layer. We use a dropout of 0.2 on all layers of the feed-forward network except for the final layer.
Without tri-training, this model obtains an absolute improvement of ≈ 4% F1-score over DAN in Subtask A. However, in Subtask B it performs worse than the baseline DAN model, with an F1-score of 64.31. This decrease in performance could be because of overfitting on the source domain due to the larger number of parameters in the CNN compared to DAN.
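The difference in model size between the two classifiers can be estimated with a short calculation. The sketch below is a back-of-the-envelope count under assumptions that are not stated in the text: every layer has a bias term, the convolutions are applied directly to the 768d BERT vectors, and the frozen BERT parameters are not counted. The function names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def feed_forward_params(input_dim, layer_sizes):
    """Total parameters of a stack of fully connected layers."""
    total, n_in = 0, input_dim
    for n_out in layer_sizes:
        total += linear_params(n_in, n_out)
        n_in = n_out
    return total


def conv1d_params(in_channels, n_filters, width):
    """Weights plus biases of one 1-D convolution layer."""
    return in_channels * n_filters * width + n_filters


BERT_DIM = 768
FF_SIZES = [768, 324, 162, 2]   # layer sizes reported for BERT inputs
N_FILTERS = 192
FILTER_WIDTHS = (2, 3, 4, 5)

# DAN: averaged BERT vector fed straight into the feed-forward stack.
dan_total = feed_forward_params(BERT_DIM, FF_SIZES)

# CNN: four parallel convolutions, concatenated to 4 * 192 = 768 dimensions,
# followed by the same feed-forward stack.
conv_total = sum(conv1d_params(BERT_DIM, N_FILTERS, w) for w in FILTER_WIDTHS)
cnn_total = conv_total + feed_forward_params(len(FILTER_WIDTHS) * N_FILTERS, FF_SIZES)

print(dan_total)  # 892724
print(cnn_total)  # 2957876
```

Under these assumptions the CNN has roughly three times as many trainable parameters as DAN, all of the difference coming from the convolution layers.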

Tri-training
The aim of tri-training is domain adaptation by labelling unseen data from a new domain. For Subtask B, the CNN + BERT model achieves an F1-score of 82.19 when trained with the tri-training procedure given in Algorithm 1. Tri-training is used to label the 824 unlabelled sentences from the test set of Subtask B, which are then added to the original training data. This score is a large improvement over the same classifier trained only on the given data, which gets an F1-score of 64.31.
We also do the same experiment using 5000 unlabelled sentences from the Yelp hotel reviews dataset (Blomo et al., 2013). The model obtains a similar score of 81.98, which indicates the importance of tri-training in domain adaptation.
For Subtask A, we get an improvement in the F1-score using tri-training; however, the increase is not as pronounced as the one we observe for Subtask B. We compare the statistical significance of the different models and experiments in Section 3.7.

Upsampling
We also wanted to find out how the class balance in the dataset affected our model performance. The class distributions of the datasets, including the test set distribution that was obtained after the final evaluation phase, are given in Table 2. The original training data has a class imbalance, with only 23% of the sentences labelled as suggestions. We tried to balance the labels by naive upsampling, i.e., adding duplicates of sentences that are labelled as suggestions. This allowed us to have a balanced training dataset for our experiments. This resulted in consistent gains over the original dataset during the trial evaluation phase.
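Naive upsampling of this kind can be sketched as below. This is an illustration only: the function name and the `(sentence, label)` tuple format are hypothetical, and cycling through the positive examples in order is an assumption about how duplicates are chosen.

```python
from itertools import cycle, islice


def upsample_minority(examples, positive_label=1):
    """Duplicate positive examples until both classes have equal counts."""
    positives = [ex for ex in examples if ex[1] == positive_label]
    negatives = [ex for ex in examples if ex[1] != positive_label]
    deficit = len(negatives) - len(positives)
    if not positives or deficit <= 0:
        return list(examples)  # nothing to duplicate or already balanced
    # Cycle through the positives so duplicates are spread evenly.
    duplicates = list(islice(cycle(positives), deficit))
    return list(examples) + duplicates
```

For a dataset with 23 positive and 77 negative sentences, this adds 54 duplicated positives so that each class has 77 examples.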
However, during the final submission for Subtask A, we found that the model's performance on the test set did not correlate well with that on the validation set, as shown in Table 1. This could be because the percentage of positive labels in the test set is only 10%, while the validation set has 50%.
Experiments without upsampling give better performance on the test set even though there is a decrease in the validation score. For Subtask B, however, upsampling actually increased the model performance. In hindsight, this could be because of the similar distribution of class labels in the validation and test sets.
The submitted models received an F1-score of 68.07 in Subtask A and 81.03 in Subtask B.

Statistical Significance Test
Reichart et al. (2018) suggest methods to measure whether two models have statistically significant differences in their predictions on a single dataset. We use McNemar's test, a non-parametric significance test that they recommend for binary classification. Pairwise comparisons of a few of our models are reported in Table 3. The table contains the p-values under the null hypothesis that two models do not differ significantly in their label predictions. In other words, a small p-value for an experiment pair denotes a significant disagreement between the predictions of the two models. For example, in Table 3, the DAN + GloVe and DAN + BERT models have a p-value less than 0.05 in both subtasks. This indicates that there is significant disagreement between the predictions of the two models. Since DAN + BERT gets a better F1-score and p < 0.05, the improvement is unlikely to have been obtained by chance.
We use majority voting from five random seeds to get the final predictions on the test set for the paired significance testing.
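For illustration, McNemar's test on two sets of predictions can be sketched as follows. This sketch uses the exact (binomial) two-sided form of the test, which is an assumption: the text does not say whether the exact or the chi-squared variant was used. The function name is hypothetical.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same data."""
    triples = list(zip(gold, pred_a, pred_b))
    # b: model A correct and model B wrong; c: model A wrong and model B correct.
    b = sum(1 for g, a, p in triples if a == g and p != g)
    c = sum(1 for g, a, p in triples if a != g and p == g)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the examples on which exactly one of the two models is correct contribute to the statistic, so two models with very different F1-scores can still yield a large p-value if they disagree on few examples.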

Conclusion
We discussed our experiments on suggestion mining using tri-training. Tri-training combined with BERT representations proved to be an effective technique for semi-supervised learning, especially in a cross-domain setting. Future work could explore more optimal ways of doing tri-training, evaluate the effect of contextual representations on tri-training convergence, and try more sophisticated classification architectures that may include different attention mechanisms.

CNN + BERT + tritrain (Test) vs. CNN + BERT + tritrain (Yelp): 0.5862
Table 3: Pairwise comparison of various models using McNemar's test. p ≤ 0.05 indicates a significant disagreement between the model predictions.