Yimmon at SemEval-2019 Task 9: Suggestion Mining with Hybrid Augmented Approaches

The suggestion mining task aims to extract tips, advice, and recommendations from unstructured text. The task involves many challenges, such as class imbalance, figurative expressions, context dependency, and long and complex sentences. This paper gives a detailed description of our system submitted to SemEval-2019 Task 9 Subtask A. We transfer Self-Attention Network (SAN), a successful model in the machine reading comprehension field, into this task. Our model concentrates on modeling long-term dependency, which is indispensable for parsing long and complex sentences. Besides, we adopt techniques such as contextualized embedding, back-translation, and an auxiliary loss to augment the system. Our model achieves a performance of F1=76.3 and ranks 4th among 34 participating systems. A further ablation study shows that the techniques used in our system are beneficial to its performance.


Introduction
Suggestion mining is a trending research domain that focuses on the extraction of tips, advice, and recommendations from unstructured text. To recognize suggestions well, instead of merely matching feature words, a system must be able to understand long and complex sentences.
SemEval-2019 Task 9 provides the suggestion mining task (Negi et al., 2019). The task can be cast as text classification: given a sentence collected from user feedback, participating systems are required to give a binary output marking it as suggestion or non-suggestion.
To address this problem, we focus on modeling long-term dependency in long and complex sentences. Consequently, we transfer Self-Attention Network (SAN), a successful model in the machine reading comprehension field, where long-term dependency is crucial, into this task. Furthermore, we utilize multiple additional techniques to improve the suggestion mining system.

System description
In this paper, we consider suggestion mining as a text classification task. Figure 1 gives an overview of our model. First, the input text is converted into word embeddings with linguistic features. Then, we use several stacked semantic encoders to generate hidden representations for each token. On top of that, a softmax output layer estimates the probability of the text being a suggestion.

Input encoding
The input encoding layer is in charge of encoding each token of the input text into a vector. Tokenization is completed during preprocessing. In our work, we adopt WordPiece embedding (Wu et al., 2016) and feed it into a pretrained language model (LM) to generate contextualized embeddings. Compared to context-independent word vectors, such as the widely used GloVe (Pennington et al., 2014) and SGNS (Mikolov et al., 2013), contextualized vectors show significant advantages in disambiguation and sentence modeling. Besides, a well-pretrained language model also transfers external knowledge into this task; making full use of transfer learning is key to the recent advances in neural natural language processing. For the choice of pretrained language model, we compared two publicly available models, ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), and finally chose BERT for its better performance. Due to out-of-memory issues, we do not update the parameters of BERT during training; we only use it as a static feature extractor. Additionally, linguistic features are extracted to further improve system performance. In this work, we extract part-of-speech (POS) tags and named entities (NER) with spaCy (https://spacy.io). Two granularities of part-of-speech tags are used: the primary POS and the extended POS tag (TAG). In order to obtain linguistic feature sequences of the same length as the BERT outputs, we pad zero vectors at the start and end positions of the linguistic feature sequences. Then, the contextualized vectors and linguistic feature vectors are concatenated. A two-layer highway network (Srivastava et al., 2015) is applied on top of this representation, and the resulting vectors are projected to d dimensions.

Model encoder
The model encoder is the central part of our system, and it is in charge of modeling long-term dependency and extracting deep features. Because of the success of Self-Attention Network (SAN) (Shen et al., 2018) in various NLP tasks, we adopt a structure from QANet (Yu et al., 2018), a variant of Self-Attention Network, as our model encoder.
As is shown in the middle part of Figure 1, the model encoder is a combination of convolution, self-attention, and feed-forward layers. This structure is repeated three times. The input vectors of this structure are first added to sinusoidal position embeddings (Gehring et al., 2017), which encode a notion of order in the sequence. The position embedding is calculated as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))  (1)

where pos is the token position and i indexes the embedding dimension. After that come the convolution blocks; following (Yu et al., 2018), depth-wise separable convolution layers are chosen for better generalization and memory efficiency. The model encoder is highly dependent on layer normalization and residual connections, and each block follows a uniform structure: layer normalization / operation / residual connection. In our system, we repeat the convolution block twice for better and deeper local feature extraction.
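The sinusoidal position embedding described above can be sketched in NumPy as follows (a minimal sketch; the embedding dimension is assumed even, and the sequence length and dimension below are illustrative):

```python
import numpy as np

def sinusoidal_position_embedding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal position embedding: sin on even dimensions, cos on odd ones.
    Assumes d is even."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d // 2)[None, :]                  # (1, d/2)
    angles = positions / np.power(10000.0, 2.0 * dims / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                       # even dims
    pe[:, 1::2] = np.cos(angles)                       # odd dims
    return pe
```

The embedding is simply added to the input token vectors before the first convolution block.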
In the self-attention layer, the scaled dot-product attention is computed:

Attention(Q, K, V) = softmax(QK^T / √d_k) V  (2)

where Q, K, and V are the query, key, and value respectively, each a linear projection of the positions in the input, and d_k is the key dimension. As in (Vaswani et al., 2017), a multi-head attention mechanism is adopted, which integrates multiple scaled dot-product attentions:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O  (3)

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V). At last, there is a fully connected block. In our method, it differs slightly from the original work (Yu et al., 2018): we append a gate mechanism to reweight tokens by their importance (Wang et al., 2017):

G* = HW_G + b_G,  G = G* / max(G*)  (4)

S = G ⊙ FFN_2(H)  (5)

where we assume the input of this block is H ∈ R^(m×d), m is the sequence length, FFN_2 denotes a 2-layer non-linear feed-forward network, and S ∈ R^(m×d) represents the output of this fully connected block. G ∈ R^(m×1) and G* ∈ R^(m×1) are the output weights of the gate, and W_G ∈ R^(d×1) and b_G ∈ R^1 are trainable parameters. The maximum operation selects the maximum element of G*, and the division normalizes these weights so that the maximum weight in G is always one.
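A minimal NumPy sketch of the scaled dot-product attention and the gated block described above follows. The ReLU on the raw gate weights and the single-projection stand-in for FFN_2 are our simplifying assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for single-head 2-D inputs."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def gated_block(H, W_G, b_G, ffn):
    """Reweight each token of ffn(H) by an importance gate (the G of
    Equation 4), normalized so the largest weight is (approximately) one.
    ReLU keeps raw gate weights non-negative -- an assumption here."""
    g_star = np.maximum(H @ W_G + b_G, 0.0)   # (m, 1) raw gate weights
    G = g_star / (g_star.max() + 1e-9)        # max weight ~= 1
    return G * ffn(H)                         # broadcast over feature dim
```

A usage example: with `H` of shape `(m, d)`, `gated_block(H, W_G, b_G, ffn=np.tanh)` returns the gated sequence `S` of the same shape.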

Output layer
Given the output S = [s_1, s_2, ..., s_m] of the previous layers, the output layer converts these hidden representations into the final probability. Since we have adopted BERT in the input encoding layer, the first token of the sequence is the special classification token [CLS], whose vector can be used as the representation of the whole sentence. We therefore take the first vector s_1 ∈ R^d from S. The probability of the input text being a suggestion is estimated as follows:

p = softmax(W_p s_1 + b_p)  (6)
where W_p ∈ R^(2×d) and b_p ∈ R^2 are trainable parameters, and p ∈ R^2 denotes the output probabilities, consisting of the "yes" probability p^1 and the "no" probability p^0.

Loss
We treat this task as a text classification problem and use the log-loss as the loss function:

L_0 = -(1/N) Σ_{i=1..N} [ y_i log p^1_i + (1 - y_i) log p^0_i ]  (7)
where N is the number of examples, y_i ∈ {0, 1} is the label of the i-th example, and p^1_i and p^0_i are the predicted probabilities.
Besides, in order to better recognize important tokens and filter out trivial ones, we add an auxiliary loss that discourages large weights in G in Equation 4, so that only tokens contributing to the classification receive non-zero weights in G:
L_1 = β ‖G‖_1  (8)

where ‖·‖_1 denotes the ℓ1-norm and β is a hyper-parameter; we use β = 10^-3 in this work. The final loss is the sum of L_0 and L_1.
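The combined loss can be sketched in NumPy as follows (a minimal sketch for one batch; the clipping constant is our numerical-stability assumption):

```python
import numpy as np

def total_loss(y, p1, G, beta=1e-3):
    """Log-loss over a batch plus the auxiliary l1 penalty on the gate
    weights G of Equation 4. y: binary labels; p1: predicted "yes"
    probabilities (p0 = 1 - p1)."""
    y = np.asarray(y, dtype=float)
    p1 = np.clip(np.asarray(p1, dtype=float), 1e-12, 1 - 1e-12)
    l0 = -np.mean(y * np.log(p1) + (1 - y) * np.log(1 - p1))
    l1 = beta * np.abs(G).sum()   # sparsity pressure on gate weights
    return l0 + l1
```

With G driven toward sparsity by the ℓ1 term, uninformative tokens receive near-zero gate weights.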

Class imbalance
As is pointed out in (Negi et al., 2019), suggestions appear sparsely in online reviews and forums, which makes class imbalance a critical problem. For simplicity, we take no measures during training. At inference time, we slightly adjust the predicted probabilities by dividing each by its class prior, where the positive prior r is the rate of positive examples in the training data:

p^{1*} = p^1 / r,  p^{0*} = p^0 / (1 - r)  (9)
where p^{1*} and p^{0*} are the actual predictions used for inference; the system outputs "yes" when p^{1*} is larger than p^{0*}. Other symbols are the same as in Equation 7.
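The prior-division step described above can be sketched as a small decision function. Dividing both class probabilities by their respective priors is our reading of the description; the paper's exact formulation may differ:

```python
def adjust_for_prior(p1, p0, pos_rate):
    """Inference-time prior correction: dividing each class probability by
    its class prior boosts the rare positive class relative to the
    dominant negative class."""
    p1_star = p1 / pos_rate
    p0_star = p0 / (1.0 - pos_rate)
    return "yes" if p1_star > p0_star else "no"
```

For example, with a positive rate of 0.25, a raw prediction of (p^1, p^0) = (0.4, 0.6) is flipped to "yes" after adjustment, because 0.4/0.25 = 1.6 exceeds 0.6/0.75 = 0.8.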

Back-translation
Because the given training data set is not large, we also utilize a data augmentation technique to enrich the training data. The data augmentation technique we use is back-translation (Yu et al., 2018). Specifically, we first translate the given training data into Chinese with a neural machine translation system and then translate it back into English with another neural machine translation system. The two neural machine translation systems are trained on a subset of the WMT18 data sets. Both the original training data and the augmented data are used to train our text classification system, but the augmented data is given a small weight (0.2) when calculating the loss.
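The down-weighting of back-translated examples can be sketched as a per-example weight in the loss. Normalizing the mean by the weight sum is our assumption; the paper only states the weight itself:

```python
import math

def weighted_log_loss(batch, aug_weight=0.2):
    """Log-loss where back-translated (augmented) examples contribute with
    a reduced weight. Each batch item: (label, p_yes, is_augmented)."""
    total = weight_sum = 0.0
    for y, p1, is_aug in batch:
        w = aug_weight if is_aug else 1.0
        p1 = min(max(p1, 1e-12), 1 - 1e-12)
        total += -w * (y * math.log(p1) + (1 - y) * math.log(1 - p1))
        weight_sum += w
    return total / weight_sum
```

This keeps the augmented, noisier sentences from dominating the gradient while still letting them contribute.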

Results
Table 1 shows the main results on the test set.
Compared with the rule-based baseline, participants improved performance with substantial gains.
Our system achieved F1=76.3 on the test set and ranked 4th among all 34 teams. In order to evaluate the individual contribution of each component, we ran an ablation study. Contextualized embedding is the most critical to performance, from which we conclude that transferring knowledge learned from large corpora is vital for this task. Removing back-translation costs about 2% of performance, which clearly shows the effectiveness of data augmentation. Besides, the linguistic features, auxiliary loss, and prior adjustment are also beneficial to the system.

Ensemble models
Table 2 reports the effect of ensembling. To obtain the submission predictions, we trained eight models on the eight subsets mentioned in section 3.1. Four of these models are based on the 110M-parameter base BERT model, and the other four are based on the 340M-parameter large BERT model. The performance of every single model is reported. It is evident that different data partitions cause quite a large variance in performance; thus, ensembling is necessary. The results show that the ensemble effect improves as the number of models increases. Also, we experiment with searching for the optimal model combination under the assumption that the test labels are known, to show the performance upper bound (the oracle performance).
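The voting scheme used to integrate the eight runs can be sketched as a simple per-example majority vote (a minimal sketch; tie-breaking policy is our assumption):

```python
from collections import Counter

def majority_vote(per_model_preds):
    """per_model_preds: one list of binary predictions per model, all of
    equal length. Returns the per-example majority label across models
    (ties resolved by first-encountered label)."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_preds)]
```

With eight models, an example is labeled as a suggestion when at least five of them predict "yes".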

Conclusion
In this work, we adopt multiple techniques to improve a suggestion mining system. The core of our system is a variant of Self-Attention Network (SAN), which originates from the machine reading comprehension field. On top of this model, techniques such as contextualized embedding, back-translation, linguistic features, and an auxiliary loss are investigated to further improve system performance. Experimental results illustrate the effectiveness of our system: our model achieves a performance of F1=76.3 and ranks 4th among 34 participating systems.

Figure 1: Overview of our system.

Table 1: Performance of the top 5 systems on the leaderboard of Subtask A, and our ablation experiments.

The data set consists of user feedback about the Universal Windows Platform. It is split into an 8980-example training set, a 592-example validation set, and an 833-example test set. Since the organizer does not limit the usage of the validation set, we merge it into the training data and train our model with k-fold cross-validation. Specifically, we split all training data into eight subsets, guaranteeing that the ratio of positive to negative examples is about 1:3 in each subset. Preprocessing is implemented as described in section 2.1. The kernel size of the convolution layers is 7, the hidden size d is 256, and the number of heads in the multi-head self-attention layers is 8. The Adam optimizer (Kingma and Ba, 2014) with learning rate 0.0008 is used to tune the model parameters. The mini-batch size is 32. For regularization, the dropout rate is set to 0.1. The submission predictions are obtained by integrating eight runs through voting.

Table 2: Performance of every single model and of the ensemble models.