Text Classification with Negative Supervision

Advanced pre-trained models for text representation have achieved state-of-the-art performance on various text classification tasks. However, the discrepancy between the semantic similarity of texts and the labelling standard affects classifiers, leading to lower performance when classifiers should assign different labels to semantically similar texts. To address this problem, we propose a simple multitask learning model that uses negative supervision. Specifically, our model encourages texts with different labels to have distinct representations. Comprehensive experiments show that our model outperforms the state-of-the-art pre-trained model on single- and multi-label classification, sentence and document classification, and classification in three different languages.


Introduction
Text classification generally consists of two processes: an encoder that converts texts to numerical representations and a classifier that estimates hidden relations between the representations and class labels. The text representations are generated using N-gram statistics (Wang and Manning, 2012), word embeddings (Joulin et al., 2017), convolutional neural networks (Kalchbrenner et al., 2014; Zhang et al., 2015), and recurrent neural networks (Yang et al., 2016, 2018). Recently, powerful pre-trained models for text representations, e.g. Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), have shown state-of-the-art performance on text classification tasks using only a simple classifier of a fully connected layer.
However, a problem occurs when a classification task is adversarial to text encoders. Encoders aim to represent the meanings of texts; hence, semantically similar texts tend to have closer representations.

Sentence | Label | BERT
A cold is a legit disease. | - | Cold
Oh my god! I caught a cold! | Cold | Cold

Table 1: Examples of BERT classification for labelling a disease contracted by a writer. Both sentences are about the common cold. Only the second example indicates that the writer had a cold. BERT misclassified the first sentence.

Meanwhile, a classifier should distinguish subtle differences that lead to different label assignments, although the texts are semantically similar. Table 1 shows an example of classification results using BERT for the MedWeb dataset. This task requires labelling a disease contracted by the writer of a text. Although both texts in Table 1 refer to the common cold, only the second example implies that the writer had a cold. BERT mistakenly labelled both texts as Cold, likely owing to their semantic relatedness. When the standard of class label assignments disagrees with the semantic similarity, the classifier tends to be error-prone owing to the excessive effects of the semantic similarity.
To address this problem, we propose utilizing negative examples, i.e. texts with different labels, to enable negative supervision of the encoder for generating distinct representations for each class. In this study, we design a simple multitask learning model that trains two models simultaneously with a shared text encoder. The first model learns an ordinary classification task (herein referred to as the main task). Meanwhile, the second model encourages representations with different class labels to be distinct (herein referred to as the auxiliary task).

Figure 1: Our model consists of a classifier, discriminator, and shared text encoder. The main task learns classification, while the auxiliary task gives negative supervision to generate distinct representations for sentences with different labels.

We empirically show the effectiveness of our model using standard benchmarks of five single-label and four multi-label classification datasets. This study has two main contributions.
• Our multitask learning model consistently outperforms the state-of-the-art model on single- and multi-label classification, sentence and document classification, and classification in three languages.
• Our model is simple and easily applicable to any text encoder and classifier.
2 Multitask Learning Framework

Figure 1 shows an overview of our multitask learning framework, which consists of main and auxiliary tasks. Herein, we refer to the model for the main task as a classifier and the model for the auxiliary task as a discriminator. The overall loss function L sums the loss of the main task L_m and that of the auxiliary task L_a:

L = L_m + L_a.

The classifier and discriminator share and jointly optimize the text encoder, which encodes an input text into a d-dimensional vector v ∈ R^d. In this paper, we use the terms text and representation interchangeably when the intention is obvious from the context.
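As a concrete sketch of how the two tasks share one encoder, the following toy Python fragment wires a stand-in encoder into the summed objective L = L_m + L_a. The hash-style `encode` function, `training_step`, and the loss callables are our illustrative stand-ins (not BERT or the paper's implementation):

```python
def encode(text, d=4):
    """Stand-in for the shared text encoder (e.g. BERT): maps a text to a
    d-dimensional vector. A toy character-based embedding for illustration."""
    vec = [0.0] * d
    for i, ch in enumerate(text):
        vec[i % d] += ord(ch) / 1000.0
    return vec

def training_step(text, negatives, loss_main_fn, loss_aux_fn):
    """One step of the multitask objective L = L_m + L_a over a shared encoder.
    Both loss terms are computed from the same encoder outputs, so gradients
    from both tasks would flow into the shared parameters."""
    v_m = encode(text)
    v_negs = [encode(t) for t in negatives]
    return loss_main_fn(v_m) + loss_aux_fn(v_m, v_negs)
```

In a real implementation the two callables would be the classifier's negative log-likelihood and the discriminator's negative-supervision loss described in the following sections.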

Main Task
The main task is the primary classification task to optimize. We use a simple classifier as employed in BERT. The classifier takes an input vector v_m and calculates probabilities p ∈ R^|C| for assigning a set of class labels C:

p = g(W v_m + b),

where W ∈ R^{|C|×d} and b ∈ R^|C| are parameters of the classifier, and |·| counts the number of elements in a set.
For g, we employ a softmax function for single-label classification and a sigmoid function for multi-label classification. In both cases, L_m is the negative log-likelihood of predictions.
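The classifier head can be sketched in plain Python as follows; vectors are represented as lists, and `classifier_probs`/`nll_loss` are our illustrative names, not from the paper:

```python
import math

def classifier_probs(v, W, b, multi_label=False):
    """Linear layer p = g(Wv + b); g is softmax for single-label
    classification and an element-wise sigmoid for multi-label."""
    logits = [sum(wi * vi for wi, vi in zip(row, v)) + bi
              for row, bi in zip(W, b)]
    if multi_label:
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll_loss(probs, gold_index):
    """Negative log-likelihood of the gold class (single-label case)."""
    return -math.log(probs[gold_index])
```

For multi-label classification the loss would instead sum the binary negative log-likelihoods over all labels.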

Auxiliary Task
The auxiliary task aims to give negative supervision to encourage distinct representations of texts with different labels. The discriminator samples a set of n texts v_1^a, ..., v_n^a from the same batch as v_m, all of which have labels different from that of v_m.
To encourage these texts to have distinct representations, we design the loss function L_a as

L_a = Σ_{i=1}^{n} cossim(v_m, v_i^a),

where the cossim function computes the cosine similarity between two representations. Minimizing this loss intuitively encourages the negative examples to have smaller cosine similarities with the input representation.
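Under our reading that L_a aggregates the cosine similarities of the n negative examples by summation (the exact aggregation is an assumption on our part), a minimal sketch is:

```python
import math

def cossim(u, v):
    """Cosine similarity between two vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def negative_supervision_loss(v_m, negatives):
    """Auxiliary loss L_a: sum of cosine similarities between the input
    representation and the n negative examples drawn from the same batch.
    Minimizing it pushes differently-labelled texts apart."""
    return sum(cossim(v_m, v_neg) for v_neg in negatives)
```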

Experiments
We conducted a comprehensive evaluation to investigate the performance of our model in terms of (a) single- and multi-label classification, (b) sentence- and document-level classification, and (c) different languages. We collected standard evaluation datasets from heterogeneous sources, as summarised in Table 2.

Single-Label Classification
As datasets that assign single labels to sentences, we used the following datasets from the SentEval (Conneau and Kiela, 2018) benchmark.
MR Binary classification of sentiment polarity of movie reviews.

CR Binary classification of sentiment polarity of product reviews.

SST-5 Multi-class classification of the fine-grained sentiment polarity of movie reviews. Labels are Positive, Somewhat Positive, Neutral, Somewhat Negative, and Negative.

Because the MR, CR, and SUBJ datasets do not separate validation and test sets, we split off 20% of each dataset for testing and 20% of the remainder for validation. The evaluation metric for these single-label classification tasks is accuracy.

Multi-Label Classification
We used the NTCIR-13 MedWeb and arXiv (Yang et al., 2018) datasets for multi-label classification.
MedWeb Assigning labels for diseases that the writer of a sentence contracted.

arXiv Classification of the areas of abstracts extracted from papers in the computer science field.

Because the arXiv dataset released by Yang et al. (2018) removed all line breaks, we created one ourselves. We collected the abstracts and categories of papers submitted to arXiv from January 1st, 2019 to June 4th, 2019 using the arXiv API. The evaluation metric for multi-label classification is Exact-Match:

Exact-Match = (1/M) Σ_{i=1}^{M} I(y_i = ŷ_i),

where y_i and ŷ_i are one-hot vectors of the gold and predicted labels, respectively, and I(x) takes 1 when x is true and 0 otherwise. M is the size of the test set.
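A minimal sketch of the Exact-Match metric, with the gold and predicted label vectors represented as Python lists (`exact_match` is our illustrative name):

```python
def exact_match(gold, pred):
    """Exact-Match over a test set of size M: the fraction of examples
    whose predicted label vector equals the gold label vector exactly.
    For multi-label classification, every label must match."""
    assert len(gold) == len(pred), "gold and predictions must align"
    hits = sum(1 for y, y_hat in zip(gold, pred) if y == y_hat)
    return hits / len(gold)
```

Note that this is a strict metric: a prediction missing even one of several gold labels counts as a complete miss.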

Settings
As text encoders, we employed BERT and a Hierarchical Attention Network (HAN) (Yang et al., 2016) for generating sentence and document representations, respectively. For BERT, we used the pre-trained BERT-base (d = 768). We implemented the HAN following Yang et al. (2016), who used a bi-directional Gated Recurrent Unit as the encoder with a hidden size of 50 (d = 50). The embedding layer of the HAN was initialised using CBOW (Mikolov et al., 2013) embeddings (with dimensions of 200), which were trained using negative sampling on the training and development sets of each task.

For systematic comparison, we investigated the performance of the following models. As a baseline, we compared models that conduct only the main task (referred to as Baseline), which corresponds to the fine-tuned BERT-base for sentence classification and the original HAN for document classification. Note that this BERT baseline significantly outperforms previous state-of-the-art methods, which were also compared in the experiment. To investigate the effects of negative supervision in the auxiliary task, we compared our model to one that also predicts sentences with the same label; specifically, this model conducts classification over cosine similarities using a cross-entropy loss (referred to as ACE, the auxiliary task with cross-entropy loss). Furthermore, we evaluated two variations of our model. The first purely gives negative supervision, i.e. the auxiliary task only encourages the generation of representations distinct from negative examples, as described in Section 2.2 (referred to as AAN, the auxiliary task using all negative examples). The second uses the following margin-based loss as L_a, with a positive example as well as negative examples:

L_a = Σ_{i≠k} max(0, δ − cossim(v_m, v_k^a) + cossim(v_m, v_i^a)),

where the k-th sample is selected to have the same label as the input v_m to the main task and δ is a margin empirically set to 0.4 (referred to as AM, the auxiliary task with the margin-based loss). The intuition is that texts with the same label should have more similar representations than negative examples.
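Assuming the margin-based loss is a triplet-style hinge over cosine similarities (our reading of the description; the parameter names `positive` and `margin_loss` are illustrative), a sketch with δ = 0.4:

```python
import math

def cossim(u, v):
    """Cosine similarity between two vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def margin_loss(v_m, positive, negatives, delta=0.4):
    """Triplet-style hinge: each negative example should be at least delta
    less similar to v_m than the same-label (positive) example is; pairs
    already separated by the margin contribute zero loss."""
    pos_sim = cossim(v_m, positive)
    return sum(max(0.0, delta - pos_sim + cossim(v_m, v_neg))
               for v_neg in negatives)
```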
We set the batch size of the main task to 16 and set n to four in the auxiliary task, which performed best on the validation set of the MR task. We used early stopping to cease training when the validation score did not improve for 10 epochs. The optimization algorithm was Adam (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.999. For each task, we selected the best learning rate among 1e-5, 3e-5, and 5e-5 using the validation set. To alleviate randomness owing to initialization, we report average scores over 10 trials, excluding the best and worst results.

Results

Table 3 shows the performance of all compared methods as well as the performance of previous state-of-the-art methods (referred to as SOTA). The results in Table 3 indicate that our AM and AAN models consistently outperform the strong Baselines on single- and multi-label classification, sentence and document classification, and classification in different languages. Most notably, our models are effective even for multi-label classification, which is more challenging than its single-label counterpart. In general, AAN achieved greater performance than AM; however, their effectiveness turned out to be task-dependent.
Unlike AM and AAN, ACE degraded the performance of the Baseline except for the MedWeb Japanese task. This result shows that simple multitask learning is ineffective and that our design using negative supervision is crucial.
SST-5 is an exception wherein our models degraded the performance of the Baseline. We hypothesise that this is because its class labels are gradational; e.g. Somewhat Negative is closer to Negative than to Positive. AM and AAN treat all negative examples equally, disregarding factors such as the relations between class labels. Future work should incorporate the semantic relations among class labels into the auxiliary task.

Related Work
Multitask learning has been employed to improve the performance of text classification (Liu et al., 2019;Xiao et al., 2018). Previous studies aimed to improve multiple tasks; hence, they required multiple sets of annotated datasets. In contrast, our method does not require any extra labelled datasets and is easily applicable to various classification tasks.
The methods proposed by Arase and Tsujii (2019) and Phang et al. (2018) improved BERT's classification performance by further training the pre-trained model on natural language inference and paraphrase recognition. Similar to multitask learning, both methods require an additional large-scale labelled dataset. Furthermore, these previous studies revealed that the similarity of the training tasks affects the models' final performance (Xiao et al., 2018; Arase and Tsujii, 2019). Our method achieved consistent improvements across tasks, indicating its wider applicability.

Conclusion
In this paper, we proposed a simple multitask learning model that uses negative supervision to generate distinct representations for texts with different labels. Comprehensive evaluation empirically confirmed that our model consistently outperformed BERT and HAN models on single-and multi-label classifications, sentence and document classifications, and classifications in different languages. Our multitask learning model provides a general framework that is easily applicable to existing text classification models.
In future work, we will examine semantic relations between class labels in the auxiliary task. Moreover, we will adapt our model to text generation tasks. We expect our model to encourage a generation model to give texts with different labels, such as styles, distinct representations, which will result in class-specific expressions.

A.2 Labels in MedWeb Dataset

Table 5 lists all the labels defined in the MedWeb dataset: Runnynose, Cough, Influenza, Diarrhea, Hayfever, Fever, Headache, and Cold. The same label set is used for the Japanese, English, and Chinese tasks. The MedWeb task requires estimating whether the writer of a text contracted the diseases or had the symptoms in Table 5. When the writer has none of them, the text is allowed to have no label.

Table 6 lists the labels used in our arXiv dataset, which are sub-areas of the computer science field. The arXiv task is document-level, multi-label classification; it requires predicting all the areas to which a paper belongs from its abstract.

Table 6: Labels in arXiv dataset