Data Augmentation Based on Distributed Expressions in Text Classification Tasks

We propose a data augmentation method for text classification tasks that combines Doc2vec and Label spreading. The key feature of our approach is the use of unlabeled samples, which are easier to obtain than labeled ones; we use them to aid the classification model and improve its prediction accuracy. We applied this method to several text datasets, including utterances from the natural language branch of the AIWolf contest, and confirmed experimentally that applying our proposed method improves prediction accuracy.


Introduction
Analyzing human intentions in texts is a task in high demand in natural language processing. However, to solve this task well, it is necessary to prepare an enormous natural language corpus in which the intention of each text is labeled. In particular, when the context is unusual, as in in-game conversations, preprocessed training data that meets this demand is rarely available. Thus we have to label intentions manually, one by one, or pay for crowdsourcing.
To cope with this situation, we propose a method that can estimate the intention of a text with high accuracy from a large number of unlabeled samples and a relatively small number of labeled ones.

Data augmentation via unlabeled samples
There are several existing methods for performing data augmentation based on unlabeled samples. In S-EM (Nigam et al., 2000), a naive Bayes model is first constructed using only the labeled samples. The trained naive Bayes model then assigns each unlabeled sample an estimated probability for each label. Next, a new naive Bayes model is constructed using all the samples, both originally labeled and newly labeled. As with the EM algorithm, this procedure is repeated until the parameters of the model converge.
Many of the related methods involve minor changes to S-EM, such as replacing the algorithm used in intermediate steps with a more accurate one (Li and Liu, 2003).
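The iterative relabel-and-retrain procedure described above can be sketched as follows. This is a minimal illustration with scikit-learn's `MultinomialNB`, not the original S-EM implementation; the toy bag-of-words features and the convergence tolerance are assumptions for the example.

```python
# Sketch of an S-EM-style self-training loop: train naive Bayes on the
# labeled samples, estimate labels for the unlabeled ones, retrain on
# everything, and repeat until the estimates stop changing.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X, y, X_unlab, n_iter=10, tol=1e-4):
    """Iteratively relabel unlabeled samples with a naive Bayes model."""
    clf = MultinomialNB().fit(X, y)
    prev = None
    for _ in range(n_iter):
        # E-step-like phase: estimate labels for the unlabeled samples.
        proba = clf.predict_proba(X_unlab)
        y_unlab = clf.classes_[proba.argmax(axis=1)]
        # M-step-like phase: refit on original plus newly labeled data.
        X_all = np.vstack([X, X_unlab])
        y_all = np.concatenate([y, y_unlab])
        clf = MultinomialNB().fit(X_all, y_all)
        # Stop once the estimated probabilities have converged.
        if prev is not None and np.abs(proba - prev).max() < tol:
            break
        prev = proba
    return clf
```

In the full S-EM algorithm the probabilities are used as soft labels; the hard argmax here is a simplification.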

Word2vec and Doc2vec
Word2vec (Mikolov et al., 2013) is a method that expresses a word as a distributed representation, i.e., a dense vector in a high-dimensional space. These vectors exhibit arithmetic regularities such that vector('king') - vector('man') + vector('woman') approximates vector('queen'). Word2vec comes in two variants: the Continuous Bag-of-Words (CBOW) model, which predicts a word from its surrounding context words, and the Skip-gram model, which predicts the surrounding words from a given word.
Doc2vec (Le and Mikolov, 2014) is a method to perform the same operation as Word2vec on a document. It converts a document into a vector representation in high-dimensional space. As with Word2vec, documents that are close in this space can be interpreted as having a similar context.

Label spreading
Label spreading (Zhou et al., 2003) is a semi-supervised learning method. The goal of semi-supervised learning is to estimate the labels of unlabeled samples based on a small number of labeled samples. In label spreading, label information is propagated from labeled samples to nearby unlabeled samples. Each newly labeled sample in turn influences the samples around it. By repeating this propagation, the label information of the labeled samples is spread to all samples.
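The propagation described above can be sketched with scikit-learn's `LabelSpreading`, which the paper's experiments also use. The one-dimensional toy points and the kNN graph parameters here are assumptions for illustration; unlabeled samples are marked with -1.

```python
# Propagate two known labels across six points: each cluster's single
# labeled point spreads its label to the neighboring unlabeled points.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.array([[0.0], [0.1], [0.2], [2.0], [2.1], [2.2]])
y = np.array([0, -1, -1, 1, -1, -1])  # -1 marks unlabeled samples

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)
print(model.transduction_)  # labels propagated to all six samples
```

After fitting, `transduction_` holds the estimated label of every sample, labeled and unlabeled alike.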

Our proposed method
We propose a method to estimate the true labels of documents with high accuracy from a relatively small amount of labeled data.
The model training process is as follows. First, we perform word segmentation via morphological analysis on all the documents to obtain an ordered list of words. This step is specific to Japanese, which is not normally written with spaces between words, so it may be unnecessary when applying this method to other languages such as English. Based on this result, a Doc2vec model is trained using both labeled and unlabeled training samples, so that each sample corresponds to a coordinate in a high-dimensional space. After that, Label spreading is performed in this space: the label information of the labeled samples is propagated to the surrounding samples in the embedding space until all remaining unlabeled samples are labeled.
In the prediction process, we input a natural language document to the previously trained Doc2vec model to obtain its vector representation in the high-dimensional space. The Nearest centroid algorithm (Tibshirani et al., 2002) is then applied in this space: the sample is assigned the label of the class whose centroid is closest to it, and this label is taken as the estimate of its true label.
We show this method schematically in Figure 1. (1) Our objective is to estimate the label of the sample embedded at the star position. (2) If we simply apply the nearest neighbor algorithm using only the labeled samples, the estimate is unreliable. (3) In our proposed method, the labels of the unlabeled samples are first filled in by Label spreading; the Nearest centroid algorithm is then applied based on both the originally and the newly labeled samples.

Experimental setting
To verify the effectiveness of the proposed method, we conducted the following experiments. First, we prepared corpora in which the intentions are labeled. Then, we removed the label information from about 90% of each dataset. We trained the Doc2vec model with both the labeled and unlabeled data, and used it to embed all samples into the high-dimensional space. After that, we performed Label spreading to recover the label information. For comparison, we also prepared a model that simply runs Nearest centroid using only the labeled data. Finally, we input the corpora not used for training and compared the prediction accuracy on the true labels.
For Label spreading and Nearest centroid, we used the implementations of scikit-learn (Pedregosa et al., 2011).
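The label-removal step of the setup might look as follows. The dataset size and class count are toy assumptions; -1 is the "unlabeled" marker that scikit-learn's `LabelSpreading` expects.

```python
# Hide the labels of roughly 90% of the samples by replacing them
# with -1, the unlabeled marker used by LabelSpreading.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 9, size=1000)        # toy labels over 9 classes
mask = rng.random(len(y)) < 0.9          # select ~90% of the samples
y_masked = np.where(mask, -1, y)

print(round((y_masked == -1).mean(), 2))  # fraction of hidden labels
```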

Datasets
The following three corpora were used in this experiment.
Livedoor consists of documents published on an online news site. We labeled each document with the topic category in which the news appeared. There are nine categories, such as "sports" and "life hacks". Our purpose is to estimate the topic category of a news article.
WolfBBS consists of utterances generated by humans on Werewolf BBS, an online BBS for playing the Werewolf game. Nine intentions are defined, such as "COMING OUT" and "DIVINE RESULT", and each utterance is annotated with one of these nine intentions.
AIWolfNLP consists of the utterances in the natural language branch of the 4th AIWolf Contest. We labeled the intention of each utterance generated in the TALK phase. We defined 10 intentions that seem useful for understanding the game situation, such as "DIVINED WEREWOLF" and "REQUEST VOTE". Examples of the correspondence between texts and their assigned labels are shown in Table 1. Our purpose is to estimate the intention of an utterance. In this dataset, only one agent's utterances are labeled and the others are unlabeled. This setting assumes the case of actually participating in the natural language branch of the AIWolf contest: we have a complete set of utterance-intention pairs for the agent we created, but no information about the other agents.
A summary of these datasets is presented in Table 2.

Experimental results
The experimental results for each dataset are shown in Table 3. On every dataset, the proposed method, which exploits both labeled and unlabeled samples, achieved higher prediction accuracy than simply applying Nearest centroid to the labeled samples alone.

Conclusion
We proposed an effective prediction method for document classification tasks in which a large number of unlabeled samples and only a few labeled samples are available. Our experiments demonstrated that the proposed method achieved significantly higher prediction accuracy than a model trained only on labeled samples. It is often the case that text itself is available in large quantities but only a few samples are labeled; this method will be quite useful in such situations.
As future work, we should conduct similar experiments on languages other than Japanese to confirm the usefulness of the method. Although the experiments in this paper were limited to Japanese, the method has no language dependency apart from the word segmentation step, so it should be applicable to other languages as well.