Part-of-Speech Tagging for Twitter with Adversarial Neural Networks

In this work, we study the problem of part-of-speech tagging for Tweets. In contrast to newswire articles, Tweets are usually informal and contain numerous out-of-vocabulary words. Moreover, there is a lack of large scale labeled datasets for this domain. To tackle these challenges, we propose a novel neural network to make use of out-of-domain labeled data, unlabeled in-domain data, and labeled in-domain data. Inspired by adversarial neural networks, the proposed method tries to learn common features through adversarial discriminator. In addition, we hypothesize that domain-specific features of target domain should be preserved in some degree. Hence, the proposed method adopts a sequence-to-sequence autoencoder to perform this task. Experimental results on three different datasets show that our method achieves better performance than state-of-the-art methods.


Introduction
During the last decade, social media have become extremely popular, on which billions of usergenerated contents are posted every day. Many users have been writing about their thoughts and lives on the go. The massive unstructured data from social media provides valuable information for a variety of applications such as stock prediction (Bollen et al., 2011), public health analysis (Wilson and Brownstein, 2009;Paul and Dredze, 2011), real-time event detection (Sakaki et al., 2010), and so on. The quality of these applications is highly impacted by the performance of natural language processing tasks. * Corresponding author.  Part-of-speech (POS) tagging is one of the most important natural language processing tasks. It has also been widely used in the social media analysis systems (Ritter et al., 2012;Lamb et al., 2013;Kiritchenko et al., 2014). Most stateof-the-art POS tagging approaches are based on supervised methods. Hence, they usually require a large amount of annotated data to train models. Many datasets have been constructed for POS tagging task. Because newswire articles are carefully edited, benchmarks usually use them for annotation (Marcus et al., 1993). However, usergenerated contents on social media are usually informal and contain many nonstandard lexical items. Moreover, the difference in domains between training data and evaluation data may heavily impact the performance of approaches based on supervised methods (Caruana and Niculescu-Mizil, 2006). Hence, most POS tagging methods cannot achieve the same performance as reported on newswire domain when applied on Twitter (Owoputi et al., 2013).
To perform the Twitter POS tagging task, some approaches have been proposed to perform the task. Gimpel et al. (2011) manually annotated 1,827 tweets and carefully studied various fea-tures. Ritter et al. (2011) also constructed a labeled dataset, which contained 787 tweets, to empirically evaluate the performance of supervised methods on Twitter. Owoputi et al. (2013) incorporated word clusters into the feature sets and further improved the performance. From these works, we can observe that the size of the training data was much smaller than the newswire domain's.
Besides the challenge of lack of training data, the frequent use of out-of-vocabulary words also makes this problem difficult to address. Social media users often use informal ways of expressing their ideas and often spell words phonetically (e.g., "2mor" for "tomorrow"). In addition, they also make extensive use of emoticons and abbreviations (e.g., ":-)" for smiling emotion and "LOL" for laughing out loud). Moreover, new symbols, abbreviations, and words are constantly being created. Figure 1 shows an example of tagged Tweet.
To tackle the challenges posed by the lack of training data and the out-of-vocabulary words, in this paper, we propose a novel recurrent neural network, which we call Target Preserved Adversarial Neural Network (TPANN) to perform the task. It can make use of a large quantity of annotated data from other resourcerich domains, unlabeled in-domain data, and a small amount of labeled in-domain data. All of these datasets can be easily obtained. To make use of unlabeled data, motivated by the work of Goodfellow et al. (2014) and Chen et al. (2016), the proposed method extends the bi-directional long short-term memory recurrent neural network (bi-LSTM) with an adversarial predictor. To overcome the defect that adversarial networks can merely learn the common features, we propose to use an autoencoder only acting on target dataset to preserve its own specific features. For tackling the out-of-vocabulary problem, the proposed method also incorporates a character level convolutional neutral network to leverage subword information.
The contributions of this work are as follows: • We propose to incorporate large scale unlabeled in-domain data, out-of-domain labeled data, and in-domain labeled data for Twitter part-of-speech tagging task.
• We introduce a novel recurrent neural network, which can learn domain-invariant rep-resentations through in-domain and out-ofdomain data and construct a cross domain POS tagger through the learned representations. The proposed method also tries to preserve the specific features of target domain.
• Experimental results demonstrate that the proposed method can lead to better performance in most of cases on three different datasets.

Approach
In this work, we propose a novel recurrent neural network, Target Preserved Adversarial Neural Network (TPANN), to learn common features between resource-rich domain and target domain, simultaneously to preserve target domain-specific features. It extends the bi-directional LSTM with adversarial network and autoencoder. The architecture of TPANN is illustrated in Figure 2. The model consists of four components: Feature Extractor, POS Tagging Classifier, Domain Discriminator and Target Domain Autoencoder. In the following sections, we will detail each part of the proposed architecture and training methods.

Feature Extractor
The feature extractor F adopts CNN to extract character embedding features, which can tackle the out-of-vocabulary word problem effectively.
To incorporate word embedding features, we concatenate word embedding to character embedding as the input of bi-LSTM on the next layer. Utilizing a bi-LSTM to model sentences, F can extract sequential relations and context information.
We denote the input sentence as x and the i-th word as x i . x i ∈ S(x) and x i ∈ T (x) represent input samples are from source domain and target domain, respectively. We denote the parameters of F as θ f . Let V be the vocabulary of words, and C be the vocabulary of characters. d is the dimensionality of character embedding then Q ∈ R d×|C| is the representation matrix of vocabulary. We assume that word x i ∈ V is made up of a sequence of characters C i = [c 1 , c 2 , . . . , c l ], where l is the max length of word and every word will be padded to this length. Then C i ∈ R d×l would be the inputs of CNN.
We apply a narrow convolution between C i and filter H ∈ R d×k , where k is the width of the filter. After that we add a bias and apply nonlinearity to obtain a feature map m i ∈ R l−k+1 . Specifically, the j-th element of m i is given by: is the Frobenius inner product. We then apply a max-over-time pooling operation (Collobert et al., 2011) over the feature map. CNN uses multiple filters with varying widths to obtain the feature vector c i for word x i . Then, the character-level feature vector c i is concatenated to the word embedding w i to form the input of bi-LSTM on the next layer. The word embedding w is pretrained on 30 million tweets. Then, the hidden states h of bi-LSTM turn into the features that will be transfered to P, Q and R, i.e. F(x) = h.

POS Tagging Classifier and Domain Discriminator
POS tagging classifier P and domain discriminator Q take F(x) as input. They are standard feed-forward networks with a softmax layer for classification. P predicts POS tagging label to get classification capacity, and Q discriminates domain label to make F(x) domain-invariant. The POS tagging classifier P maps the feature vector F(x i ) to its label. We denote the parameters of this mapping as θ y . The POS tagging classifier is trained on N s samples from the source domain with the cross entropy loss: where y i is the one-hot vector of POS tagging label corresponding to x i ∈ S(x),ŷ i is the output of top softmax layer:ŷ i = P(F(x i )).
During the training time, The parameters θ f and θ y are optimized to minimize the classification loss L task . This ensures that P(F(x i )) can make accurate prediction on the source domain. Conversely, domain discriminator maps the same hidden states h to the domain labels with parameters θ d . The domain discriminator aims to discriminate the domain label with following loss function: where d i is the ground truth domain label for sample i,d i is the output of top layer:d i = Q(F(x i )). N t means N t samples from the target domain. The domain discriminator is trained towards a saddle point of the loss function through minimizing the loss over θ d while maximizing the loss over θ f (Ganin et al., 2016). Optimizing θ f ensures that the domain discriminator can't discriminate the domain, i.e., the feature extractor finds the common features between the two domains.

Target Domain Autoencoder
Through training adversarial networks, we can obtain domain-invariant features h common , but it will weaken some domain-specific features which are useful for POS tagging classification. Merely obtaining domain invariant features would therefore limit the classification ability.
Our model tries to tackle this defect by introducing domain-specific autoencoder R, which attempts to reconstruct target domain data. Inspired by (Sutskever et al., 2014) but different from (Dai and Le, 2015), we treat the feature extractor F as encoder. In addition, we combine the last hidden states of the forward LSTM and backward LSTM in F as the initial state h 0 (dec) of the decoder LSTM. Hence, we don't need to reverse the order of words of the input sentences and the model avoids the difficulty of "establish communication" between the input and the output (Sutskever et al., 2014).
Similar to (Zhang et al., 2016), we use h 0 (dec) and embedding vector of the previous word as the inputs of the decoder, but in a computationally more efficient manner by computing previous word representation.
We assume that (x 1 , · · · ,x T ) is the output sequence. z t is the t-th word representation: z t = M LP (h t ), and M LP is the multiple perceptron function. Hidden state is the concatenation operation. We estimate the conditional probability p(x 1 , · · · ,x T |h 0 (dec)) as follows: where each p(x t |h 0 (dec), z 1 , · · · , z t−1 ) distribution is computed with softmax over all the words in the vacabulary. Our aim is to minimize the following loss function with respect to parameters θ r : where x i is the one-hot vector of i-th word. This makes h 0 (dec) learn an undercomplete and most salient sentence representation of target domain data. When the adversarial networks try to optimize the hidden representation to common representation h common , The target domain autoencoder counteracts a tendency of the adversarial network to erase target domain features by optimizing the common representation to be informative on the target-domain data.

Training
Our model can be trained end-to-end with standard back-propagation, which we will detail in this section.
Our ultimate training goal is to minimize the total loss function with parameters {θ f , θ y , θ r , θ d } as follows: where α, β, γ are the weights to balance the effects of P, R and Q.
For obtaining domain-invariant representation h common , inspired by (Ganin and Lempitsky, 2015), we introduce a special gradient reversal layer (GRL), which does nothing during forward propagation, but negates the gradients if it receives backward propagation, i.e. g(F(x)) = F(x) but ∇g(F(x)) = −λ∇F(x). We insert the GRL between F and Q, which can run standard Stochastic Gradient Descent with respect to θ f and θ d . The parameter −λ drives the parameters θ f not to amplify the dissimilarity of features when minimize L tpye . So by introducing a GRL, F can drive its parameters θ f to extract hidden representations that help the POS tagging classification and hamper the domain discrimination.
In order to preserve target domain-specific features, we only optimize the autoencoder on target domain data for reconstruction tasks.
Through above procedures, the model can learn the common features between domains, simultaneously preserve target domain-specific features. Finally, we can update the parameters as follows: where µ is the learning rate. Because the size of the WSJ is more than 100 times that of the labeled Twitter dataset, if we directly train the model with the combined dataset, the final results are much worse than those using two training steps. So, we adopt adversarial training on WSJ and unlabeled Twitter dataset at the first step, then use a small number of in-domain labeled data to fine-tune the parameters with a low learning rate.

Experiments
In this section, we will detail the datasets used for experiments and experimental setup.

Datasets
The methods proposed in this work incorporate out-of-domain labeled data from resource-rich domains, large scale unlabeled in-domain data, and a small number of labeled in-domain data. The datasets used in this work are as follows: Labeled out-of-domain data. We use a standard benchmark dataset for adversarial POS tagging, namely the Wall Street Journal (WSJ) data from the Penn TreeBank v3 (Marcus et al., 1993), sections 0-24 for the out-of-domain data. Labeled in-domain data. For training and evaluating POS tagging approaches, we compare the proposed method with other approaches on three benchmarks: RIT-Twitter (Ritter et al., 2011), NPSCHAT (Forsyth, 2007, and ARK-Twitter (Gimpel et al., 2011). Unlabeled in-domain data. For training the adversarial network, we need to use a dataset that has large scale unlabeled tweets. Hence, in this work, we construct large scale unlabeled data (UNL), from Twitter using its API.
The detailed data statistics of the datasets used in this work are listed in Table 1.

Experimental Setup
We select both state-of-the-art and classic methods for comparison, as follows: • Stanford POS Tagger: Stanford POS Tagger is a widely used tool for newswire domains (Toutanova et al., 2003). In this work, we train it using two different sets, the WSJ (sections 0-18) and a WSJ, IRC, and Twitter mixed corpus. We use Stanford-WSJ and Stanford-MIX to represent them, respectively.
• T-POS: T-Pos (Ritter et al., 2011) adopts the Conditional Random Fields and clustering algorithm to perform the task. It was trained from a mixture of hand-annotated tweets and existing POS-labeled data.
• GATE Tagger: GATE tagger (Derczynski et al., 2013) is based on vote-constrained bootstrapping with unlabeled data. It combines cases where available taggers use different tagsets.
• ARK Tagger: ARK tagger (Owoputi et al., 2013) is a system that reports the best accuracy on the RIT dataset. It uses unsupervised word clustering and a variety of lexical features.
• bi-LSTM: Bidirectional Long Short-Term Memory (LSTM) networks have been widely used in a variety of sequence labeling tasks (Graves and Schmidhuber, 2005). In this work, we evaluate it at character level, word level, and combining them together. bi-LSTM (word level) uses one layer of bi-LSTM to extract word-level features and adopts a random initialization method to transform words to vectors. bi-LSTM (character level) represents a method that combines bi-LSTM and CNN-based character embedding, a similar approach with character-aware neural network described in (Kim et al., 2015) to handle the out-ofvocabulary words. bi-LSTM (word level pretrain) architecture is the same as that of bi-LSTM(word level) but adopts word2vec tool (Mikolov et al., 2013) to vectorize. bi-LSTM (combine) concatenates word to character features.

Methods
RIT-Test RIT-Dev Stanford-WSJ (Toutanova et al., 2003) 73.37% 83.29% Stanford-MIX 83.14% 84.19% T-POS (Ritter et al., 2011) 84.55% 84.83% GATE Tagger (Derczynski et al., 2013) 88.69% 89.37% ARK Tagger (Owoputi et al., 2013) 90  The hyper-parameters used for our model are as follows. AdaGrad optimizer trained with crossentropy loss is used with 0.1 as the default learning rate. The dimensionality of word embedding is set to 200. The dimensionality for random initialized character embedding is set to 25. We adopt a bi-LSTM for encoding with each layer consisting of 250 hidden neurons. We set three layers of standard LSTM for decoding. Each LSTM layer consists of 500 hidden neurons. Adam optimizer trained with cross-entropy loss is used to fine-tune with 0.0001 as the default learning rate. Finetuning is run for 100 epochs using early stop.

Results and Discussion
In this section, we will report experimental results and a detailed analysis of the results for the three different datasets.

Evaluation on RIT-Twitter
The RIT-Twitter is split into training, development and evaluation sets (RIT-Train, RIT-Dev, RIT-Test). The splitting method is shown in (Derczynski et al., 2013), and the dataset statistics are listed in Table 1. Table 2 shows the results of our method and other approaches on the RIT-Twitter dataset. RIT-Twitter uses the PTB tagset with several Twitter-specific tags: retweets, @usernames, hashtags, and urls. Since words in these categories can be tagged almost perfectly using simple regular expressions, similar to (Owoputi et al., 2013), we use regular expressions to tags these words appropriately for all systems.
From the results of the Stanford-WSJ, we can observe that the newswire domain is different from Twitter. Although the token-level accuracy of the Stanford POS Tagger is higher than 97.0% on the PTB dataset, its performance on Twitter drops sharply to 73.37%. By incorporating some indomain labeled data for training, the accuracy of Stanford POS Tagger can reach up to 83.14%. Taking a variety of linguistic features and many other resources into consideration, the T-POS, GATE tagger, and ARK tagger can achieve better performance.
The second part of Table 2 shows the results of the bi-LSTM based methods, which are trained on the RIT-Train dataset. According to the results of word level, we can see that word2vec can provide valuable information. The pre-trained word vectors in bi-LSTM(word level pretrain) give almost 10% higher accuracy than bi-LSTM(word level).
Comparing the character-level bi-LSTM with word-level bi-LSTM with random initialization, we can observe that the character-level method can achieve better performance than the word-level method. bi-LSTM(combine) combines word with character features, as described in Section 2.1, which achieves the best results at 89.48% in the bi-LSTM based baseline systems and shows that the morphological features and pre-trained word vectors are both useful for POS tagging.
The third part of Table 2 shows the results of our methods incorporating out-of-domain labeled data, in-domain unlabeled data, and in-domain labeled data. Putting everything together, our model can achieve 90.92% on this dataset. Compared with the architecture without an adversarial model, our method is almost 1% better. It demonstrates that adversarial networks can significantly help with tasks of this nature. Through introducing the autoencoder in target domain, we can preserve domain-specific features for better performance. Compared with the ARK tagger, which achieves the previous best result on this dataset, our model is also 0.52% better, the error reduction rate is more than 5.5%.
To better understand why adversarial networks can help transfer domains from newswire to Twitter, in this work we also followed the method Ganin and Lempitsky (2015) used to visualize the outputs of LSTM with t-SNE (Van Der Maaten, 2013). Figure 3 shows the visualization results. From the figure, we can see that the adversary in our method makes the two distributions of features much more similar, which means that the outputs of bi-LSTM are domain-invariant. Hence, the PTB training data can provide much more help than directly combining PTB and RIT-Train together.

Evaluation on NPSChat
IRC, which contains Internet relay room messages from 2006, is a medium of online conversational text. Its content is very similar to tweets. We evaluate the proposed method on the NPSChat corpus (Forsyth, 2007), a PTB-part-of-speech annotated dataset of IRC. We compared our method with a tagger in the same setup as experiments with (Forsyth, 2007). The training part contains 90% of the data. The testing part contains the other 10%. Table 3 shows the results of the ARK Tagger and our method. We used PTB, unlabeled Twitter, and the training part of NPSChat to train our model. From the results, we can see that our model achieved 94.1% accuracy. This is significantly better than the result Forsyth (2007) reported, which was 90.8%. They trained their tagger with a mix of several POS-annotated corpora (12K from Twitter, 40K from IRC, and 50K from PTB). Our method also outperforms state-of-the-art results 93.4%±0.3%, which was achieved by the ARK Tagger with various external corpus and features, e.g., Brown clustering, PTB, Freebase lists of celebrities, and video games. Gimpel et al. (2011)

Evaluation on ARK-Twitter
ARK-Twitter data contains an entire dataset consisting of a number of tweets sampled from one particular day (October 27, 2010) described in (Gimpel et al., 2011). This part is used for training. They also created another dataset, which consists of 547 tweets, for evaluation (DAILY547). This dataset consists of one random English tweet from every day between January 1, 2011 and June 30, 2012. The distribution of training data may be slightly different from the testing data, for example a substantial fraction of the messages in the training data are about a basketball game. Since ARK-Twitter uses a different tagset with PTB, we manually construct a table to link tags for the two datasets. Table 4 shows the results of different methods on this dataset. From the results, we can see that our method can achieve a better result than (Gimpel et al., 2011). However, the performance of our method is worse than the ARK Tagger. Through analyzing the errors, we find that 16.7% errors occurr between nouns and proper nouns. Since our method do not include any ontology or knowledge, proper nouns can not be easily detected. However, the ATK Tagge add a tokenlevel name list feature. The name list is useful for proper nouns recognition, which fires on names from many sources, such as Freebase lists of celebrities, the Moby Words list of US Locations, proper names from Mark Kantrowitz's name corpus and so on. So, our model is also competitive when lacking of manual feature knowledge.

Related Work
Part-of-Speech tagging is an important preprocessing step and can provide valuable information for various natural language processing tasks.
In recent years, deep learning algorithms have been successfully used for POS tagging. A number of approaches have been proposed and have achieved some progress. Santos and Guimaraes (2015) proposed using a character-based convolutional neural network to perform the POS tagging problem. Bi-LSTMs with word, character or unicode byte embedding were also introduced to achieve the POS tagging and named entity recognition tasks (Plank et al., 2016;Chiu and Nichols, 2015;Ma and Hovy, 2016). In this work, we study the problem from a domain adaption perspective. Inspired by these works, we also propose to use character-level methods to handle out-ofvocabulary words and bi-LSTMs to model the sequence relations.
Adversarial networks were successfully used for image generation (Goodfellow et al., 2014;Dosovitskiy et al., 2015;Denton et al., 2015), domain adaption (Tzeng et al., 2014;Ganin et al., 2016), and semi-supervised learning (Denton et al., 2016). The key idea of adversarial networks for domain adaption is to construct invariant features by optimizing the feature extractor as an adversary against the domain classifier (Zhang et al., 2017).
Sequence autoencoder reads the input sequence into a vector and then tries to reconstruct it. Dai and Le (2015) used the model on a number of different tasks and verified its validity. Li et al. (2015) introduced the model to hierarchically build an embedding for a paragraph, showing that the model was able to encode texts to preserve syntactic, semantic, and discourse coherence.
In this work, we incorporate adversarial networks with autoencoder to obtain domaininvariant features and keep domain-specific features. Our model is more suitable for target domain tasks.

Conclusion
In this work, we propose a novel adversarial neural network to address the POS tagging problem. Besides learning common representations between source domain and target domain, it can simultaneously preserve specific features of target domain. The proposed method leverages newswire resources and large scale in-domain unlabeled data to help POS tagging classification on Twitter, which has a few of labeled data. We evaluate the proposed method and several state-ofthe-art methods on three different corpora. In most of the cases, the proposed method can achieve better performance than previous methods. Experimental results demonstrate that the proposed method can make full use of these resources, which can be easily obtained.