Transferring from Formal Newswire Domain with Hypernet for Twitter POS Tagging

Part-of-Speech (POS) tagging for Twitter has received considerable attention in recent years. Because most POS tagging methods are based on supervised models, they usually require a large amount of labeled data for training. However, the existing labeled datasets for Twitter are much smaller than those for newswire text. Hence, to help POS tagging for Twitter, most domain adaptation methods try to leverage newswire datasets by learning the shared features between the two domains. However, from a linguistic perspective, Twitter users not only tend to mimic the formal expressions of traditional media, like news, but they also appear to be developing linguistically informal styles. Therefore, POS tagging for the formal Twitter context can be learned together with the newswire dataset, while POS tagging for the informal Twitter context should be learned separately. To achieve this task, in this work, we propose a hypernetwork-based method to generate different parameters to separately model contexts with different expression styles. Experimental results on three different datasets show that our approach achieves better performance than state-of-the-art methods in most cases.


Introduction
With the continuous growth of online communication, hundreds of millions of online conversational messages have become important resources for various applications such as real-time event detection (Sakaki et al., 2010), stock prediction (Bollen et al., 2011) and public health analysis (Wilson and Brownstein, 2009). Because these applications need to process natural language text, POS tagging, which is one of the fundamental natural language processing tasks, has become one of the basic pre-processing components of such applications. The performance of POS tagging may strongly affect the results of these applications.

Tweet: RT @jamstik : Lol :) there is more than one way to start living a greener life.
WSJ: As their varied strategies suggest , there is more than one way to respond to a disaster. (Wall Street Journal section of the Penn Treebank, Treebank-3, LDC1999T42, /07/WSJ_0799.POS)
Figure 1: Examples of WSJ and tweets. Segments with red highlights can be regarded as similar expressions. Segments with blue highlights correspond to expressions that cannot be learned from WSJ.
Most of the POS tagging methods that achieve state-of-the-art performance are based on supervised learning algorithms (Gimpel et al., 2011). Although these methods achieve good performance on in-domain data, their performance usually drops quickly when processing data from a domain that differs from that of the training data (Caruana and Niculescu-Mizil, 2006). To achieve better performance, we usually need to manually label a large amount of in-domain data. However, constructing labeled data is time-consuming and tedious. Various methods have been proposed to solve this problem using out-of-domain data, including domain adaptation (Daumé III, 2009; Gui et al., 2017), multi-task learning (Ben-David et al., 2007), and dual learning (Chandrasekaran et al., 2014).
Most existing methods aim to learn shared representations or parameters, which can reduce the classification or regression errors of each task or domain. However, these methods usually ignore the fact that each domain has domain-specific features that should not be shared. In terms of language characteristics, in addition to expressions that are similar to newswire text, Twitter contains informal expressions that cannot be shared, as shown in Figure 1. Hu et al. (2013) investigated the characteristics of language on Twitter and found that Twitter users not only tended to mimic the linguistic practices of traditional media, like news, but also appeared to be developing linguistically unique styles. We likewise believe that tweets do not simply follow standard language rules, but at some point deviate from them. Thus, tweets are a combination of formal and informal expressions, with conversions between the two expression styles.
Based on the above observations, we believe that annotated newswire sentences can be selectively used to help tag contextual segments of tweets, especially formal expressions. To achieve this goal, in this work we adopt bidirectional long short-term memory (bi-LSTM) networks (Schuster and Paliwal, 1997), which have been successfully used for various sequence tagging problems. However, unlike in previous methods, the formal and informal expressions in a sentence should be modeled separately. Inspired by recent work on dynamic parameter prediction (Ha et al., 2016), we propose a method that generates context-specific parameters for the bi-LSTM based on the style of the context. We evaluated our models on three different corpora. The results demonstrate that the proposed method benefits from annotated newswire data and achieves competitive performance. In addition, we visualized the context distributions and the changes of the parameters; the visualization verifies that different contexts produce different parameters for POS tagging.
The main contributions of the paper can be summarized as follows.

1. We study how to model segments in order to apply domain adaptation to the POS tagging task. Based on observations from a linguistic perspective, we found that many expressions are shared between newswire text and tweets, while some expressions cannot be shared.

2. We propose a novel neural network architecture based on the bi-LSTM to perform the task, in which different parameters are applied in different contexts.

3. Experimental results demonstrate that the proposed method can benefit from different domains. We also conduct qualitative and quantitative analyses to show why our model achieves better performance.

Approach
Twitter is full of colorful linguistic expressions and of conversions between formal and informal expressions. To address this problem, we propose the Dynamic Conversion Neural Network (DCNN), which can dynamically generate different parameters for POS tagging based on different contexts, as shown in Figure 2. Our model mainly consists of four parts: (1) a CNN layer for extracting word representations x_i, (2) an MLP layer for producing low-dimensional context representations t_i, (3) a hyper LSTM layer for generating the weights W of a main LSTM, and (4) the main LSTM layer with dynamic parameters for POS tagging. The architecture of the main network is the same as that of any sequence labeling model, which learns to map word representations to the corresponding labels. However, the parameters of the main network can be modified according to our purpose. We use a low-dimensional context distribution vector t_i as the input of the hyper LSTM to generate the weights of the main LSTM. Thus, the weights of the main LSTM are subject to changes in the context vector, and the main LSTM can predict POS tags based on different parameters. Ha et al. (2016) also proposed a HyperRNN network, in which the hyper net is influenced by the main net; this is inconsistent with our motivation. Therefore, different from (Ha et al., 2016), to make the parameters of the main LSTM fully controlled by the context distribution, we changed the architecture by cutting off the data path from the main LSTM to the hyper LSTM. This prevents the hidden states of the main LSTM from influencing the hyper LSTM. In addition, we add an extra layer that learns context representations based on features returned by a CNN.

Word and Context Representations
Out-of-vocabulary words are frequently used in Twitter. Moreover, new symbols, abbreviations, and words are constantly being created. These factors make learning word representations difficult. Thus, robust methods should be used to extract morphological and shape information from words.
Inspired by Santos and Zadrozny (2014), we adopt character-level convolutional neural networks (a CNN layer) to tackle this problem; they can take all of the characters of a word into consideration and output important orthographic features (Santos and Zadrozny, 2014). Suppose that we are given a sentence X = {w_1, w_2, . . .} over a vocabulary V of words. We use multiple filters with varying widths to obtain the orthographic feature vector c_i for word w_i. The orthographic feature vector c_i is then concatenated with the word embedding of w_i to form the word representation x_i, which is the input of the main LSTM. By utilizing a bi-LSTM to model sentences, the model can extract sequential relations and contextual information.
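As a concrete illustration, the character-level feature extraction can be sketched as follows. This is a NumPy sketch under assumptions of our own: the embedding table, filter shapes, and zero-padding scheme are illustrative, not the paper's exact configuration.

```python
import numpy as np

def char_cnn_features(word, char_emb, filters, widths):
    """Max-over-time convolution over character embeddings of one word.

    char_emb: dict mapping character -> embedding vector of dimension d_c
    filters:  list of weight matrices, one per width w, shape (n_f, w * d_c)
    Returns the concatenated max-pooled orthographic feature vector c_i.
    """
    d_c = len(next(iter(char_emb.values())))
    chars = list(word)
    feats = []
    for W, w in zip(filters, widths):
        # zero-pad so that every filter width fits at least one window;
        # unknown characters also map to the zero vector
        pad = max(0, w - len(chars))
        emb = [char_emb.get(c, np.zeros(d_c)) for c in chars] + [np.zeros(d_c)] * pad
        windows = [np.concatenate(emb[j:j + w]) for j in range(len(emb) - w + 1)]
        conv = np.stack([W @ win for win in windows])   # (time, n_f)
        feats.append(conv.max(axis=0))                  # max over time
    return np.concatenate(feats)
```

The resulting vector would then be concatenated with the word embedding to form x_i.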
The context is a fixed window of words around the target word. The context representations are learned by the MLP layer. To do this, we apply a fully connected neural network, which takes the sequential word representations in a fixed window as input and generates a low-dimensional vector t_i as follows:

t_i = softmax(MLP([x_{i-r} ⊕ · · · ⊕ x_i ⊕ · · · ⊕ x_{i+r}])), (1)

where [· ⊕ ·] represents the concatenation operation and r is the distance from the central word x_i to the edge of the window. MLP is a multilayer perceptron that transfers the context matrix to a low-dimensional vector; we apply the MLP to every window of contexts. softmax denotes the softmax function, which converts the context vector into a probability distribution. The goal is to learn an MLP layer that, given sequential word representations, estimates a distribution over contexts.
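Equation (1) can be sketched in NumPy as follows; the single hidden layer and the zero-padding at sentence boundaries are simplifying assumptions of this sketch, not details stated in the paper.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def context_vector(X, i, r, W1, b1, W2, b2):
    """t_i = softmax(MLP([x_{i-r} (+) ... (+) x_{i+r}])) over a window of 2r+1 words."""
    d = X[0].shape[0]
    # zero-pad positions that fall outside the sentence
    window = [X[j] if 0 <= j < len(X) else np.zeros(d)
              for j in range(i - r, i + r + 1)]
    h = np.tanh(W1 @ np.concatenate(window) + b1)   # hidden MLP layer
    return softmax(W2 @ h + b2)                     # context distribution t_i
```

Because of the softmax, t_i is a probability distribution over context styles, which is what the hyper LSTM consumes.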

Adaptive Weight Generation
Using identical weights at each time step limits the expressiveness of a recurrent neural network (RNN) (Ha et al., 2016). To overcome this limitation, our model uses a small network (the hyper network) that takes low-dimensional context representations as inputs to dynamically generate the parameters of a large network (the main network) for POS tagging. Different from (Ha et al., 2016), at every time step the hyper LSTM only takes the context representation t_i as input and generates the hidden state ĥ_i as output. This hidden state ĥ_i is used to generate the weights of the main LSTM at the same time step. The hyper LSTM and main LSTM are jointly trained with backpropagation and gradient descent. Next, we give a more formal description of the weight generation.
The hyper LSTM is a standard LSTM (Hochreiter and Schmidhuber, 1997), which takes context vectors as inputs and outputs hidden states. The hyper LSTM is defined as follows:

ŷ_i = Wŷt t_i + Wŷh ĥ_{i-1} + bŷ,  ŷ ∈ {ĝ, î, f̂, ô},
ĉ_i = σ(f̂_i) ∘ ĉ_{i-1} + σ(î_i) ∘ φ(ĝ_i),
ĥ_i = σ(ô_i) ∘ φ(ĉ_i),

where φ denotes the tanh function, σ is the sigmoid function, and ∘ and juxtaposition represent the Hadamard product and the matrix product, respectively. We write ŷ for any of {ĝ, î, f̂, ô}.
The hyper LSTM has Nĥ hidden units, and N_t is the dimensionality of t_i. Then, Wŷt ∈ R^{Nĥ×N_t}, Wŷh ∈ R^{Nĥ×Nĥ}, and bŷ ∈ R^{Nĥ} are the parameters of the hyper LSTM, and they stay fixed within one sentential sequence.
Inspired by (Ha et al., 2016), we adopt weight scaling vectors d, which are linear projections of ĥ_i and are used to linearly scale each row of the weight matrices in the standard LSTM. Because the context vector t_i is produced from a different context at each time step, the hidden state ĥ_i and the scaling vectors d_i change along with t_i. Thus, the main LSTM is modified as follows:

d_i^y = W_{ĥy} ĥ_i,
y_i = d_i^y ⊗ (W_{yh} h_{i-1}) + d_i^y ⊗ (W_{yx} x_i) + b_y,  y ∈ {g, i, f, o},
c_i = σ(f_i) ∘ c_{i-1} + σ(i_i) ∘ φ(g_i),
h_i = σ(o_i) ∘ φ(c_i),

where ⊗ represents the element-wise product with broadcasting and y is one of {g, i, f, o}. Generally, Nĥ and N_t are much smaller than N_h and N_x, respectively. Thus, the hyper LSTM has hundreds of times fewer parameters than the standard LSTM.
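One time step of this weight generation can be sketched in NumPy as follows. The per-gate projection matrices P[y] and the exact placement of the scaling are illustrative assumptions; Ha et al. (2016) describe several variants of this scheme.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dynamic_lstm_step(x, h_prev, c_prev, h_hat, P, W, U, b):
    """One main-LSTM step whose gate weights are row-scaled by the hyper state.

    For each gate y in {g, i, f, o}:
        d_y = P[y] @ h_hat                      # scaling vector from hyper LSTM
        y   = d_y * (W[y] @ x) + d_y * (U[y] @ h_prev) + b[y]
    Scaling d_y row-wise is equivalent to using the modified weight matrices
    diag(d_y) @ W[y] and diag(d_y) @ U[y].
    """
    pre = {}
    for y in ("g", "i", "f", "o"):
        d_y = P[y] @ h_hat
        pre[y] = d_y * (W[y] @ x) + d_y * (U[y] @ h_prev) + b[y]
    c = sigmoid(pre["f"]) * c_prev + sigmoid(pre["i"]) * np.tanh(pre["g"])
    h = sigmoid(pre["o"]) * np.tanh(c)
    return h, c
```

Feeding the same input with two different hyper states yields two different effective parameter sets, which is the mechanism the model exploits to treat formal and informal contexts differently.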
According to the above functions, if the model is given contexts with different styles, it generates different parameters for all of the gates in the main LSTM. The outputs of the main LSTM are then used to predict the POS tags of the central words with the cross-entropy loss:

loss = − Σ_i z_i · log(ẑ_i),

where z_i is the one-hot vector of the POS tag label corresponding to x_i, and ẑ_i is the output of the top softmax layer: ẑ_i = softmax(MLP(h_i)).
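The training objective can be sketched as follows; the single-hidden-layer output MLP is an assumption of this sketch, not a detail the paper fixes.

```python
import numpy as np

def tagging_loss(H, Z, Wm, bm, Ws, bs):
    """Cross-entropy loss = -sum_i z_i · log(z_hat_i), z_hat_i = softmax(MLP(h_i))."""
    loss = 0.0
    for h, z in zip(H, Z):
        s = np.tanh(Wm @ h + bm)        # MLP hidden layer
        logits = Ws @ s + bs
        p = np.exp(logits - logits.max())
        p /= p.sum()                    # softmax over the tagset
        loss -= float(z @ np.log(p))    # z is a one-hot tag vector
    return loss
```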

Experimental Setup
In this section, we will first detail the datasets we used. Then, we will describe several baseline methods, including a number of classic taggers and a series of deep learning sequence labeling methods.

Datasets
Following (Derczynski et al., 2013), we use RIT-Twitter (Ritter et al., 2011) as our main dataset. RIT-Twitter was split into training, development and evaluation sets (RIT-Train, RIT-Dev, RIT-Test) using the split described in (Derczynski et al., 2013). To verify the validity of our model, we also tested it on two more datasets, NPSChat (Forsythand and Martell, 2007) and ARK-Twitter (Gimpel et al., 2011), using their standard splits. The tagsets of RIT-Twitter and NPSChat are PTB-like, while that of ARK-Twitter is specific to it. To use the WSJ labeled data in the experiments on ARK-Twitter, we mapped the PTB tagset to the ARK tagset according to the PTB POS Tagging Guidelines (Santorini, 1990) and the ARK Guidelines. The mapping proceeded from fine to coarse. For pretraining the word embeddings, we constructed a dataset containing 30 million tweets collected from Twitter using its API. We introduced a newswire dataset containing 1173K tokens as the written-language dataset, namely the Wall Street Journal (WSJ) portion of the Penn Treebank v3 (Marcus et al., 1993). During training, we mixed each of RIT-Twitter, NPSChat and ARK-Twitter with WSJ to obtain three kinds of training data.
The detailed data statistics of the above datasets used in this work are listed in Table 1.
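The fine-to-coarse tag mapping mentioned above can be illustrated with a small excerpt. This is a partial, hypothetical sketch of our own; the full mapping in the experiments follows the two sets of guidelines and also covers the Twitter-specific tags.

```python
# Partial PTB -> ARK coarse-tag map (illustrative excerpt, not the full table).
PTB_TO_ARK = {
    "NN": "N", "NNS": "N",                    # common nouns
    "NNP": "^", "NNPS": "^",                  # proper nouns
    "VB": "V", "VBD": "V", "VBG": "V",
    "VBN": "V", "VBP": "V", "VBZ": "V",       # verbs
    "JJ": "A", "JJR": "A", "JJS": "A",        # adjectives
    "RB": "R", "RBR": "R", "RBS": "R",        # adverbs
    "DT": "D",                                # determiners
    "CD": "$",                                # numerals
}

def map_ptb_tags(tags):
    """Map a PTB tag sequence to ARK tags, fine to coarse."""
    return [PTB_TO_ARK[t] for t in tags]
```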

Competitor Methods
We applied several classic and state-of-the-art methods for comparison, together with a series of deep learning sequence labeling baselines, as follows. Stanford POS Tagger is a widely used part-of-speech tagger described in (Toutanova et al., 2003). It demonstrates that broad use of features and appropriate model regularization produce a superior level of performance (97.24%). In this work, we trained it on two different sets: sections 0-18 of the WSJ (Stanford-WSJ) and a mixed corpus of WSJ, IRC, and Twitter (Stanford-MIX).
T-POS (Ritter et al., 2011) adopts hierarchical clustering and Brown clustering methods to address the issue of OOV words and lexical variations. It also uses conditional random fields and other standard sets of features to perform the task. In this work, we trained it using three different sets: the WSJ (T-POS-WSJ), RIT-Train (T-POS-RIT) and a mixed corpus of WSJ, IRC, and RIT-Train (T-POS-MIX).
GATE Tagger (Derczynski et al., 2013) combines the available taggers for different tagsets. It adopts a vote-constrained bootstrapping method with unlabeled data and assigns prior probabilities to handle unknown words and slang.
ARK Tagger (Owoputi et al., 2013) is a system that reports the best accuracy on ARK-Twitter. It uses unsupervised word clustering and a variety of lexical features.
TPANN (Gui et al., 2017) applies adversarial networks and an autoencoder to model labeled out-of-domain data, unlabeled in-domain data and labeled in-domain data, and achieved the best performance on RIT-Twitter.
Bidirectional LSTM (Bi-LSTM) (Wang et al., 2015) has been widely used in a variety of sequence labeling tasks. In this work, we also evaluated it as a baseline.
Bi-HyperLSTM (Ha et al., 2016) was used as a substitute for the standard Bi-LSTM. It differs from the proposed model in that our model uses the context distribution to generate the parameters of the main LSTM.

Initialization and Hyperparameter
The word embeddings for all the models were initialized with the word2vec tool (Mikolov et al., 2013) on 30 million tweets. The other parameters excluding the word embeddings, such as the parameters in LSTM and MLP, were initialized by randomly sampling from a uniform distribution in [-0.05, 0.05].
The dimensionality of the word embeddings was set at 200, and the dimensionality of the randomly initialized character embeddings at 25. We adopted a hyper LSTM with 160 hidden neurons to produce the weights of each gate of the main LSTM, which has 250 hidden neurons. The dimensionality of the context vector was set at 10.
Our DCNN can be trained end-to-end with backpropagation; gradient-based optimization was performed using the Adam update rule (Kingma and Ba, 2014) with a learning rate of 0.0001.
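For reference, a single Adam update with the learning rate above can be sketched as follows (NumPy; variable names are illustrative):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma and Ba, 2014) with the paper's learning rate 0.0001."""
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```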

Results and Analysis
In this section, we will report the experimental results and a detailed analysis of the results for the three different datasets.

Evaluation on RIT-Twitter
RIT-Twitter was introduced in (Ritter et al., 2011). This dataset uses a tagset based on the Penn Treebank tagset with several Twitter-specific tags: retweets, @usernames, hashtags, and URLs. Table 2 lists the results of our method compared with other methods on this dataset. The first part shows the results of the classic methods. From the result of Stanford-WSJ, we can see that although it achieves a superior level of performance (97.24%) on the WSJ dataset, its accuracy drops significantly, to 73.37%, when it is applied to the Twitter dataset. If we add some in-domain data to the training set, Stanford-MIX improves by roughly 10% over Stanford-WSJ. The same phenomenon can be observed for the T-POS tagger. With more features, such as clustering, bootstrapping and lexical features, the T-POS, GATE and ARK taggers achieve better performance. Although TPANN achieves an accuracy of 90.92%, it incorporates an additional large amount of in-domain unlabeled data. Our method is more competitive because it uses much less data.

Table 2: Tagging accuracy on RIT-Twitter (excerpt):

Method | RIT-Test | RIT-Dev
Stanford-WSJ (Toutanova et al., 2003) | 73.37% | -
Stanford-MIX | 83.14% | -
T-POS-WSJ (Ritter et al., 2011) | 81.30% | -
T-POS-RIT | 84.55% | 84.83%
T-POS-MIX | 88.30% | -
GATE Tagger (Derczynski et al., 2013) | 88.69% | 89.37%
ARK Tagger (Owoputi et al., 2013) | 90.40% | -
TPANN (Gui et al., 2017) | 90.92% | -

The second part shows the results of the deep learning methods trained on the RIT-Train dataset. We can see that when the sequence labeling methods are trained only on RIT-Train, their accuracies already exceed those of most conventional taggers. Thus, the deep learning methods are competitive and avoid feature engineering. Compared with the other models, the DCNN achieves the best performance among the models trained only on RIT-Train.
The third part shows the results of the deep learning methods trained on the mixed dataset of the RIT-Train and WSJ. As observed, when we added the WSJ data to train the models, all of them could obtain different degrees of improvement. Moreover, our model could make better use of the out-of-domain data and obtained the best result. Compared with the ARK tagger, which achieved the previous best result in conventional methods, our model was almost 0.78% better. The error reduction rate was more than 8%. Our model also outperformed the TPANN, which incorporated additional unlabeled in-domain data.
From the perspective of utilizing a low-dimensional context vector, we provided the same information (word information and context information) to all of the deep learning models shown in Table 2. However, except for the DCNN, the models were incapable of exploiting the context information, and most of them obtained no obvious improvement. In contrast, our DCNN makes better use of the context information to generate more appropriate parameters for POS tagging. Next, we analyze how the DCNN changes its parameters when encountering different context vectors.
Intuitively, contexts with different language expression styles should be transformed into different vectors. Figure 3 visualizes the context distributions. Subfigure (a) shows the context vectors extracted from the WSJ; we can see that the formal expressions are mainly concentrated in the middle four dimensions. The same phenomenon can be observed in subfigure (b), where the formal expressions in Twitter are concentrated in the same dimensions while the informal expressions are concentrated in another three dimensions. Notice that in our experimental setup the dimensionality of the context vector is set to 10, yet the values of the last three dimensions are close to zero. Consequently, we set this hyperparameter to 7 and achieved a higher accuracy of 91.27%.

Figure 4: Change of the main LSTM weights, visualized as |ĥ_i − ĥ_j|, where |·| denotes the absolute value and i and j are the sequence numbers of sentences. If two sentences have a similar expression style, the absolute value is close to zero, shown in white; we write i ∼ j for similar styles and i ≁ j for different styles. We only visualize the weights at the last time step.

Figure 4 shows how the weight matrix W_oh of the output gate changes when the model receives different kinds of contexts. Comparing sentences I, II and III, we find that although sentences II and III are both from Twitter while sentence I is from the WSJ, when the style of sentence II is close to that of sentence I, the model produces similar weight values. When the style of sentence III differs from that of sentence I, the model produces different parameters that are more suitable for Twitter-specific sentences.

Evaluation on NPSChat
The NPSChat Corpus (Forsythand and Martell, 2007) is a PTB-POS-annotated dataset of Internet Relay Chat (IRC) room messages from 2006. The corpus consists of 10,567 posts out of approximately 500,000 posts gathered from various online chat services in accordance with their terms of service. The authors of the corpus made several annotation decisions that were unique to the chat domain, regarding some abbreviations, emoticons and misspelled words. For example, LOL and :-) were frequently encountered in the chat messages; because these expressions convey emotion, they were treated as individual tokens and tagged as interjections (UH). Table 3 lists the results of different taggers evaluated on NPSChat. Our method was tested using the same setup as the experiments in (Forsyth, 2007): the training part contained 90% of the data, and the testing part contained the remaining 10%. From the results, we can see that our method achieved the best accuracy (94.0%), significantly better than the 90.8% of (Forsyth, 2007), whose tagger was trained on a mix of several corpora tagged with the Penn Treebank tagset. Our method also outperformed

Evaluation on ARK-Twitter
ARK-Twitter, which contains 34K tokens, uses a novel tagset. The training set (OCT27), provided in (Gimpel et al., 2011), is a dataset of POS-tagged tweets sampled almost entirely from one particular day (October 27, 2010). The test set (DAILY547), introduced in (Owoputi et al., 2013), contains 574 tweets, consisting of one random English tweet from every day between January 1, 2011 and June 30, 2012. Thus, the distributions of the training set and test set may differ slightly; for example, a substantial fraction of the messages in the training data are about a basketball game that occurred on that day.
The results of the ARK tagger and TweetNLP Tagger in Table 4 are reported in (Owoputi et al., 2013).
We can see that our method significantly outperforms the TweetNLP Tagger but is worse than the ARK tagger. By analyzing the errors, we found that 20.3% of them occur between nouns and proper nouns. Because our model does not incorporate any knowledge of proper nouns, it has difficulty recognizing them. As reported in (Owoputi et al., 2013), if the ARK tagger does not use tag dictionary features and name list features, its performance drops to 92.38%, which is lower than that of the DCNN. Thus, our model remains competitive when such knowledge is lacking.

Related Work
Very early on, Schmidhuber (1992) began to explore the concept of fast weights, in which one network produces context-dependent weight changes for a second network (Schmidhuber, 1992, 1993); he also showed the theoretical possibility of a recurrent version. Recently, numerous studies have been conducted in this field (Moczulski et al., 2015; Fernando et al., 2016). De Brabandere et al. (2016) introduced a new framework called the dynamic filter network, in which the filters of a CNN are generated dynamically. Ha et al. (2016) explored the use of this approach in recurrent networks. Our work uses a different mechanism to generate parameters, making the parameters subject to changes in the context representations: we cut off the data path from the main LSTM to the hyper LSTM, which prevents the hidden states of the main LSTM from influencing the hyper LSTM.
Recently, deep learning has achieved promising results on POS tagging. Santos and Zadrozny (2014) used a CNN to construct a character-based model for English (PTB) and Portuguese. Wang et al. (2015) used the bi-LSTM on WSJ and reported a state-of-the-art performance. However, because of a lack of training data and an unconstrained writing style, these models encountered resistance in the implementation process on Twitter. In this work, we focused on the linguistic correlation between Twitter and newswire and took the linguistic characteristics into consideration. To selectively utilize out-of-domain data, we used a low-dimensional context vector to generate different parameters for text with different expression styles and obtained better results.

Conclusion
In this work, we study the problem of incorporating labeled newswire texts into Twitter POS tagging. From a linguistic perspective, we find that Twitter users not only tend to mimic the formal expressions of traditional media, like news, but also appear to be developing linguistically informal styles. Hence, we argue that labeled newswire data should be used selectively to help tag contextual segments of tweets. To achieve this, we introduce a novel deep neural network architecture that dynamically generates different parameters based on different expression styles for POS tagging. To evaluate the proposed method, we compare it with previous state-of-the-art methods on three different datasets. Experimental results demonstrate that the proposed method achieves better performance in most cases. We also visualize some of the learned parameters to illustrate the motivation of this work.