Neural Network for Heterogeneous Annotations

Multiple treebanks annotated under heterogeneous standards give rise to the research question of how best to utilize multiple resources for improving statistical models. Prior research has focused on discrete models, leveraging stacking and multi-view learning to address the problem. In this paper, we empirically investigate heterogeneous annotations using neural network models, building neural network counterparts to discrete stacking and multi-view learning, respectively, and finding that neural models have unique advantages thanks to their freedom from manual feature engineering. The neural model achieves not only larger accuracy improvements, but also an order of magnitude faster speed compared to its discrete baseline, adding little time cost compared to a neural model trained on a single treebank.


Introduction
For many languages, multiple treebanks have been annotated according to different guidelines. For example, several linguistic theories have been used for defining English dependency treebanks, including Yamada and Matsumoto (2003), LTH (Johansson and Nugues, 2007) and Stanford dependencies (De Marneffe et al., 2006). For German, there exist TIGER (Brants et al., 2002) and TüBa-D/Z (Telljohann et al., 2006). For Chinese, treebanks have been made available under various segmentation granularities (Sproat and Emerson, 2003; Emerson, 2005; Xue, 2003). These give rise to the research problem of effectively making use of multiple treebanks under heterogeneous annotations for improving output accuracies (Jiang et al., 2015; Johansson, 2013; Li et al., 2015).
The task has been tackled using two typical approaches. The first is based on stacking (Wolpert, 1992; Breiman, 1996; Wu et al., 2003). As shown in Figure 1(a), the main idea is to train a model on a source treebank, which is then used to guide a target treebank model by offering source-style features. This method has been used to leverage two different treebanks for word segmentation (Jiang et al., 2009; Sun and Wan, 2012) and dependency parsing (Nivre and McDonald, 2008; Johansson, 2013).
The second approach is based on multi-view learning (Johansson, 2013; Li et al., 2015). The idea is to address both annotation styles simultaneously by sharing common feature representations. In particular, Johansson (2013) trained dependency parsers using the domain adaptation method of Daumé III (2007), keeping a copy of shared features and a separate copy of features for each treebank. Li et al. (2015) trained POS taggers by coupling the labelsets of two different treebanks into a single combined labelset. A summary of such multi-view methods is shown in Figure 1(b), which demonstrates their main differences from stacking (Figure 1(a)).
Recently, neural network models have gained increasing research attention, with highly competitive results reported for numerous NLP tasks, including word segmentation (Zheng et al., 2013; Pei et al., 2014), POS-tagging (Ma et al., 2014; Plank et al., 2016), and parsing (Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015). On the other hand, the aforementioned methods for heterogeneous annotations have been investigated mainly for discrete models. It remains an interesting research question how effectively multiple treebanks can be utilized by neural NLP models, and we aim to investigate this empirically. We follow Li et al. (2015), taking POS-tagging as a case study, using the methods of Jiang et al. (2009) and Li et al. (2015) as the discrete stacking and multi-view training baselines, respectively, and building neural network counterparts to their models for empirical comparison. The base tagger is a neural CRF model (Lample et al., 2016), which gives accuracies competitive with discrete CRF taggers.
Results show that neural stacking allows deeper integration of the source model beyond one-best outputs, and further allows the source model to be fine-tuned during target model training. In addition, the advantages of neural multi-view learning over its discrete counterpart are manifold. First, it is free from the need for manual cross-labelset interactive feature engineering, which is far from trivial for representing annotation correspondence (Li et al., 2015). Second, compared to the discrete model, parameter sharing in a deep neural network eliminates the exponential growth of the search space, and allows each label type to be modeled separately, in the same way as multi-task learning (Collobert et al., 2011). Our neural multi-view learning model achieves not only larger accuracy improvements, but also an order of magnitude faster speed compared to its discrete baseline, adding little time cost compared to a neural model trained on a single treebank.
The C++ implementations of our neural network stacking and multi-view learning models are available under GPL, at https://github.com/chenhongshen/NNHetSeq.

Baseline Neural Network Tagger
We adopt a neural CRF with a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) feature layer as the baseline POS tagger. As shown in Figure 2, the model consists of three main neural layers: the input layer calculates dense representations of input words using an attention model over character embeddings; the feature layer employs a bi-directional LSTM to extract non-local features from the input vectors; the output layer uses a CRF structure to infer the most likely label for each input word.

Input Layer
Given a sentence $w_{1:n}$, the input layer builds a vector representation $\vec{r}^{\,w}_i$ for each word $w_i$ based on both word and character embeddings. In particular, an embedding lookup table is used to convert vocabulary words into their embedding forms directly. To obtain a character-based embedding of $w_i$, we denote the character sequence of $w_i$ by $c_{1:m}$, where $c_j$ is the $j$th character of $w_i$. A character lookup table is used to map each character $c_j$ into a character embedding $\vec{e}^{\,c}_j$. The character embeddings $\vec{e}^{\,c}_1, \vec{e}^{\,c}_2, \ldots, \vec{e}^{\,c}_m$ are combined using an attention model (Bahdanau et al., 2015):

$$\vec{w}^{\,c}_i = \sum_{j=1}^{m} a^{c}_j \odot \vec{e}^{\,c}_j,$$

where $a^{c}_j$ is the weight for $\vec{e}^{\,c}_j$, $\odot$ is the Hadamard product, and $\sum_{j=1}^{m} a^{c}_j = 1$. Each $a^{c}_j$ is computed according to both the word embedding vector and a 5-character embedding window with the current character $\vec{e}^{\,c}_j$ in the middle:

$$\vec{t}_j = \tanh\big(W_t(\vec{e}^{\,c}_{j-2} \oplus \vec{e}^{\,c}_{j-1} \oplus \vec{e}^{\,c}_{j} \oplus \vec{e}^{\,c}_{j+1} \oplus \vec{e}^{\,c}_{j+2}) + U_t\,\vec{e}^{\,w}_i + \vec{b}_t\big),$$
$$a^{c}_j = \frac{\exp(W_c\,\vec{t}_j + \vec{b}_c)}{\sum_{k=1}^{m}\exp(W_c\,\vec{t}_k + \vec{b}_c)}.$$

Here $\oplus$ denotes vector concatenation and $\vec{e}^{\,w}_i$ is the embedding of the current word $w_i$. $W_t$, $U_t$, $W_c$ and $\vec{b}_t$, $\vec{b}_c$ are model parameters. Finally, $\vec{w}^{\,c}_i$ is concatenated with the word embedding to form the final word representation $\vec{r}^{\,w}_i$:

$$\vec{r}^{\,w}_i = \vec{e}^{\,w}_i \oplus \vec{w}^{\,c}_i.$$
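For concreteness, the following is a minimal numpy sketch of this attention-based character composition. It is an illustration, not the released C++ implementation: the exact scoring layout with $W_t$, $U_t$, $W_c$, $\vec{b}_t$, $\vec{b}_c$ mirrors the reconstruction above, and treating each attention weight as a scalar per character is an assumption.

```python
import numpy as np

def char_attention(word_emb, char_embs, W_t, U_t, W_c, b_t, b_c):
    """Compose a word representation from character embeddings with attention.

    word_emb:  (d_w,)   embedding e_i^w of the current word
    char_embs: (m, d_c) embeddings e_j^c of the word's m characters
    The 5-character window around each character is zero-padded at the edges.
    """
    m, d_c = char_embs.shape
    padded = np.vstack([np.zeros((2, d_c)), char_embs, np.zeros((2, d_c))])
    scores = []
    for j in range(m):
        window = padded[j:j + 5].reshape(-1)              # e_{j-2} ... e_{j+2}
        t_j = np.tanh(W_t @ window + U_t @ word_emb + b_t)
        scores.append(W_c @ t_j + b_c)                    # scalar attention score
    scores = np.array(scores)
    a = np.exp(scores - scores.max())
    a /= a.sum()                                          # weights a_j^c, sum to 1
    w_c = (a[:, None] * char_embs).sum(axis=0)            # weighted combination
    return np.concatenate([word_emb, w_c])                # r_i^w = e_i^w (+) w_i^c
```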

Feature Layer
Recently, bi-directional LSTMs have been successfully applied to various NLP tasks (Zhou and Xu, 2015; Klerke et al., 2016; Plank et al., 2016). The feature layer uses a bi-directional LSTM to extract a feature vector $\vec{h}_i$ for each word $w_i$. An input vector $\vec{x}_i$ (here $\vec{x}_i = \vec{r}^{\,w}_i$) is used to represent each word $w_i$. We use an LSTM variation with peephole connections (Graves and Schmidhuber, 2005) to extract features based on $\vec{x}_{1:n}$. The model computes a hidden vector $\vec{h}_i$ for each input $\vec{x}_i$, passing information from $\vec{h}_1, \ldots, \vec{h}_{i-1}$ to $\vec{h}_n$ via a sequence of cell states $\vec{c}_1, \vec{c}_2, \ldots, \vec{c}_n$. Information flow is controlled using an input gate $\vec{g}_i$, a forget gate $\vec{f}_i$, and an output gate $\vec{o}_i$:

$$\vec{g}_i = \sigma\big(W_g\vec{x}_i + U_g\vec{h}_{i-1} + V_g\vec{c}_{i-1} + \vec{b}_g\big),$$
$$\vec{f}_i = \sigma\big(W_f\vec{x}_i + U_f\vec{h}_{i-1} + V_f\vec{c}_{i-1} + \vec{b}_f\big),$$
$$\vec{c}_i = \vec{f}_i \odot \vec{c}_{i-1} + \vec{g}_i \odot \tanh\big(W_u\vec{x}_i + U_u\vec{h}_{i-1} + \vec{b}_u\big),$$
$$\vec{o}_i = \sigma\big(W_o\vec{x}_i + U_o\vec{h}_{i-1} + V_o\vec{c}_i + \vec{b}_o\big),$$
$$\vec{h}_i = \vec{o}_i \odot \tanh(\vec{c}_i),$$

where $\sigma$ denotes the component-wise sigmoid function and the $W$, $U$, $V$ matrices and $\vec{b}$ vectors are model parameters.
A bi-directional extension of the above LSTM structure is applied in both the left-to-right and the right-to-left directions, resulting in two hidden vector sequences $\overrightarrow{h}_{1:n}$ and $\overleftarrow{h}_{1:n}$. The final feature vector for each word is the concatenation $\vec{h}_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i$.
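As an illustration, a minimal numpy sketch of such a peephole bi-directional LSTM is given below. The stacked parameter layout (one weight matrix over the concatenation of $\vec{x}_i$ and $\vec{h}_{i-1}$, diagonal peephole vectors) is an implementation convenience of the sketch, not a claim about the released code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, P):
    """One peephole LSTM step over input x and previous state (h_prev, c_prev)."""
    z = np.concatenate([x, h_prev])
    g = sigmoid(P['Wg'] @ z + P['Vg'] * c_prev + P['bg'])   # input gate, peephole on c_{i-1}
    f = sigmoid(P['Wf'] @ z + P['Vf'] * c_prev + P['bf'])   # forget gate, peephole on c_{i-1}
    c = f * c_prev + g * np.tanh(P['Wu'] @ z + P['bu'])     # new cell state
    o = sigmoid(P['Wo'] @ z + P['Vo'] * c + P['bo'])        # output gate, peephole on c_i
    return o * np.tanh(c), c

def bilstm(xs, P_fwd, P_bwd, d_h):
    """Run the LSTM in both directions and concatenate the hidden vectors per word."""
    def run(seq, P):
        h, c, out = np.zeros(d_h), np.zeros(d_h), []
        for x in seq:
            h, c = lstm_step(x, h, c, P)
            out.append(h)
        return out
    fwd = run(xs, P_fwd)
    bwd = run(xs[::-1], P_bwd)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```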

Output Layer
The output layer employs a conditional random field (CRF) to infer the POS tag $t_i$ of each word $w_i$ based on the feature layer outputs. The conditional probability of a tag sequence $\vec{y}$ given an input sentence $\vec{x}$ is

$$P(\vec{y} \mid \vec{x}) = \frac{\prod_{i=1}^{n} \Psi(\vec{x}, \vec{y}_i)\,\Phi(\vec{x}, \vec{y}_i, \vec{y}_{i-1})}{Z(\vec{x})},$$

where $Z(\vec{x})$ is the partition function

$$Z(\vec{x}) = \sum_{\vec{y}'} \prod_{i=1}^{n} \Psi(\vec{x}, \vec{y}'_i)\,\Phi(\vec{x}, \vec{y}'_i, \vec{y}'_{i-1}).$$

In particular, the output clique potential $\Psi(\vec{x}, \vec{y}_i)$ reflects the correlation between inputs and output labels, $\Psi(\vec{x}, \vec{y}_i) = \exp(\vec{s}_i)$, with the emission vector $\vec{s}_i$ defined as

$$\vec{s}_i = \vec{\theta}_0\,\vec{h}_i, \qquad (1)$$

where $\vec{\theta}_0$ is a model parameter. The edge clique potential $\Phi(\vec{x}, \vec{y}_i, \vec{y}_{i-1})$ reflects the correlation between consecutive output labels using a single transition weight $\tau(\vec{y}_i, \vec{y}_{i-1})$.
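The following is a small numpy sketch of this output layer, computing $\log P(\vec{y}\mid\vec{x})$ from the emission scores of Eq. (1) and a transition matrix for $\tau$. It is an illustrative reimplementation under these assumptions rather than the paper's C++ code.

```python
import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

def crf_log_prob(H, tags, theta0, tau):
    """Log P(y|x) for the linear-chain CRF output layer.

    H:      (n, d)  BiLSTM feature vectors h_i
    tags:   (n,)    tag indices of the sequence y
    theta0: (T, d)  emission parameters, s_i = theta0 @ h_i  (Eq. 1)
    tau:    (T, T)  transition weights tau(y_i, y_{i-1})
    """
    S = H @ theta0.T                                   # (n, T) emission scores
    n, T = S.shape
    score = S[0, tags[0]]                              # score of the given sequence
    for i in range(1, n):
        score += S[i, tags[i]] + tau[tags[i], tags[i - 1]]
    alpha = S[0]                                       # forward algorithm in log space
    for i in range(1, n):
        alpha = S[i] + logsumexp(alpha[None, :] + tau, axis=1)
    return score - logsumexp(alpha, axis=0)            # log P(y|x) = score - log Z(x)
```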

Stacking

Discrete Stacking
Stacking integrates corpora A and B by first training a tagger on corpus A, and then using the A tagger to provide additional features to a corpus B model. Figure 1(a) shows the training and testing of discrete stacking models, where the B tagger extracts features from both the raw sentence and the A tagger output. This method achieves feature combination at the one-best-output level.

Neural Stacking
Figure 3(a) and (b) show the two neural stacking methods of this paper, respectively.

Shallow Integration. Figure 3(a) is a variation of discrete stacking, with the output tags from tagger A converted to low-dimensional dense embedding features and concatenated to the word embedding inputs of tagger B. Formally, for each word $w_i$, denoting the tagger A output as $t^a_i$, we concatenate its embedding $\vec{e}^{\,a}_i$ to the word representation $\vec{r}^{\,w}_i$.
Deep Integration. Figure 3(b) offers deeper integration between the A and B models, which is feasible only with neural network features. We call this method feature-level stacking. For feature-level integration, the emission vector $\vec{s}_i$ of Eq. (1) from tagger A is projected and taken as additional input to tagger B:

$$\vec{w}^{\,a}_i = W_s\,\vec{s}_i,$$

where $W_s$ is a model parameter and $\vec{w}^{\,a}_i$ is concatenated to the word representation of tagger B.
Fine-tuning. Feature-level stacking further allows tagger A to be fine-tuned during the training of tagger B, with the loss being back-propagated to tagger A via the $\vec{w}^{\,a}_i$ layer (shown as the red dotted lines in Figure 3(b)). This is a further benefit of neural stacking compared with discrete stacking.
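A minimal numpy sketch of the two integration schemes follows. It is illustrative only; the argument names (`t_a`, `s_a`, `W_s`, `tag_emb`) are introduced here for exposition and do not correspond to identifiers in the released code.

```python
import numpy as np

def stacked_input(r_w, t_a, s_a, W_s, tag_emb, mode='deep'):
    """Build tagger B's input vector for one word under the two stacking schemes.

    r_w:     tagger B's word representation (e_i^w concatenated with w_i^c)
    t_a:     one-best tag index predicted by tagger A       (shallow integration)
    s_a:     emission vector s_i produced by tagger A       (deep integration)
    W_s:     projection matrix for feature-level stacking
    tag_emb: embedding table over tagger A's tagset
    """
    if mode == 'shallow':
        # One-best integration: concatenate the embedding of tagger A's output tag.
        return np.concatenate([r_w, tag_emb[t_a]])
    # Feature-level integration: project tagger A's emission vector; because the
    # projection is differentiable, tagger B's loss can be back-propagated into
    # tagger A, enabling fine-tuning.
    return np.concatenate([r_w, W_s @ s_a])
```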

Discrete Label Coupling
As shown in Figure 1(b), multi-view learning (Li et al., 2015) utilizes corpus A and corpus B simultaneously for training. The coupled tagger directly learns the correspondences between the two annotation schemes, and can therefore make more comprehensive use of corpus A compared with stacking. In order to better capture such correlations, specifically designed feature templates between the two tag sets are essential.
For each training instance, both A-style and B-style labels are needed; however, each corpus is annotated with only one type of tag. Li et al. (2015) used a mapping function to supplement the missing annotations with the help of the annotated tag. The result is a set of sentences with bundled tags in both annotations, but with ambiguities on one side due to one-to-many mappings. Li et al. (2015) showed that speed can be significantly improved by manually restricting the possible mappings between the labelsets, but a full mapping without restriction yields the highest accuracies.
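As an illustration of this mapping step, the sketch below bundles each annotated tag with all tags a mapping function allows on the other side; the dictionaries `mapping_a_to_b` and `mapping_b_to_a` are hypothetical placeholders standing in for the actual mapping of Li et al. (2015).

```python
def bundle_tags(sentence_tags, side, mapping_a_to_b, mapping_b_to_a):
    """Supplement the missing annotation with every tag the mapping function allows.

    sentence_tags:  gold tags of the sentence in its annotated style ('A' or 'B')
    mapping_a_to_b: dict mapping an A tag to the set of B tags it may correspond to
    mapping_b_to_a: dict mapping a B tag to the set of A tags it may correspond to
    Returns, per token, the set of bundled (a, b) tag pairs the coupled tagger may use.
    """
    bundled = []
    for t in sentence_tags:
        if side == 'A':
            bundled.append({(t, b) for b in mapping_a_to_b[t]})
        else:
            bundled.append({(a, t) for a in mapping_b_to_a[t]})
    return bundled
```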

Neural Multi-task Learning
Neural multi-task learning is free from manual feature engineering, and avoids manual mapping functions between tag sets by establishing two separate output layers, one for each type of label, with shared low-level parameters. The general structure of the neural multi-view model is shown in Figure 4, which can be regarded as a variation of the parameter-sharing models of Caruana (1993) and Collobert et al. (2011). Leveraging heterogeneous annotations for the same task, compared to parameter sharing between different NLP tasks (Collobert et al., 2011), can benefit from tighter integration of information, and hence allows deeper parameter sharing. These points are verified empirically in our experiments.
In training and testing, sentences from both corpora go through the same input layer and feature layer. The outputs for each tag type are then computed separately on top of the shared representations. The conditional probability of a tag sequence given an input sentence and its corpus type is

$$P(\vec{y} \mid \vec{x}, T) = \frac{\prod_{i=1}^{n} \Psi_T(\vec{x}, \vec{y}_i)\,\Phi_T(\vec{x}, \vec{y}_i, \vec{y}_{i-1})}{Z_T(\vec{x})},$$

where $T$ is the corpus type, $T \in \{A, B\}$, and $\Psi_T(\vec{x}, \vec{y}_i)$ and $\Phi_T(\vec{x}, \vec{y}_i, \vec{y}_{i-1})$ are the corresponding output clique potential and edge clique potential, respectively. $Z_T(\vec{x})$ is the partition function

$$Z_T(\vec{x}) = \sum_{\vec{y}'} \prod_{i=1}^{n} \Psi_T(\vec{x}, \vec{y}'_i)\,\Phi_T(\vec{x}, \vec{y}'_i, \vec{y}'_{i-1}).$$

This indicates that only one output layer is activated at a time, according to the corpus type of the input sentence.
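A compact sketch of this shared-parameter arrangement is shown below, reusing `crf_log_prob` from the output-layer sketch above. The class and its `encode` argument are illustrative stand-ins for the shared input and BiLSTM layers, not names from the released implementation.

```python
import numpy as np

class MultiViewTagger:
    """Sketch of the shared-parameter multi-view tagger.

    `encode` stands in for the shared input layer and BiLSTM feature layer, and is
    assumed to return one feature vector of size 2*d_h per word; `crf_log_prob` is
    the function from the output-layer sketch above.
    """
    def __init__(self, d_h, n_tags_a, n_tags_b, encode):
        self.encode = encode                                  # shared parameters (both corpora)
        self.out = {                                          # separate output layer per corpus
            'A': {'theta0': 0.01 * np.random.randn(n_tags_a, 2 * d_h),
                  'tau':    np.zeros((n_tags_a, n_tags_a))},
            'B': {'theta0': 0.01 * np.random.randn(n_tags_b, 2 * d_h),
                  'tau':    np.zeros((n_tags_b, n_tags_b))},
        }

    def log_prob(self, sentence, tags, corpus_type):
        """Log P(y | x, T): only the output layer of corpus type T is activated."""
        H = np.vstack(self.encode(sentence))                  # shared features
        p = self.out[corpus_type]
        return crf_log_prob(H, tags, p['theta0'], p['tau'])
```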

Training
A max-margin objective is used to train the full set of model parameters $\Theta$:

$$L(\Theta) = \frac{1}{D}\sum_{d=1}^{D} l(\vec{x}_d, \vec{y}_d, \Theta) + \frac{\lambda}{2}\,\lVert\Theta\rVert^2,$$

where $\{(\vec{x}_d, \vec{y}_d)\}_{d=1}^{D}$ are the training examples, $\lambda$ is a regularization parameter, and $l(\vec{x}_d, \vec{y}_d, \Theta)$ is the max-margin loss for one example $(\vec{x}_d, \vec{y}_d)$.

The max-margin loss function is defined as

$$l(\vec{x}_d, \vec{y}_d, \Theta) = \max_{\vec{y}}\big(s(\vec{y}\mid\vec{x}_d) + \delta(\vec{y}, \vec{y}_d)\big) - s(\vec{y}_d\mid\vec{x}_d),$$

where $\vec{y}$ ranges over possible model outputs, $s(\vec{y}\mid\vec{x}) = \log P(\vec{y}\mid\vec{x})$ is the log probability of $\vec{y}$, and $\delta(\vec{y}, \vec{y}_d)$ is the Hamming distance between $\vec{y}$ and $\vec{y}_d$.

Algorithm 1: Neural multi-view training. Input: two training datasets.
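Since the partition function $Z(\vec{x})$ cancels in the difference above, the loss can be computed with cost-augmented Viterbi decoding over raw scores. The following numpy sketch works under that observation and is illustrative only, not the released implementation.

```python
import numpy as np

def max_margin_loss(S, tau, gold):
    """Structured hinge loss with Hamming cost, via cost-augmented Viterbi.

    S:    (n, T) emission scores;  tau: (T, T) transitions;  gold: (n,) gold tags.
    """
    n, T = S.shape
    S_aug = S + 1.0                       # add delta: +1 at every position ...
    S_aug[np.arange(n), gold] -= 1.0      # ... except for the gold tag (Hamming cost)
    best = S_aug[0]                       # Viterbi over the cost-augmented scores
    back = []
    for i in range(1, n):
        cand = best[None, :] + tau        # cand[y_i, y_{i-1}]
        back.append(cand.argmax(axis=1))
        best = S_aug[i] + cand.max(axis=1)
    y = [int(best.argmax())]
    for bp in reversed(back):
        y.append(int(bp[y[-1]]))
    y = y[::-1]                           # cost-augmented best sequence

    def score(seq, scores):
        return scores[0, seq[0]] + sum(scores[i, seq[i]] + tau[seq[i], seq[i - 1]]
                                       for i in range(1, n))
    # max_y (score(y) + delta(y, gold)) - score(gold); always >= 0 since gold is a candidate
    return max(0.0, score(y, S_aug) - score(gold, S))
```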
We adopt online learning, updating parameters using AdaGrad (Duchi et al., 2011). To train the neural stacking model, we first train a base tagger on corpus A. Then we train the stacked tagger on corpus B, where the tagger A parameters have been pretrained on corpus A and the tagger B parameters are randomly initialized.
For the neural multi-view model, we follow Li et al. (2015) and take the corpus-weighting strategy, sampling a number of training instances from both corpora for each training iteration, as shown in Algorithm 1. At each epoch, we randomly sample from the two datasets according to a corpus weight ratio, namely the ratio between the numbers of sentences from each dataset used for training, to form the training set for that epoch.
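Our reading of this corpus-weighting step, as a short sketch; the function name and the exact sampling details are assumptions based on the description above.

```python
import random

def sample_epoch(corpus_ctb, corpus_pd, ratio=1.0):
    """Form one epoch's training set under the corpus-weighting strategy.

    All CTB sentences are used; about ratio * |CTB| PD sentences are sampled at
    random (ratio 1.0 corresponds to the 1:1 setting), then everything is shuffled.
    """
    n_pd = min(len(corpus_pd), round(ratio * len(corpus_ctb)))
    epoch = list(corpus_ctb) + random.sample(list(corpus_pd), n_pd)
    random.shuffle(epoch)
    return epoch
```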

Experimental Settings
We adopt the Penn Chinese Treebank version 5.0 (CTB5) (Xue et al., 2005) as our main corpus, with the standard data split following previous work (Zhang and Clark, 2008; Li et al., 2015). People's Daily (PD) is used as the second corpus, annotated under a different scheme. We filter out PD sentences longer than 200 words. Details of the datasets are listed in Table 1. The standard token-wise POS tagging accuracy is used as the evaluation metric. The systems are implemented with LibN3L.
For all the neural models, we set the hidden layer size to 100, the initial learning rate for AdaGrad to 0.01, and the regularization parameter $\lambda$ to $10^{-8}$. word2vec is used to pretrain word embeddings. The Chinese Gigaword corpus version 5 (Graff and Chen, 2005), segmented by zpar (Zhang and Clark, 2011), is used as the training corpus for word embeddings. The word embedding size is 50.

Development Experiments
We use the development dataset for two main purposes. First, under each setting, we tune the model parameters, such as the number of training epochs, by studying accuracy as training progresses. Second, we study the influence of several important hyper-parameters on the development dataset; for example, for the NN multi-view learning model, the corpus weight ratio (Section 5) plays an important role in performance.
Effect of batch size and dropout. The batch size affects the speed of training convergence and the final accuracies of the neural models, and the dropout rate has been shown to significantly influence performance. We investigate the effects of these two hyper-parameters by adopting a corpus weight ratio of 1:1 (all the CTB training data is used, while the same amount of PD data is sampled randomly), plotting the accuracies of the neural multi-view learning model against the number of training epochs under various combinations of the dropout rate d and batch size b. For the stacking model, we use b=100 for the PD sub-model. The results are shown in Figure 5, where the two dashed lines at the top at epoch 30 correspond to a dropout rate of 20%, the two solid lines in the middle to a zero dropout rate, and the two dotted lines at the bottom to a dropout rate of 50%. Without dropout, performance increases at first, but then decreases as the number of training epochs grows beyond 10. This indicates that the NN models can overfit the training data without dropout. However, when a 50% dropout rate is used, the initial performance is significantly worse, which implies that a 50% dropout rate can be too large and leads to underfitting. As a result, we choose a dropout rate of 20% for the remaining experiments, which strikes a balance between overfitting and underfitting. Figure 5 also shows that the batch size has a relatively small influence on the accuracies, which varies with the dropout rate. We simply choose a batch size of 1 for the remaining experiments according to the performance at the 20% dropout rate.
Effect of corpus weight ratio. Figure 6 shows the effects of different corpus weight ratios. In particular, a corpus weight ratio of 1:0.2 yields relatively low accuracies, likely because it makes use of the least amount of PD data. The ratios of 1:1 and 1:4 give comparable performance. We choose the former for our final tests because it is a much faster choice.

Final Results
Table 2 shows the final results on the CTB test data. We list the results of the stacking method of Jiang et al. (2009) as re-implemented by Li et al. (2015), and the CRF multi-view method reported by Li et al. (2015). We adopt pairwise significance tests (Collins et al., 2005) when comparing results between two different models.

Stacking. For baseline tagging using only CTB, the NN model achieves 94.24, slightly higher than the CRF baseline (94.10). The NN stacking model integrating PD data achieves performance (94.74) comparable with the CRF stacking model (94.81). Compared with the NN baseline, the NN stacking model boosts performance from 94.24 to 94.74, which is significant at the confidence level $p < 10^{-5}$. This demonstrates that the neural network model can utilize the one-best predictions of the PD model for the CTB task as effectively as the discrete stacking method of Jiang et al. (2009).
One advantage of NN stacking compared with the discrete stacking method is that it can directly leverage the features of the PD model for CTB tagging. Comparison between feature-level stacking and one-best-output-level stacking within the NN stacking model shows that the former gives significantly higher results, namely 95.01 vs. 94.74, at the confidence level $p < 10^{-3}$.
A further advantage of NN stacking is that it allows the PD model to be fine-tuned as an integral sub-model during CTB training. This is not possible for the discrete stacking model, because the outputs of the PD model are used as atomic features in the stacked CTB model rather than as a neural layer through which gradients can be propagated. By fine-tuning the PD sub-model, the performance is further improved from 95.01 to 95.32, significant at the confidence level $p < 10^{-3}$. The final NN stacking model thus improves over the NN baseline from 94.24 to 95.32. This improvement is significantly larger than that obtained by discrete stacking, which improves over the discrete baseline from 94.10 to 94.74. The final accuracy of the NN stacking model is also higher than that of the CRF stacking model, namely 95.32 vs. 94.81, at the confidence level $p < 10^{-3}$. This shows that neural stacking is a preferred choice for stacking.
Multi-view training. With respect to the multi-view training method, the NN model improves over the NN baseline from 94.24 to 95.40, a margin of +1.16, which is higher than the +0.90 brought by the discrete method of Li et al. (2015) over its baseline (from 94.10 to 95.00). The NN multi-view training method gives relatively higher improvements than the NN stacking method. This is consistent with the observation of Li et al. (2015), who showed that discrete label-coupling training gives slightly better improvements than discrete stacking. The final accuracy of NN multi-view training is also higher than that of its CRF counterpart, namely 95.40 vs. 95.00, at the confidence level $p < 10^{-3}$. The difference between the final NN multi-view training result of 95.40 and the final NN stacking result is not significant.

Integration. The flexibility of the NN models further allows both stacking (on the input) and multi-view learning (on the output) to be integrated.

Speed Test
We compare the efficiency of neural and discrete multi-view training by running our models and the model of Li et al. (2015) with default configurations on the CTB5 training data. The CRF baseline is adapted from Li et al. (2015). All the systems are implemented in C++ and run on an Intel E5-1620 CPU. The results are shown in Table 3. The NN baseline model is slower than the CRF baseline model, due to the higher computation cost of a deep neural network on a CPU. Compared with the CRF baseline, the CRF multi-view model is significantly slower because of its large feature set and multi-label search space. However, the NN multi-view model has almost the same time cost as the NN baseline, and is much more efficient than its CRF counterpart. This shows the efficiency advantage that the NN multi-view model gains from parameter sharing and output splitting.

Related Work
Early research on heterogeneous annotations focused on annotation conversion. For example, Gao et al. (2004) proposed a transformation-based method to convert the annotation style of one word segmentation corpus to that of another. Manually designed transformation templates are used, which makes it difficult to generalize the method to other tasks and treebanks. Jiang et al. (2009) described a stacking-based model for heterogeneous annotations, using a pipeline to integrate the knowledge of one corpus into another. Sun and Wan (2012) proposed a structure-based stacking model, which makes use of structured features such as sub-words for model combination; this feature integration is stronger than that of Jiang et al. (2009). Johansson (2013) introduced path-based feature templates for using one parser to guide another. In contrast to the above discrete methods, our neural stacking method offers further feature integration by directly connecting the feature layer of the source tagger with the input layer of the target tagger. It also allows fine-tuning of the source tagger. As one of the reviewers mentioned, two extensions of CRFs, dynamic CRFs (Sutton et al., 2004) and hidden-state CRFs (Quattoni et al., 2004), can also perform similar deep integration and fine-tuning.
For multi-view training, Johansson (2013) used a shared feature representation along with a separate individual feature representation for each treebank. Qiu et al. (2013) proposed a multi-task learning model to jointly predict two labelsets given an input sentence. The joint model uses the union of baseline features for each labelset, without considering additional features to capture the interaction between the two labelsets. Li et al. (2015) improved upon this method by using a tighter integration between the two labelsets, treating the Cartesian product of the base labels as a single combined labelset and exploiting joint features over the two labelsets. Though capturing label interaction, their method suffers a speed penalty from the sharply increased search space. In contrast to their methods, our neural approach enables parameter sharing in the hidden layers, thereby modeling label interaction without directly combining the two output labelsets. This leads to a lean model with almost the same time efficiency as a single-label baseline.
Recently, Zhang and Weiss (2016) proposed a stack-propagation model for learning a stacked pipeline of POS tagging and dependency parsing. Their method is similar to our neural stacking in fine-tuning the stacked module that yields features for the target model. However, their multi-task learning is defined on heterogeneous tasks, whereas ours is defined on heterogeneous treebanks.

Conclusion
We investigated two methods for utilizing heterogeneous annotations in neural network models, showing that each has its own advantages compared to its discrete counterpart. In particular, neural stacking allows tighter feature integration compared to discrete stacking, and neural multi-view training is free from the feature engineering and efficiency constraints of its discrete counterpart. On the standard CTB test set, the neural method gives the best integration effect, with the multi-view training model enjoying the same speed as its single-treebank baseline.