A Progressive Learning Approach to Chinese SRL Using Heterogeneous Data

Previous studies on Chinese semantic role labeling (SRL) have concentrated on a single semantically annotated corpus, but the training data in any single corpus is often limited, and the other existing semantically annotated corpora for Chinese SRL are scattered across different annotation frameworks. Data sparsity therefore remains a bottleneck. This situation calls for larger training datasets, or for effective approaches that can take advantage of highly heterogeneous data. In this paper, we focus mainly on the latter, that is, improving Chinese SRL by using heterogeneous corpora together. We propose a novel progressive learning model which augments the Progressive Neural Network with Gated Recurrent Adapters. The model can accommodate heterogeneous inputs and effectively transfer knowledge between them. We also release a new corpus, Chinese SemBank, for Chinese SRL. Experiments on CPB 1.0 show that our model outperforms state-of-the-art methods.


Introduction
Semantic role labeling (SRL) is one of the fundamental tasks in natural language processing because of its important role in information extraction (Bastianelli et al., 2013), statistical machine translation (Aziz et al., 2016; Xiong et al., 2012), and so on.
However, state-of-the-art performance of Chinese SRL is still far from satisfactory, and data sparsity has been a persistent bottleneck.
Figure 1: Sentences from (a) CPB and (b) our heterogeneous dataset. In CPB, each predicate (e.g., 修改) has a specific set of core roles given with numbers (e.g., Arg0), while our dataset uses a different semantic role set, and all roles are non-predicate-specific.
1 http://www.klcl.pku.edu.cn/ShowNews.aspx?id=156
To mitigate data sparsity, models incorporating heterogeneous resources have been introduced to improve Chinese SRL performance (Wang et al., 2015; Guo et al., 2016). The heterogeneous resources introduced by these models include other semantically annotated corpora with annotation schemas different from that used in PropBank, and even corpora of a different language. The challenge lies in the fact that these newly introduced resources are heterogeneous in nature: they do not share the same tagging schema, semantic role set, syntactic tag set or domain. For example, Wang et al. (2015) introduced a heterogeneous dataset, Chinese NetBank, by pretraining word embeddings. Specifically, they first train an LSTM RNN model on NetBank, then initialize a new model with the pretrained embeddings obtained from NetBank, and finally train it on CPB. Chinese NetBank (Yulin, 2007) is also a corpus annotated with semantic roles, but it uses a very different role set and annotation schema. Wang's method can conveniently inherit knowledge acquired from other resources, but only at the word representation level, missing the more generalized semantic meanings in higher hidden layers. A two-pass training approach has also been proposed to use corpora of two languages, but a few non-common roles are ignored in the first pass. Guo et al. (2016) proposed a unified neural network model for SRL and relation classification (RC). It can learn the two tasks at the same time, but cannot filter out harmful features learned in incompatible tasks.
Recently, the Progressive Neural Network (PNN) model was proposed by Rusu et al. (2016) to transfer learned reinforcement learning policies from one game to another, or from simulation to a real robot. PNN "freezes" learned parameters once it starts to learn a new task, and it uses lateral connections, namely adapters, to access previously learned features.
Inspired by the PNN model, we propose a progressive learning model for Chinese semantic role labeling in this paper. In particular, we extend the model with Gated Recurrent Adapters (GRA). Since the standard PNN takes pixels as input and produces policies as output, it is not directly suitable for the SRL task we focus on. Moreover, to handle long sentences in the corpus, we enhance the adapters with internal memories, and with gates to keep the gradient stable. The contributions of this paper are threefold:
1. We reconstruct PNN columns with bidirectional LSTMs to introduce heterogeneous corpora to improve Chinese SRL. The architecture can also be applied to a wider range of NLP tasks, such as event extraction and relation classification.
2. We further extend the model with GRA to remember and take advantage of what has been transferred, thus improving the performance on long sentences.
3. We also release a new corpus, Chinese SemBank, which was annotated with a schema different from that used in CPB. We hope that it will be helpful for future work on SRL tasks.
We use our new corpus as a heterogeneous resource, and evaluate the proposed model on the benchmark dataset CPB 1.0. The experiment shows that our approach achieves a 79.67% F1 score, significantly outperforming existing state-of-the-art systems (Section 5).

Heterogeneous Corpora for Chinese SRL
In this paper, we provide a new SRL corpus, Chinese SemBank (CSB), and use it as an example of heterogeneous data in our experiments. In this section, we first briefly introduce the corpus, then compare it to existing corpora. Sentences in CSB come from various sources, including online articles and news. The vision of this project is to build a very large and complete Chinese semantic corpus in the future. Currently, it focuses only on the predicate-argument structures in a sentence, without annotation of temporal relations and coreference. CSB differs from the commonly used dataset CPB in the following aspects:
• In terms of predicates, CSB takes a wider range of predicates into account. We annotated not only common verbs, but also nominal verbs, as NomBank does, and state words, whereas CPB only annotates common verbs as predicates.
• In terms of semantic roles, CSB has a more fine-grained semantic role set. There are 31 roles defined in five types (as Table 1 shows), whereas CPB has 23 roles in total, including core roles and non-core roles.
• CSB does not have any pre-defined frames for predicates because all roles are set to be non-predicate-specific. The reason for not defining frames is that frames may lead to inconsistencies in labels. For example, according to Chinese verb formation theory (Sun et al., 2009), in CPB, an agent of a verb is often marked as its Arg0, but not all Arg0 are agents. Therefore, roles are defined for predicates with similar syntactic and semantic regularities, rather than for a single predicate.
Two direct benefits of using stand-alone non-predicate-specific roles are as follows. First, the meanings of all semantic roles can be directly inferred from their labels. For instance, the roles of things that people are talking about (谈) or looking at (看) are labeled as 内容/content, because verbs like 谈 and 看 are often followed by an object. Second, we can easily annotate sentences with new predicates without defining new frame files.

Other Corpora for Chinese SRL
Other popular semantic role labeling corpora include Chinese NomBank (Xue, 2006) and Peking University Chinese NetBank (Yulin, 2007). NomBank, often used as a complement to PropBank, annotates nominal predicates and semantic roles according to a semantic schema similar to that of PropBank. Peking University Chinese NetBank was created by adding a semantic layer to Peking University Chinese TreeBank (Zhou et al., 1997). Like CSB, it uses only non-predicate-specific roles, but its role set is smaller, with 20 roles.

Challenges in Inheriting Knowledge from Heterogeneous Corpora
Although there are many annotated corpora for Chinese SRL, as mentioned in the previous section, most of them are quite small compared to those for English. Data sparsity remains a bottleneck. This situation calls for larger training datasets, or for effective approaches that can take advantage of very heterogeneous datasets. In this paper, we focus on the second option, that is, improving Chinese SRL by using heterogeneous corpora together within one model.
We consider the combination of the standard benchmark, the CPB 1.0 dataset (Xue and Palmer, 2003), with the new corpus, CSB, because there are many differences between them, as discussed in Section 2. Consequently, a number of challenges arise for this task. We describe them below.
Inheriting from Different Schema and Role Sets. CPB was annotated with PropBank-style frames and roles, whereas Chinese FrameNet uses its own frames and roles, and our dataset has no frame files and uses a different role set. Therefore, it is hard to find explicit mappings or hierarchical relationships among their role sets, or to decide which system is better, especially when there are more than two resources.
Inheriting from Different Domain/Genre. The datasets mentioned above are composed of sentences from various sources, including news and stories. However, it is well known that adding data of a very different genre to the training data may hurt parser performance (Bikel, 2004). Therefore, we also need to deal with the domain adaptation problem when using heterogeneous data. In other words, the proposed approach should be robust to harmful features learned on incompatible datasets. It should also accommodate potentially different model structures and inputs in the procedure of knowledge fusion.
Inheriting from Different Syntactic Annotation. Unlike for English, previous work (Ding and Chang, 2009; Sun et al., 2009) on the Chinese SRL task often uses both correct segmentation and part-of-speech tagging, and even treebank gold-standard parses (Xue, 2008), as features. But some corpora, like CPB and NetBank, do not share the same PoS tag set, and others, like CSB, have no correct PoS tagging or gold treebank parses at all. In real application scenarios, it is also more convenient to use automatic PoS tagging instead of gold-standard tagging on large datasets, as it can be obtained quickly. So, to deal with the absence of syntactic features, we adopt automatic PoS tagging when training on CSB in this work.
Some previous techniques, such as finetuning after pretraining (Wang et al., 2015) and multi-task learning (Guo et al., 2016), have been used to deal with these challenges. Though they can also leverage knowledge from different domains, they have the following drawbacks. Finetuning cannot avoid catastrophic forgetting, because learned parameters, whether embeddings or other hidden weights, are tuned after the model has been initialized. Multi-task learning avoids forgetting by randomly selecting a task to learn at each iteration, but it cannot ignore previously learned harmful features, because some features are learned in shared layers. Therefore, to address the above-mentioned challenges, we introduce progressive learning, which we believe is more suitable for the task.

Progressive Learning Approach
We propose a progressive learning approach which is well suited to combining heterogeneous SRL data for several reasons. First, it can accommodate dissimilar inputs with different schemas, syntactic information and domains, because it allows the models for heterogeneous resources to be extremely different, for example in network structure, width and learning rate. Second, it is immune to forgetting because it freezes learned weights, and it can leverage prior knowledge via lateral connections. Third, the lateral connections can be extended with a recurrent structure and a gate mechanism to handle the forgetting problem over long distances.
Our model is mainly inspired by Rusu et al. (2016), who proposed progressive neural networks for a wide variety of reinforcement learning tasks (e.g., Atari games and robot simulation). In their case, inputs are pixels and outputs are learned policies, and each column, consisting of simple layers and convolutional layers, is trained to solve a particular Markov Decision Process. In our case, inputs are sentences annotated using different syntactic tagsets and outputs are semantic role sequences. So we change the structure of the columns to recurrent neural networks with LSTM, similar to the model proposed by Wang et al. (2015). Below we first introduce the basic progressive neural network architecture, then describe our model, PNN with gated recurrent adapters.

Progressive Neural Networks
$\Theta_1$ denotes the parameters to be learned in the first column. When switching to a second corpus, the network "freezes" the parameters $\Theta_1$ and randomly initializes a new column with parameters $\Theta_2$, together with several lateral connections between the two columns, so that layer $h^2_i$ can receive input from both $h^2_{i-1}$ and $h^1_{i-1}$. In this straightforward manner, progressive neural networks can make use of columns with any structure, or compile lateral connections in an ensemble setting. More generally, we calculate the output of the $i$th layer in the $k$th column, $h^k_i$, by:

$$h^k_i = f\Big(W^k_i h^k_{i-1} + \sum_{j<k} U^{(k:j)}_i h^j_{i-1}\Big) \qquad (1)$$

where $W^k_i \in \mathbb{R}^{n^k_i \times n^k_{i-1}}$ is the weight matrix of layer $i$ of column $k$, $U^{(k:j)}_i \in \mathbb{R}^{n^k_i \times n^j_{i-1}}$ are the lateral connections that transfer information from layer $i-1$ of column $j$ to layer $i$ of column $k$, and $h_0$ is the input of the network. $f$ can be any activation function, such as an element-wise non-linearity. The bias term is omitted in the equation.
Adapters. Under the implicit assumption that there is some "overlap" between the first task and the second task, the pretrain-and-finetune learning paradigm is effective, as only a slight adjustment to the parameters is needed to learn new features. Progressive networks also have the ability to transfer knowledge from previous tasks to improve convergence speed. On the one hand, the model reuses previously learned features from the left columns via lateral connections (i.e., adapters). On the other hand, new features can be learned by adding more columns incrementally. Moreover, when the "overlap" between two tasks is small, lateral connections can filter out harmful features by sigmoid functions. So in practice, the output of the adapters can also be passed through a sigmoid non-linearity (Equation 2), in which $A^{(k:j)}_i$ is a matrix to be learned.
We treat Equation 2 as one of the baseline settings in the experiments.
Figure 3: Each column is a stacked bidirectional LSTM RNN model. Two columns are connected by GRAs. There are three gates in each GRA: $g_i$, $g_f$, and $g_o$. The input gate $g_i$ and the forget gate $g_f$ can also be coupled as one uniform gate, that is, $g_i = 1 - g_f$.
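The layer computation with lateral connections described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes plain feed-forward layers with tanh as $f$, the helper names (`matvec`, `pnn_layer`, `sigmoid_adapter`) and all sizes are hypothetical, and the sigmoid adapter is assumed to take the simple form of a sigmoid applied to a learned matrix times the frozen column's output.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(items) for items in zip(*vectors)]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def pnn_layer(W, h_own, laterals, f=math.tanh):
    """h_i^k = f(W h_{i-1}^k + sum_j U^(k:j) h_{i-1}^j), bias omitted.

    laterals is a list of (U, h_other) pairs, one per frozen column j < k.
    """
    total = matvec(W, h_own)
    for U, h_other in laterals:
        total = vadd(total, matvec(U, h_other))
    return [f(x) for x in total]

def sigmoid_adapter(A, h_other):
    """Assumed adapter form: squash A h with a sigmoid before the lateral link."""
    return [1.0 / (1.0 + math.exp(-x)) for x in matvec(A, h_other)]

rng = random.Random(0)
h1_prev = [0.5, -0.2, 0.1]      # layer i-1 output of frozen column 1 (width 3)
h2_prev = [0.3, 0.7]            # layer i-1 output of new column 2 (width 2)
W = rand_matrix(4, 2, rng)      # n_i^2 x n_{i-1}^2
U = rand_matrix(4, 3, rng)      # n_i^2 x n_{i-1}^1
A = rand_matrix(3, 3, rng)      # adapter matrix (hypothetical size)

h2 = pnn_layer(W, h2_prev, [(U, h1_prev)])                       # linear lateral link
gated = sigmoid_adapter(A, h1_prev)                              # adapter output
h2_adapted = pnn_layer(W, h2_prev, [(U, gated)])                 # lateral link via adapter
```

Note that the two columns have different widths; only the shapes of `W` and `U` need to agree with the layers they connect, which is what lets columns for different corpora differ in structure.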

PNN with Gated Recurrent Adapter for Chinese SRL
We reconstruct PNN with bidirectional LSTM to solve SRL problems. Our model is illustrated in Fig. 3. First, each column in the PNN architecture is a stacked bidirectional LSTM RNN, rather than convolutional neural networks, because inputs are sentences not pixels, and bi-LSTM RNN has proved powerful for Chinese SRL (Wang et al., 2015).
Second, we enhance the adapter with a recurrent structure and a gate mechanism, because simple Multi-Layer Perceptron (MLP) adapters have a limitation: their weights are learned word by word, independently. For tasks like transferring reinforcement learning policies, this is enough because there are few dependencies among actions. But in the NLP domain, things are different. Therefore, we add internal memory to the adapters to help them remember what has been inherited from the heterogeneous resource.
Third, to keep the gradient stable and to balance long-term and short-term memory, we introduce a gate mechanism, which has been widely used in RNN models. We call the new adapter the Gated Recurrent Adapter (GRA).
Formally, let $h^{j}_{i-1}$ denote the output vector of layer $i-1$ in column $j$. The output vector is multiplied by a learned matrix $W_a$, initialized with small random values, before going to the GRA. Its role is to adjust for the different scales of the different inputs and to reduce the dimensionality. The candidate output is then computed from this projected vector and from $a_{t-1}$, the output of the adapter at the previous time step, where $U_a$ is a weight matrix to learn. The output of an adapter $a^j_t$ of layer $i$ at time $t$ is obtained by combining the candidate output with the inner memory through the gates $g_i$, $g_f$ and $g_o$, whose weight matrices are parameters to learn. $d_{i-1}$ is the dimension of the inner memory in adapters, $\tilde{a}_t$ represents the inner state of the adapter, and $f$ is an activation function, such as tanh. The input gate and the forget gate can be coupled as a uniform gate, that is, $g_i = 1 - g_f$, to alleviate the problem of information redundancy and reduce the possibility of overfitting (Greff et al., 2015).
Finally, we calculate the output of the next layer $i$ of column $k$ from the output of layer $i-1$ of the same column and the outputs of the adapters, using the parameters of the $i$th layer.
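To make the idea of an adapter with inner memory and gates concrete, the following is a rough sketch under stated assumptions. It uses LSTM-style gating, which may differ from the exact GRA equations, and every name in it (`GatedRecurrentAdapter`, `W_a`, the per-gate matrices, the dimensions) is hypothetical. It shows a projection of the frozen column's output, a candidate computed from that projection and the previous adapter output, and input, forget and output gates, with the option of coupling the input and forget gates as $g_i = 1 - g_f$.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class GatedRecurrentAdapter:
    """LSTM-style adapter sketch (an assumed form, not the paper's exact equations)."""

    def __init__(self, in_dim, mem_dim, seed=0, coupled=True):
        rng = random.Random(seed)
        self.mem_dim = mem_dim
        self.coupled = coupled
        self.W_a = rand_matrix(mem_dim, in_dim, rng)   # input projection
        names = ["cand", "i", "o"] + ([] if coupled else ["f"])
        self.W = {n: rand_matrix(mem_dim, mem_dim, rng) for n in names}
        self.U = {n: rand_matrix(mem_dim, mem_dim, rng) for n in names}

    def _pre(self, name, x, a_prev):
        """Pre-activation W x + U a_{t-1} for one gate or for the candidate."""
        return [p + q for p, q in zip(matvec(self.W[name], x),
                                      matvec(self.U[name], a_prev))]

    def run(self, hidden_states):
        """Map the frozen column's outputs (one vector per word) to adapter outputs."""
        a = [0.0] * self.mem_dim        # adapter output at the previous time step
        memory = [0.0] * self.mem_dim   # inner memory
        outputs = []
        for h in hidden_states:
            x = matvec(self.W_a, h)
            cand = [math.tanh(v) for v in self._pre("cand", x, a)]
            g_i = [sigmoid(v) for v in self._pre("i", x, a)]
            g_o = [sigmoid(v) for v in self._pre("o", x, a)]
            if self.coupled:
                g_f = [1.0 - g for g in g_i]    # coupled gates: g_i = 1 - g_f
            else:
                g_f = [sigmoid(v) for v in self._pre("f", x, a)]
            memory = [gi * c + gf * m for gi, c, gf, m in zip(g_i, cand, g_f, memory)]
            a = [go * math.tanh(m) for go, m in zip(g_o, memory)]
            outputs.append(a)
        return outputs

frozen_states = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5], [0.1, 0.1, 0.1]]
adapter = GatedRecurrentAdapter(in_dim=3, mem_dim=2)
adapter_outputs = adapter.run(frozen_states)
```

Because the memory is carried across time steps, the adapter output for a word depends on what was transferred for earlier words, which a word-by-word MLP adapter cannot capture.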

Training Criteria
We adopt the sentence tagging approach of Wang et al. (2015): because the words in a sentence may be closely related to each other, labeling each word independently is inappropriate. The sentence tagging approach only considers valid transition paths of tags when calculating the cost. For example, when using the IOBES tagging schema, the tag transition from I-Arg0 to B-Arg0 is invalid, and the transition from I-Arg0 to I-Arg1 is also invalid because the type of the role changes inside the semantic chunk. For each task (column), the log likelihood of sentence $x$ and its correct path $y$ is

$$\log p(y \mid x, \Theta) = \log \frac{\exp\left(\sum_{t=1}^{N} o_{t, y_t}\right)}{\sum_{z} \exp\left(\sum_{t=1}^{N} o_{t, z_t}\right)}$$

where $N$ is the number of words and $o_t \in \mathbb{R}^M$ is the output of the last layer at time $t$. $y_t = k$ means that the $t$th word has the $k$th semantic role label, and $z$ ranges over all the valid paths of tags. The negative log likelihood of the whole training set $D$ is

$$J(\Theta) = -\sum_{(x, y) \in D} \log p(y \mid x, \Theta)$$

We minimize $J(\Theta)$ using stochastic gradient descent to learn the network parameters $\Theta$. When testing, the best prediction for a sentence can be found using the Viterbi algorithm.
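The sentence-level objective and Viterbi decoding can be illustrated with a small self-contained sketch. It is an assumption-laden toy, not the authors' code: the tag set is a hypothetical two-role IOBES inventory, the scores are random stand-ins for the last-layer outputs $o_t$, and the start and end constraints (`valid_start`, `valid_end`) are an added assumption beyond the transition rules named in the text.

```python
import math
import random

TAGS = ["O", "B-Arg0", "I-Arg0", "E-Arg0", "S-Arg0",
        "B-Arg1", "I-Arg1", "E-Arg1", "S-Arg1"]

def split_tag(tag):
    """Return (prefix, role); the outside tag O has no role."""
    return ("O", None) if tag == "O" else tuple(tag.split("-", 1))

def valid_transition(prev, cur):
    (p, p_role), (c, c_role) = split_tag(prev), split_tag(cur)
    if p in ("B", "I"):                 # inside a chunk: must continue the same role
        return c in ("I", "E") and c_role == p_role
    return c in ("O", "B", "S")         # after O, E or S: O or a new chunk

def valid_start(tag):                   # assumption: a sentence cannot start mid-chunk
    return split_tag(tag)[0] in ("O", "B", "S")

def valid_end(tag):                     # assumption: a sentence cannot end mid-chunk
    return split_tag(tag)[0] in ("O", "E", "S")

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(scores):
    """Forward algorithm: log of the summed exp-scores over all valid tag paths."""
    alpha = {t: scores[0][k] for k, t in enumerate(TAGS) if valid_start(t)}
    for o_t in scores[1:]:
        new_alpha = {}
        for k, cur in enumerate(TAGS):
            incoming = [a for prev, a in alpha.items() if valid_transition(prev, cur)]
            if incoming:
                new_alpha[cur] = o_t[k] + logsumexp(incoming)
        alpha = new_alpha
    return logsumexp([a for t, a in alpha.items() if valid_end(t)])

def log_likelihood(scores, gold):
    """log p(y | x): score of the gold path minus the log partition."""
    gold_score = sum(o_t[TAGS.index(tag)] for o_t, tag in zip(scores, gold))
    return gold_score - log_partition(scores)

def viterbi(scores):
    """Highest-scoring valid tag path."""
    best = {t: (scores[0][k], [t]) for k, t in enumerate(TAGS) if valid_start(t)}
    for o_t in scores[1:]:
        new_best = {}
        for k, cur in enumerate(TAGS):
            cands = [(s, path) for prev, (s, path) in best.items()
                     if valid_transition(prev, cur)]
            if cands:
                s, path = max(cands, key=lambda item: item[0])
                new_best[cur] = (s + o_t[k], path + [cur])
        best = new_best
    finals = [(s, path) for t, (s, path) in best.items() if valid_end(t)]
    return max(finals, key=lambda item: item[0])[1]

rng = random.Random(1)
scores = [[rng.uniform(-1, 1) for _ in TAGS] for _ in range(3)]  # o_t for 3 words
gold = ["B-Arg0", "E-Arg0", "O"]
nll = -log_likelihood(scores, gold)     # this sentence's contribution to J
best_path = viterbi(scores)
```

Restricting both the partition sum and the decoder to valid paths means the model never assigns probability to, or predicts, sequences such as I-Arg0 followed by B-Arg0.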

Experiment Settings
To compare our approach with others, we designed four experimental setups:
(1) A simple LSTM setup on CSB and CPB with automatic PoS tagging. Since CPB is about twice as large as the new corpus, we need to know whether CSB can be used for training good semantic parsers and how much information a model can learn from CSB. So we conduct this experiment to provide two baselines, for CSB and CPB respectively. In this setup we train and evaluate a one-column LSTM model on CSB.
(2) A simple LSTM setup on CPB with word embeddings pretrained on CSB (marked as bi-LSTM+CSB embedding). Previous work found that using pretrained word embeddings can improve performance on Chinese SRL (Wang et al., 2015). So we conduct this experiment to compare with methods using embeddings trained on large-scale unlabeled data such as Gigaword, and on NetBank.
(3) A two-column finetuning setup where we pretrain the first column on CSB and finetune both columns on CPB. Finetuning is a traditional method for continual learning scenarios, but its disadvantage is that learned features are gradually forgotten when the model adapts to new tasks. We design this experiment to assess this empirically. The model uses the same network structure as PNN does, but it does not "freeze" the parameters in the first column when tuning the two columns.
(4) A progressive network setup where we train column 1 on CSB, then train column 2 and the adapters on CPB. We conduct this experiment to evaluate the proposed model and compare it to all previous methods. To further analyze the effectiveness of the new adapter structure, we also conduct an experiment for progressive nets with GRA.
We apply grid search to explore hyper-parameters, including learning rates and layer widths.
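The practical difference between the finetuning setup (3) and the progressive setup (4) is which parameter groups receive updates during the second training stage. The toy sketch below illustrates only that point. The parameter groups, gradient values and the `sgd_step` helper are all hypothetical, and real training would of course compute gradients from the CPB loss.

```python
def sgd_step(params, grads, lr, frozen=()):
    """One SGD update; parameter groups named in `frozen` are left untouched."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)   # frozen group: copied unchanged
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated

# Toy parameters after stage one (column 1 pretrained on the first corpus).
params = {"column1": [0.5, -0.3], "column2": [0.1, 0.2], "adapter": [0.0, 0.4]}
# Toy gradients from a batch of the second corpus.
grads = {"column1": [1.0, 1.0], "column2": [1.0, -1.0], "adapter": [0.5, 0.5]}

# Setup (3): two-column finetuning updates every group, including column 1.
finetuned = sgd_step(params, grads, lr=0.1)
# Setup (4): progressive learning freezes column 1 and trains column 2 and adapters.
progressive = sgd_step(params, grads, lr=0.1, frozen=("column1",))
```

In the progressive case the first column's features stay exactly as learned on the first corpus, so they cannot be forgotten; in the finetuning case they drift with every update.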

Preprocessing.
We follow the same data setting as previous work (Xue, 2008; Sun et al., 2009), which divides the CPB dataset into three parts: 648 files, from chtb_081.fid to chtb_899.fid, are the training set; 40 files, from chtb_041.fid to chtb_080.fid, are the development set; 72 files, from chtb_001.fid to chtb_040.fid and from chtb_900.fid to chtb_931.fid, are used as the test set.
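The file ranges above translate directly into a small lookup. The sketch below is only an illustration of that partition; the function name `cpb_split` is hypothetical, and it assumes file names follow the `chtb_NNN.fid` pattern quoted in the text.

```python
import re

def cpb_split(filename):
    """Map a CPB file name such as 'chtb_081.fid' to 'train', 'dev' or 'test'."""
    match = re.fullmatch(r"chtb_(\d+)\.fid", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    file_id = int(match.group(1))
    if 81 <= file_id <= 899:
        return "train"
    if 41 <= file_id <= 80:
        return "dev"
    if 1 <= file_id <= 40 or 900 <= file_id <= 931:
        return "test"
    raise ValueError(f"file id outside the documented ranges: {file_id}")

example_splits = {name: cpb_split(name)
                  for name in ["chtb_001.fid", "chtb_041.fid", "chtb_081.fid", "chtb_931.fid"]}
```

Not every id inside a range necessarily corresponds to an existing file, so the file counts given in the text cannot be derived from the ranges alone.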
We also divide the shuffled CSB corpus into three sets with similar partition ratios. Currently, there are 10634 sentences in CSB, so 8900 samples are used as the training set, 500 samples as the development set and the remaining 965 samples as the test set. We use Stanford Parser for PoS tagging.

Comparison to Methods without Using Heterogeneous Data
Table 3 summarizes the SRL performance of previous benchmark methods and of the experiments described above. Collobert and Weston only conducted their experiments on an English corpus, but their approach has been implemented and tested on CPB by Wang et al. (2015), so we also include their result here for comparison. We can make several observations from these results. Our approach significantly outperforms Sha et al. (2016) (Wilcoxon Signed Rank Test, p < 0.05), even without using GRA. This result demonstrates the ability of our model to capture underlying similarities between heterogeneous SRL resources.

Performance on Chinese SemBank
Comparison to Methods Using Heterogeneous Resources
The results of methods using external language resources are also presented in Table 3. Not surprisingly, the overall best F1 score, 79.67%, is achieved by the progressive nets with GRAs. Furthermore, as shown in Fig. 4, PNN with GRA performs better on longer sentences, which is consistent with our expectation. Without GRA, the F1 score drops by 0.37 percentage points to 79.30%, confirming that the gated recurrent adapter structure is more suitable for our task because it can remember what has been transferred in previous time steps.
Compared to progressive learning methods, the finetuning method does not perform well even with the same network structure (Two-column finetuning), but it is still better than simply pretraining word embeddings (bi-LSTM+CSB embedding). This confirms the effectiveness of the multi-column learning structure, which adds capacity to the model by adding new columns. Our PNN model achieves a 79.30% F1 score, outperforming finetuning by 0.88 percentage points, and pretraining embeddings by an even larger margin.
To sum up, not only network structures but also learning methods (finetuning/multitask/progressive) can influence the performance of knowledge transfer. According to the results, our PNN approach is more effective than the others because it is immune to forgetting and robust to harmful features, and GRA is more suitable for our task than simple adapters.

Related Work

Chinese Semantic Role Labeling
The concept of Semantic Role Labeling was first proposed by Gildea and Jurafsky (2002). Previous work on Chinese SRL mainly focused on how to improve SRL on a single corpus. Approaches fall into two categories: feature-based machine learning approaches and neural-network-based approaches.
Using feature-based methods, Sun and Jurafsky (2004) did the preliminary work and achieved promising results without using any large annotated corpus. Following Xue and Palmer (2003), more complete and systematic research on Chinese SRL was done (Xue and Palmer, 2005; Chen et al., 2006; Ding and Chang, 2009; Yang et al., 2014). Neural network methods do not rely on handcrafted features.
For Chinese SRL, Wang et al. (2015) proposed a bidirectional LSTM RNN model. Based on their work, Sha et al. (2016) proposed a quadratic optimization method as a postprocessing module and further improved the result.

Learning with Heterogeneous Data
In this paper, we mainly focus on learning with heterogeneous semantic resource for Chinese SRL. Wang et al. (2015) introduced heterogeneous data by using pretrained embeddings at initialization and achieved promising results. Guo et al. (2016) proposed a multitask learning method with a unified neural network model to learn SRL and relation classification task together and also achieved improvement.
Different from previous work, we propose a progressive neural network model with gated recurrent adapters to leverage knowledge from heterogeneous semantic data. Compared with previous methods, this approach is constructive rather than destructive, because it uses lateral connections to access previously learned features, which are fixed when learning new tasks. By introducing gated recurrent adapters, we further enhance our model to deal with long sentences and achieve state-of-the-art performance on Chinese PropBank.

Conclusion and Future Work
In this paper, we proposed a progressive neural network model with gated recurrent adapters to leverage heterogeneous corpora for Chinese SRL. Unlike previous methods such as finetuning, ours leverages prior knowledge via lateral connections. Experiments have shown that our model yields better performance on CPB than all baseline models. Moreover, we proposed a novel gated recurrent adapter to handle transfer on long sentences, and the experiments have shown the effectiveness of the new adapter structure.
We believe that progressive learning with heterogeneous data is a promising avenue to pursue. So in the future, we might try to combine more heterogeneous semantic data for other tasks like event extraction and relation classification, etc.
We also release the new corpus Chinese SemBank for Chinese SRL. We hope that it will be helpful in providing common benchmarks for future work on Chinese SRL tasks.